
Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2011-08-13 03:24:05


> > It does not provide traits for non-Unicode encodings
> > like lets say Shift-JIS or ISO-8859-8
>
>
> The library is designed to be flexible without intending to include
> every possible encodings to the library by default. The point is that
> external developers can leverage the EncodingTraits template parameter
> to implement the desired encoding *themselves*.

I understand, and I do not expect you to implement every encoding,
but some encodings, let's say Latin-1 or ASCII, should be provided
to at least serve as an example.

>
> > BTW you can't create traits for many encodings, for
> > example you can't implement traits requirements:
> >
> > http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_encoding_traits
> >
> >
> > For popular encodings like Shift-JIS or GBK...
> >
> > Homework: tell me why ;-)
>
> Perhaps you are referring to the non-roundtrip conversion of some
> Shift-JIS characters, as mentioned by Microsoft at
> http://support.microsoft.com/kb/170559.

No

> Or perhaps you mean that you can't properly encode non-Japanese
> characters into the Shift-JIS encoding

No

The problem is that, unlike variable-width UTF encodings,
which have a clear separation between the lead and
the trail code units, multi-byte encodings
like Shift-JIS or GBK have no such separation.

UTF-8 and UTF-16 are so-called
self-synchronizing encodings: you can go forward
or backward without any problem,
and even if you lose your position you
can find the next valid position in either
direction and continue.

However, with non-Unicode CJK encodings
like Shift-JIS or GBK there is no
way to go backward, because it is ambiguous,
and in order to decode the text you must
always go forward.

That is why the traits model you provided
has a conceptual flaw: it is impossible
to implement a bidirectional iterator
over most non-UTF CJK multi-byte encodings.

>
> It is always possible to add a dynamic layer on top of the static
> encoding layer but not the other way round. It shouldn't be too hard

Is the encoding traits object a member of unicode_string_adapter?
From what I have seen in the code it is not - correct me if I am wrong.

So in the current situation you can't make dynamic encoding traits.

> to write a class with virtual interfaces to call the proper template
> instance of unicode_string_adapter. But that is currently outside of
> the scope and the fundamental design will still be static regardless.
>

This is a problem because, in the typical use case, the ANSI
code page used as the default encoding is something
defined by the OS the program runs on.

> > If someone uses strings with different encodings he usually
> > knows their encoding...
> >
> > The problem is that the API is inconsistent: on Windows a narrow
> > string is in some ANSI code page, and anywhere else it is UTF-8.
> >
> > This is an entirely different problem, and such adapters don't
> > really solve it but actually make it worse...
>
> If I'm not wrong however, the wide version of strings on Windows is
> always UTF-16 encoded, am I correct?
>

Yes, but wide characters are useless for cross-platform development,
as they are UTF-16 only on Windows; on other OSes they are UTF-32.

>
> So while it's hard to construct UTF-8 string literals on Windows, it
> should still be possible by writing code that manually insert UTF-8
> code units into std::string. After all std::string does not have any
> restriction on which bytes we can manually insert into it.
>

Sorry, but IMHO this is quite an ugly solution...

> Perhaps you are talking about printing the string out to std::out, but
> that's another hard problem that we have to tackle separately.

Actually this is quite simple, even on Windows.

In the console, run chcp 65001 and then UTF-8 output
will be shown correctly.

> The unicode_string_adapter class has a well defined role which is
> to offer *code point* level access.

No problem with that, but shouldn't it then maybe be a
code_point iterator rather than a "string adapter"?

> I did plan to write another class that does the character iteration,
> but I don't have enough time to do it yet.
>

If you implement character iteration it will no longer be
a compact library, as character iteration
requires access to the Unicode database of
character properties.

And actually Boost.Locale already provides character
iteration.

http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1locale_1_1boundary.html

And Boost.Unicode should provide grapheme segmentation
with a different interface.

> But I believe you can pass the code point iterators to Mathias'
> Boost.Unicode methods to create abstract character iterators that does
> the job you want.
>

I see, this is better.

>
> I added a static method unicode_string_adapter::make_codepoint_iterator
> to do what you have requested. It accepts three code unit iterator
> parameters, current, begin, and end, so that it can iterate in both
> directions without going out of bound. Hope that is what you looking for.
>

Yes, this is better, but it should not be a static method
of the string adapter but a free algorithm.

>
> It is ok if you disagree with my approach, but keep in mind that the
> focus of this thread is just to tell whether my current proposed
> solution is good, and not to propose a better solution.
>
> Thanks.
>

I understand that, and I have pointed out the problems with the string adapter.

But I also question the usability of the library and its motivation.

Best,
  Artyom


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk