From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-03-17 18:30:51


Just some short answers; I don't have much time at the moment:

> > Of course there isn't much documentation yet, but now that the library
> > is out in the open, writing a Unicode primer might be a good thing to
> > do now. [...]
> [...] we
> will have to write some paragraphs about this for our report too, so we
> might as well do it ourselves. No need for two people doing the same thing.

Great! I look forward to seeing it.

> [...] We
> thought about using multi_index but (correct me if I'm wrong) wouldn't
> that make it necessary to fill the database at run-time? With
> serialization, the overhead would probably not be that bad, but still.
I think so, but hashing might give an enormous run-time performance
gain. I'm not particularly knowledgeable in this area; I'm just
throwing in ideas.
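
To illustrate the kind of hashing I mean, here is a minimal sketch. The
CharInfo record, the three-entry table and the function names are all
hypothetical; this is not the library's data layout. The static array
needs no loading, but note that the hash index is still built once at
run time, on first use.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical record; the real property database is not shown in this thread.
struct CharInfo {
    char32_t code_point;
    const char* name;
};

// Static data is part of the binary, so nothing has to be read from disk.
static const CharInfo kTable[] = {
    {0x0041, "LATIN CAPITAL LETTER A"},
    {0x00F4, "LATIN SMALL LETTER O WITH CIRCUMFLEX"},
    {0x0302, "COMBINING CIRCUMFLEX ACCENT"},
};

// Build the hash index once, the first time it is requested.
const std::unordered_map<char32_t, const CharInfo*>& index_by_code_point() {
    static const std::unordered_map<char32_t, const CharInfo*> index = [] {
        std::unordered_map<char32_t, const CharInfo*> m;
        for (const CharInfo& info : kTable) m.emplace(info.code_point, &info);
        return m;
    }();
    return index;
}

// Average constant-time lookup; returns nullptr for unknown code points.
const CharInfo* find_info(char32_t cp) {
    const auto& index = index_by_code_point();
    auto it = index.find(cp);
    return it == index.end() ? nullptr : it->second;
}
```

Whether this beats a sorted static table with binary search would have
to be measured.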

> I have an idea on how to implement iostream support in the library
> (Wrote it in another mail here), but I'm not really sure if it would
> work. Could you perhaps verify that?
What you're saying sounds correct to me.
http://groups.yahoo.com/group/boost/files/utf/ has utf-2003-01-12.zip.
I have no idea what its status is but it seems to implement all kinds
of UTF I/O you'll need. There's even a detect_from_bom.hpp which
appears to check for a BOM and imbue the correct codecvt.
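
For reference, the detection step alone could look roughly like the
sketch below. The Encoding enum and detect_bom are names I made up for
illustration; they are not taken from detect_from_bom.hpp, and the step
of imbuing the stream with a matching codecvt is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the leading bytes of a buffer and report which BOM, if any, is present.
Encoding detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // The UTF-32 checks come first: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the same two bytes as the UTF-16 LE BOM (FF FE).
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Encoding::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Encoding::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3)) return Encoding::Utf8;
    if (starts_with("\xFE\xFF", 2)) return Encoding::Utf16BE;
    if (starts_with("\xFF\xFE", 2)) return Encoding::Utf16LE;
    return Encoding::Unknown;
}
```

A caller would read the first four bytes of the file, pass them to
detect_bom, and pick the conversion facet based on the result. A UTF-16
LE file whose first character is U+0000 is misread as UTF-32 LE by this
ordering, which is a known ambiguity of BOM sniffing.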

> > I also think you should separate code points and Unicode characters.
> > In normal situations, the user should not have to deal with code
> > points. The discussion should not focus on that for now; it's an
> > implementation detail. I strongly object to your
> >
> > typedef encoded_string<unicode_tag> unicode_string;
> >
> > because I think a Unicode string should contain characters. For
> > example, a regular expression on Unicode strings should support level
> > 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything
> > less?
>
> What exactly do you mean by the term "character"? Abstract characters?

Yes. IMHO:
One of the goals of the Unicode library is to relieve programmers of
having to know all the ins and outs of Unicode. In my opinion, the average
programmer should not need to know about code points and normalisation
forms. Finding strings in other strings should just work, without
having to mess with normalisation forms. Starting a string with a
combining character should throw, because it's meaningless (and may
cause hard-to-diagnose errors later on). The third character in the
string "rôle" is 'l', and not either-an-l-or-a-combining-character.
Code points should be hidden away in the "advanced topics" section of
the library.

I'm so totally convinced of this I have a hard time seeing why it
should be otherwise. Do you, or anyone else, feel there is anything
obvious I'm missing?
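
To make the "rôle" example concrete, here is a minimal sketch of
grouping code points into characters. The function names are
hypothetical, and is_combining is a deliberate simplification that only
knows the Combining Diacritical Marks block (U+0300..U+036F); a real
implementation would consult the Unicode Character Database.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F counts as combining.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split a code point sequence into characters: a base code point followed
// by any combining marks. Throws if the sequence starts with a combining mark.
std::vector<std::u32string> split_characters(const std::u32string& code_points) {
    std::vector<std::u32string> characters;
    for (char32_t cp : code_points) {
        if (is_combining(cp)) {
            if (characters.empty())
                throw std::invalid_argument("string starts with a combining character");
            characters.back().push_back(cp);  // attach the mark to its base
        } else {
            characters.push_back(std::u32string(1, cp));  // start a new character
        }
    }
    return characters;
}
```

With the decomposed input U"ro\u0302le" (five code points) this yields
four characters, and the one at index 2 is U"l", whereas indexing the
code points directly would give the combining circumflex.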

> I do agree with you on the level 2 support though. The closer the
> behaviour of a string in "reg-ex use" is to what the user would normally
> expect, the better.
Exactly.

Regards,
Rogier

