
From: Erik Wien (wien_at_[hidden])
Date: 2005-03-17 11:52:25


Rogier van Dalen wrote:
> Hi Erik!

Hi!

> I'm glad to see you've made a lot of progress in these months of
> silence. I've got a few comments for now.
>
> Of course there isn't much documentation yet, but now that the library
> is out in the open, writing a Unicode primer might be a good thing to
> do now. Issues that I don't think many programmers are aware of
> include (off the top of my head) what code points are (21 bits), what
> Unicode characters are, why you need combining characters, why UTF-32
> is not usually optimal. The library will once need these docs anyway.
> I'd gladly help out with this, though I'm not sure this would fit your
> university's requirements.

Yep, you are absolutely right. That would greatly cut down on the time
spent explaining these concepts to people here on the list. As you say,
the Boost documentation will probably need this too eventually, so it's
a good idea to do it now. You could always do it if you want to, but we
will have to write some paragraphs about this for our report too, so we
might as well do it ourselves. No need for two people to do the same
thing.
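
For what it's worth, such a primer could open with a small concrete
example rather than prose. Something along these lines (plain C++,
nothing library-specific; the numbers just illustrate why UTF-32 is
rarely optimal for storage):

    #include <cstdio>

    int main()
    {
        // "hello" is five code points; the three encoding forms
        // spend 1, 2 and 4 bytes per code point on ASCII text,
        // which is one reason UTF-32 is rarely optimal for storage.
        const char           utf8[]  = "hello";
        const unsigned short utf16[] = { 'h','e','l','l','o' };
        const unsigned int   utf32[] = { 'h','e','l','l','o' };

        std::printf("UTF-8 : %u bytes\n", (unsigned)(sizeof utf8 - 1));
        std::printf("UTF-16: %u bytes\n", (unsigned)sizeof utf16);
        std::printf("UTF-32: %u bytes\n", (unsigned)sizeof utf32);
        return 0;
    }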

> Some speculation on the Unicode database: do you really need the
> character names? Maybe you should use multi_index, probably with
> hashing. Maybe you could use Boost.Serialisation for loading the file.

We probably won't need the names, and we have been speculating about
taking them out. (The Unicode 1.0 and ISO names will probably go
regardless.) We thought about using multi_index, but (correct me if I'm
wrong) wouldn't that make it necessary to fill the database at
run-time? With serialization, the overhead would probably not be that
bad, but still.
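
To make the run-time concern concrete, here is roughly what I picture a
multi_index-based database looking like. This is only a sketch; the
record fields are made up, not our actual schema:

    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>

    // Illustrative record only; the real property set would differ.
    struct char_properties
    {
        unsigned int code_point;       // 21-bit Unicode code point
        int          general_category;
        int          combining_class;
    };

    namespace mi = boost::multi_index;

    // A single hashed index keyed on the code point gives O(1)
    // average-case lookup; further indices could be added later.
    typedef mi::multi_index_container<
        char_properties,
        mi::indexed_by<
            mi::hashed_unique<
                mi::member<char_properties, unsigned int,
                           &char_properties::code_point>
            >
        >
    > unicode_database;

Filling that container at start-up (say, by deserializing a
Boost.Serialization archive) is exactly the step I'd like to avoid, or
at least keep cheap.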

> I think that in general you would need to separate input/output from
> other Unicode processing. For example: endianness only matters when
> portably reading/writing files; IMO strings in memory should have your
> platform's endianness. (I second Thorsten's proposal of having
> utf8_string, utf16_string, utf32_string, utf_string.)
> For reading code points from files, a codecvt could be used. This can
> be fast because its virtual functions are called only once per so many
> bytes.
> I think there's an implementation floating around in the yahoo files
> section that can automatically figure out the file's encoding and
> convert to and from any endianness.

I have an idea on how to implement iostream support in the library
(I wrote about it in another mail here), but I'm not really sure if it
would work. Could you perhaps verify that?
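
In the meantime, here is the rough shape I have in mind for the codecvt
side of things. This is only a skeleton: just do_in is sketched, it
only handles one- and two-byte UTF-8 sequences, and real error handling
is left out:

    #include <locale>

    // Skeleton of a facet decoding UTF-8 bytes from a file into wide
    // code points in memory.
    class utf8_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
    {
    protected:
        result do_in(std::mbstate_t&,
                     const char* from, const char* from_end,
                     const char*& from_next,
                     wchar_t* to, wchar_t* to_end,
                     wchar_t*& to_next) const
        {
            from_next = from;
            to_next   = to;
            while (from_next != from_end && to_next != to_end)
            {
                unsigned char lead = (unsigned char)*from_next;
                if (lead < 0x80)                    // 1-byte sequence
                {
                    *to_next++ = lead;
                    ++from_next;
                }
                else if ((lead & 0xE0) == 0xC0)     // 2-byte sequence
                {
                    if (from_end - from_next < 2)
                        return partial;             // need more input
                    unsigned char trail = (unsigned char)from_next[1];
                    if ((trail & 0xC0) != 0x80)
                        return error;               // malformed trail byte
                    *to_next++ = ((lead & 0x1F) << 6) | (trail & 0x3F);
                    from_next += 2;
                }
                else
                {
                    return error;  // 3-/4-byte sequences omitted here
                }
            }
            return from_next == from_end ? ok : partial;
        }

        bool do_always_noconv() const throw() { return false; }
        int  do_encoding()      const throw() { return 0; } // variable width
    };

Imbued before opening the file, something like:

    std::wifstream in;
    in.imbue(std::locale(in.getloc(), new utf8_codecvt));
    in.open("input.txt");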

> I also think you should separate code points and Unicode characters.
> In normal situations, the user should not have to deal with code
> points. The discussion should not focus on that for now; it's an
> implementation detail. I strongly object to your
>
> typedef encoded_string<unicode_tag> unicode_string;
>
> because I think a Unicode string should contain characters. For
> example, a regular expression on Unicode strings should support level
> 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything
> less?

What exactly do you mean by the term "character"? Abstract characters?

I do agree with you on the level 2 support, though. The closer the
behaviour of a string in regex use is to what the user would normally
expect, the better.
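
Just so we are using the same vocabulary, here is a tiny example of the
distinction as I understand it (plain arrays standing in for string
contents):

    #include <cstdio>

    int main()
    {
        // The same abstract character, a-umlaut, as two different
        // code point sequences:
        const unsigned int precomposed[] = { 0x00E4 };         // 1 code point
        const unsigned int decomposed[]  = { 0x0061, 0x0308 }; // 'a' + mark

        std::printf("precomposed: %u code points\n",
                    (unsigned)(sizeof precomposed / sizeof(unsigned int)));
        std::printf("decomposed : %u code points\n",
                    (unsigned)(sizeof decomposed / sizeof(unsigned int)));

        // A code-point-level string sees lengths 1 and 2 (and compares
        // the two unequal); a character-level string would present each
        // as a single element, which is what level 2 regex support
        // effectively requires.
        return 0;
    }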

- Erik

