Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: James Porter (porterj_at_[hidden])
Date: 2008-11-19 18:50:41
Phil Endecott wrote:
> Mutable vs. immutable strings is something that has been briefly
> discussed before. My personal preference has been for mutable strings,
> but without the O(1) random access guarantee of a std::string. I also
> considered strings where the only mutation allowed is appending, i.e.
> there's a back_insert_iterator. Why do you prefer immutable strings?
I don't have any problem with appendable strings, but mutating
mid-string can carry a heavy performance hit: in a variable-width
encoding, replacing one code point may change its length in bytes and
force everything after it to shift. You could allow it and let the user
decide whether to pay that cost, but I'm not satisfied with that yet, so
I left the strings immutable for now.
Mutability also raises questions of whether to allow mutation when using
code point iterators or raw iterators or both, and whether you should go
even further in mimicking std::string and allow random access (whether
that would be raw access or pseudo-"random access" of code points, I
don't know).
> I also have run-time and compile-time tagging. My feeling now is that
> compile-time-tagging is the more important case. Data whose encoding is
> known only at run-time can be handled using a more ad-hoc method if
> necessary. I also struggled to find good names for these things; I
> don't find ct_string and rt_string great. Do any readers have suggestions?
In this thread, Andrew Sutton suggested using a single string type, like
this:
template<typename EncodingT = runtime_tag>
class estring { /* ... */ };
estring<> my_runtime_string;
estring<utf8> my_compiletime_string;
I'm still not sure about "estring", but at least it halves the number of
names we need to come up with!
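
To make the single-template idea concrete, here is a minimal sketch, not
code from either library. It assumes C++11, and the names encoding_id,
utf16, encoding() and raw() are hypothetical, as are the constructors. The
point is only that the default runtime_tag selects a specialization that
stores its encoding, while any other tag carries the encoding in the type.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding identifiers and tag types (illustration only).
enum class encoding_id { utf8, utf16 };

struct utf8  { static constexpr encoding_id id = encoding_id::utf8;  };
struct utf16 { static constexpr encoding_id id = encoding_id::utf16; };
struct runtime_tag {};

// Primary template: the encoding is fixed by the type, so nothing extra
// is stored per string.
template<typename EncodingT = runtime_tag>
class estring {
public:
    explicit estring(std::string bytes) : bytes_(std::move(bytes)) {}
    encoding_id encoding() const { return EncodingT::id; }
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

// Specialization for the default tag: the encoding is a data member
// chosen at run time.
template<>
class estring<runtime_tag> {
public:
    estring(std::string bytes, encoding_id enc)
        : bytes_(std::move(bytes)), enc_(enc) {}
    encoding_id encoding() const { return enc_; }
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
    encoding_id enc_;
};

// Usage: both flavours answer the same question, one from the type and
// one from a stored field.
inline void estring_example() {
    estring<> my_runtime_string("abc", encoding_id::utf16);
    estring<utf8> my_compiletime_string("abc");
    assert(my_runtime_string.encoding() == encoding_id::utf16);
    assert(my_compiletime_string.encoding() == encoding_id::utf8);
}
```
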
> Well it's actually decoding the utf16 and encoding the utf8. Maybe
> "transcode", and preferably as a free function:
>
> transcode(bar,foo);
Fair point. I'm not quite sure how I'd want it to work as a free
function yet, especially with regard to runtime-tagged strings.
>> rt_string baz;
>> baz.encode(bar,rt::utf8);
>
> So the encoding of the rt_string is not stored in the string?
It does store the encoding, but the call to "encode" provides it with a
*new* encoding type (utf8 in this case).
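
As a rough illustration of that behaviour (a sketch under assumptions, not
the actual implementation): the string keeps its own tag, and encode()
decodes the source according to the source's tag, re-encodes into the
requested target, and then records the new tag. Only Latin-1 and UTF-8
are handled to keep it short; rt::latin1, encoding() and raw() are
hypothetical names, and well-formed input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical runtime tags; only two encodings to keep the sketch short.
namespace rt { enum encoding { latin1, utf8 }; }

class rt_string {
public:
    rt_string() : enc_(rt::utf8) {}
    rt_string(std::string bytes, rt::encoding enc)
        : bytes_(std::move(bytes)), enc_(enc) {}

    rt::encoding encoding() const { return enc_; }
    const std::string& raw() const { return bytes_; }

    // Re-encode `src` (in whatever encoding it carries) as `target`,
    // replacing this string's contents and its stored tag.
    void encode(const rt_string& src, rt::encoding target) {
        std::string out;
        std::size_t i = 0;
        while (i < src.bytes_.size())
            write(out, read(src, i), target);
        bytes_ = std::move(out);
        enc_ = target;
    }

private:
    // Decode one code point starting at byte i and advance i.
    // Assumes well-formed input; there is no error handling.
    static char32_t read(const rt_string& s, std::size_t& i) {
        unsigned char b = static_cast<unsigned char>(s.bytes_[i++]);
        if (s.enc_ == rt::latin1 || b < 0x80) return b;
        int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;  // continuation bytes
        char32_t cp = b & (0x3F >> extra);              // payload of lead byte
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(s.bytes_[i++]) & 0x3F);
        return cp;
    }

    // Append one code point to `out` in the target encoding.
    static void write(std::string& out, char32_t cp, rt::encoding target) {
        if (target == rt::latin1) {
            assert(cp <= 0xFF && "code point not representable in Latin-1");
            out += static_cast<char>(cp);
        } else if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::string bytes_;
    rt::encoding enc_;
};
```

With this, "rt_string baz; baz.encode(bar, rt::utf8);" leaves baz tagged
as utf8 regardless of what bar was tagged as.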
> I'll try to find time to have a look, but I do encourage you to post
> more details to the list. That tends to generate more discussion than
> "please look at the code" proposals do.
Fair enough. I didn't want to inundate people with pages of text about
implementation details, so I tried to stay fairly high-level. I'll
provide more details as soon as possible, but I thought it best to start
with a brief overview to make sure I didn't make everyone's eyes glaze
over! :)
> So what is your underlying implementation? Not std::string?
Right now, it's just a static char array, for ease of implementation.
Obviously this will change, but I was more focused on designing an
interface that allowed compile-time and runtime determined transcoding
of strings.
> - A complete character set library is a lot of work.
>
> - A library that only understands Unicode is less work, but is it what
> people need?
I tried to address both of these issues by making it easy to extend
the library with whatever obscure encodings you need. I probably
wouldn't write an EBCDIC facet, but I'd certainly want people to be able
to roll their own if they need it.
> - Is there a consensus about mutable vs. immutable strings? Perhaps we
> should start by defining a new string concept, removing the
> character-set-unfriendly aspects of std::string like indexing using
> integers, and see what people think of it. I have been trying to use
> only std::algorithms and iterators with strings in new code, but it can
> often be simpler to use indexes and the std::string members that use or
> return them.
We should definitely take a look at std::string and try to extract the
essentially string-y components of it. std::string makes an awful lot of
assumptions about what's *in* a string, so it would be good to remove
all the unnecessary bits.
In an alternate universe, it might have been better to have std::string
and encoded_string (or whatever it should be called) act as views onto a
collection of bytes/words. Of course, encoded_string could act as a view
onto std::string (and/or QString, CString, MySpecialString), though I'm
not convinced that's a good solution at all!
> - It would be useful to factor out the actual Unicode bit-bashing
> operations. I have implementations of them that I have carefully tuned,
> and they are ready for wider use even though the rest of my code isn't.
My code is organized such that each encoding is a class with a read and
write method. My encoding classes don't feature much in the way of error
handling (yet), but they do work well with compliant strings.
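
For readers who want something concrete, here is a minimal sketch of the
"one class per encoding, with a read and a write" layout. It is an
illustration under assumptions rather than the code under discussion: it
assumes C++11, the signatures, the code_unit typedef and the transcode
template are hypothetical, and, like the description above, it does no
validation beyond a range assert and expects well-formed input.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// read() decodes one code point and advances the input iterator;
// write() encodes one code point through an output iterator.
struct utf8 {
    typedef char code_unit;

    template<typename In>
    static char32_t read(In& it) {
        unsigned char b = static_cast<unsigned char>(*it++);
        if (b < 0x80) return b;
        int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;  // continuation bytes
        char32_t cp = b & (0x3F >> extra);              // payload of lead byte
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
        return cp;
    }

    template<typename Out>
    static void write(char32_t cp, Out& out) {
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

struct utf16 {
    typedef char16_t code_unit;

    template<typename In>
    static char32_t read(In& it) {
        char32_t hi = *it++;
        if (hi < 0xD800 || hi > 0xDBFF) return hi;      // not a lead surrogate
        char32_t lo = *it++;                            // trail surrogate
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    template<typename Out>
    static void write(char32_t cp, Out& out) {
        assert(cp <= 0x10FFFF);
        if (cp < 0x10000) { *out++ = static_cast<char16_t>(cp); return; }
        cp -= 0x10000;
        *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
        *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
};

// A free function in the spirit of the transcode(bar, foo) suggestion:
// decode with From, encode with To, one code point at a time.
template<typename From, typename To>
std::basic_string<typename To::code_unit>
transcode(const std::basic_string<typename From::code_unit>& src) {
    std::basic_string<typename To::code_unit> dst;
    auto out = std::back_inserter(dst);
    auto it = src.begin();
    while (it != src.end())
        To::write(From::read(it), out);
    return dst;
}
```

Adding another encoding then means writing one more class with the same
two members, e.g. transcode<utf16, utf8>(foo) works without either class
knowing about the other.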
I'll take some time to look at your code and see what the differences
are compared to mine. I think one of the problems we'll run into is that
everyone has such particular ideas of what a Unicode library means
that it'll be extremely hard to please everyone, no matter what
the interface ends up looking like.
- Jim
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk