From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2004-04-16 17:05:16

"Reece Dunn" <msclrhd_at_[hidden]> writes:

> [snip]

>> One issue which would need to be dealt with is that while it seems
>> necessary for some containers, such as a rope, to have direct access to
>> the container, publicly inheriting this unicode_string from the
>> container type means that the additions to the interface must be more
>> limited.

> I don't get this. Surely the roped_vector, or whatever rope-like container is
> used, will have an STL container interface like std::vector so you would use
> them interchangeably. The unicode_string facilities would make use of the
> insert/append functions, iterators, etc. of the storage container to implement
> their specific facilities.

The issue is that the benefits of using a specialized data structure
such as a rope are likely seen only by using the rope-specific
interface; the container interface would probably not provide many
advantages. Access to the underlying container could nonetheless be
provided by a container() method.
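
As a rough sketch of that last point (illustrative only: the class layout,
the constructor and code_unit_count() are assumptions made up for this
example, not a proposed interface), holding the container as a member
instead of inheriting from it keeps the Unicode-level interface unconstrained
while still exposing the storage:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper; none of these names come from an existing library.
template <typename Container>
class unicode_string {
public:
    unicode_string() {}
    explicit unicode_string(const Container& c) : storage_(c) {}

    // Unicode-level operations would be declared here. This placeholder
    // only reports the number of stored code units.
    std::size_t code_unit_count() const { return storage_.size(); }

    // Direct access to the underlying storage, so that container-specific
    // operations (for example those of a rope) remain reachable.
    Container& container() { return storage_; }
    const Container& container() const { return storage_; }

private:
    Container storage_;  // held by composition, not public inheritance
};
```

A caller that knows the storage is a rope could then call the rope-specific
members through s.container(), while generic code uses only the
unicode_string interface.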

> [snip]

>> There is a significant advantage, however, in standardizing on a single
>> POD-array representation as well, such that much of the library need not
>> be templated (and thus implemented largely in header files, and thus
>> compiled each use), and less efficient functions could also be provided
>> which allow arbitrary iterators through runtime polymorphism. I think
>> it will be particularly important to examine this trade-off, because
>> Unicode support involves a large number of heavy-weight facilities.

> Agreed. However, this is contradictory to allowing the user to specify the
> container used for string storage.

Yes, I realize that. On the one hand, I really like making everything
work with any encoding, any container, etc. On the other hand, I don't
think it is feasible to stick everything in header files, although it
may prove possible to make the access to the locale and Unicode data
non-templated, and thus limit the code that must be in the header files.

> Maybe having a templatized version for users that want a custom storage policy,
> like a rope, and a static representation (UTF-16?) for those that are not
> bothered about how the unicode string is stored. The interfaces of these should
> be the same to allow the higher-level facilities to interoperate with both
> representations.

It seems that this might introduce even more overhead. A solution that
is less efficient at run time, but more efficient in code size and
compile time, would be, as I described, to provide a UTF-16 array
interface plus a run-time polymorphic iterator interface; the latter
would be used (automatically) for every source or iterator that is not
a UTF-16 array.
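
A minimal sketch of that split might look like the following. All names
here (utf16_unit, code_unit_source, iterator_source, count_surrogates) are
hypothetical, and the example operation is deliberately trivial. It also
assumes the wrapped iterator already yields UTF-16 code units; a real
adapter for other encodings would have to transcode inside next().

```cpp
#include <cstddef>
#include <vector>

typedef unsigned short utf16_unit;  // assumption: holds one UTF-16 code unit

// Hypothetical run-time polymorphic source of UTF-16 code units.
class code_unit_source {
public:
    virtual ~code_unit_source() {}
    // Stores the next code unit in 'out'; returns false at the end.
    virtual bool next(utf16_unit& out) = 0;
};

// Thin templated adapter: the only part that must live in a header.
template <typename Iterator>
class iterator_source : public code_unit_source {
public:
    iterator_source(Iterator first, Iterator last) : cur_(first), end_(last) {}
    virtual bool next(utf16_unit& out) {
        if (cur_ == end_) return false;
        out = static_cast<utf16_unit>(*cur_);
        ++cur_;
        return true;
    }
private:
    Iterator cur_, end_;
};

inline bool is_surrogate(utf16_unit u) { return u >= 0xD800 && u <= 0xDFFF; }

// Fast path over a contiguous UTF-16 array. Not templated, so it could be
// compiled once in a .cpp file.
std::size_t count_surrogates(const utf16_unit* p, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (is_surrogate(p[i])) ++count;
    return count;
}

// Slower path for everything else: one virtual call per code unit,
// but again not templated.
std::size_t count_surrogates(code_unit_source& src) {
    std::size_t count = 0;
    utf16_unit u;
    while (src.next(u))
        if (is_surrogate(u)) ++count;
    return count;
}
```

The heavy-weight algorithms would be written once against these two
non-templated entry points; the cost is a virtual call per code unit on the
slow path.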

In practice then, it might be useful to limit unicode_string at least
to containers which can provide an array of code units, such as vector
or basic_string. (Unfortunately, the interface for getting the array
of code units differs for vector and basic_string.) Alternatively, it
might make sense to not allow the user to specify a storage container
to unicode_string. Maybe you have some other ideas about this.
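
To illustrate the vector/basic_string difference mentioned above, a pair
of overloads could hide it behind one name. This is only a sketch, and
code_units() is a made-up helper, not an existing function:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper giving uniform read access to the code-unit array.

// vector: take the address of the first element (null if empty).
template <typename CharT, typename Alloc>
const CharT* code_units(const std::vector<CharT, Alloc>& v) {
    return v.empty() ? 0 : &v[0];
}

// basic_string: data() returns a pointer to the character array.
template <typename CharT, typename Traits, typename Alloc>
const CharT* code_units(const std::basic_string<CharT, Traits, Alloc>& s) {
    return s.data();
}
```

A unicode_string limited to such containers could then pass
code_units(c) and c.size() to the non-templated array interface.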

To get a sense of just how complex Unicode handling is, download the
source to the ICU library: (9.4 MB) or (8.3 MB)

For instance, searching is implemented in usearch.cpp.

> [snip: searching]

>> [snip: grapheme cluster iterator notes]

> That makes sense. Going further into grapheme clusters would be too complicated
> for a generic unicode library, as you would need to then consider how to map the
> cluster into the appropriate font: this would be platform specific and far too
> complex (e.g. overlaying combining marks, etc.)!

The ICU library provides some additional facilities which could be used
by formatting engines, such as an implementation of the Unicode
Bidirectional Algorithm. It might be best to avoid trying to add such
facilities to Boost, however.

Jeremy Maitin-Shepard
