From: Kirit Sælensminde (kirit.saelensminde_at_[hidden])
Date: 2007-09-27 04:05:48


(Sorry if this is a double post, I'd not subscribed to the list first time)

Joseph Gauterin wrote:
>>> If you change state_type in the char_traits, you'd be able to
>>> differentiate the various basic_string types and include information
>>> about the character encoding without writing a whole lot of new code.
>> Thanks for the suggestion. I need to learn some more about this corner
>> of "namespace std", clearly, before I go and re-invent something.
> IIRC, some of the non-const std::basic_string methods aren't suitable
> for handling variable width encodings like utf8 and utf16 - non-const
> operator[] in particular returns a reference to the character type - a
> big problem if you want to assign a value > 0x7F (i.e. a character
> that uses 2 or more bytes).
>
> I've noticed that there are frequent requests/proposals for some sort
> of boost unicode/string encoding library. I've thought about the
> problem and it seems too big for one person to handle in their spare
> time - perhaps a group of us should get together to discuss working on
> one? I'd be happy to participate.

I'm going to chime in here to say that I've been using a string
implementation similar to this for a few years now. Our systems are on
Windows so we want UTF-16 where we interface with Windows APIs and other
Windows software, but we wanted to put all of the surrogate pairs stuff
in one place.

Our FSLib::wstring uses UTF-32 characters for character interfaces (i.e.
at() and operator[]), but UTF-16 internally. We throw out the non-const
operator[] and the non-const iterator. They haven't really been missed.
We also have to offer a std_str() which returns a std::wstring and
buffer_begin() and buffer_end() which return wchar_t* so we can use
Boost.Regex etc.
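
To make the shape of that interface concrete, here is a minimal sketch in
modern C++. It is not FSLib::wstring: the class name utf16_string, the use
of char16_t/std::u16string in place of wchar_t/std::wstring, and the
linear-time indexing are all my assumptions for illustration. It only shows
the idea of UTF-16 storage with read-only UTF-32 access, so the surrogate
pair handling lives in one place.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch: UTF-16 code units inside, whole code points outside.
class utf16_string {
public:
    // Append one code point, splitting it into a surrogate pair if needed.
    void push_back(char32_t cp) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x10000) {
            buffer_.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            buffer_.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            buffer_.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }

    // Number of code points: every unit except a trail surrogate starts one.
    std::size_t length() const {
        std::size_t n = 0;
        for (char16_t u : buffer_)
            if (!is_trail(u)) ++n;
        return n;
    }

    // Read-only access by code point index (linear scan). Returned by value,
    // so there is no reference through which a caller could assign.
    char32_t at(std::size_t index) const {
        std::size_t i = 0;
        while (i < buffer_.size()) {
            const char16_t u = buffer_[i];
            const bool pair = is_lead(u) && i + 1 < buffer_.size() &&
                              is_trail(buffer_[i + 1]);
            if (index == 0) {
                if (!pair) return u;
                return 0x10000 +
                       ((static_cast<char32_t>(u) - 0xD800) << 10) +
                       (static_cast<char32_t>(buffer_[i + 1]) - 0xDC00);
            }
            --index;
            i += pair ? 2 : 1;
        }
        throw std::out_of_range("utf16_string::at");
    }

    char32_t operator[](std::size_t index) const { return at(index); }

    // Raw UTF-16 view for APIs that want code units.
    const std::u16string& buffer() const { return buffer_; }

private:
    static bool is_lead(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    static bool is_trail(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }
    std::u16string buffer_;
};
```

Because the only way to add text in this sketch is push_back(), the buffer
is well formed by construction, and there is deliberately no non-const
operator[] to break that.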

I've also started looking at tagged types for many of the same sorts of
things already mentioned. I also want to use them to describe other
types of encodings such as HTTP query string and file specification
encodings, HTML attribute encoding, SQL statement string encoding etc.
The idea is that it would be impossible to concatenate a query string
encoded string to an HTML attribute encoded one without using the
correct conversion function.

The aim is to improve security by defeating things like XSS attacks on
web servers and SQL injection attacks. I've been looking at making the
conversions happen through explicit constructors so that the types stay
easy to use.
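
A rough sketch of what I mean by tagged types with explicit converting
constructors, in modern C++. Every name here (tagged_string, raw_text_tag,
html_attribute_tag, convert) is made up for illustration, and the HTML
escaping is a simplified stand-in rather than a vetted encoder.

```cpp
#include <string>
#include <utility>

// Hypothetical tag types; each one names an encoding.
struct raw_text_tag {};
struct html_attribute_tag {};

// One overload per supported conversion. This one escapes the characters
// that are significant inside a quoted HTML attribute value.
inline std::string convert(const std::string& in, raw_text_tag,
                           html_attribute_tag) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}

template <typename Tag>
class tagged_string {
public:
    tagged_string() = default;

    // Trusts the caller: s must already be in the encoding named by Tag.
    explicit tagged_string(std::string s) : value_(std::move(s)) {}

    // Conversion from another encoding. It is explicit, and it compiles
    // only if a matching convert() overload exists.
    template <typename From>
    explicit tagged_string(const tagged_string<From>& other)
        : value_(convert(other.str(), From{}, Tag{})) {}

    const std::string& str() const { return value_; }

private:
    std::string value_;
};

// Concatenation is defined only when both sides carry the same tag.
template <typename Tag>
tagged_string<Tag> operator+(const tagged_string<Tag>& a,
                             const tagged_string<Tag>& b) {
    return tagged_string<Tag>(a.str() + b.str());
}

using raw_text = tagged_string<raw_text_tag>;
using html_attribute = tagged_string<html_attribute_tag>;
```

With this, given a raw_text user and an html_attribute prefix, the
expression prefix + user does not compile, while
prefix + html_attribute(user) does, and the escaping happens in the
constructor.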

A final thing I've just started to look at is to get the compiler to
choose the best internal representation out of UTF-8, UTF-16 and UTF-32
for general use, but it's not something I've gotten very far with.

K

