From: Nils Springob (nils.springob_at_[hidden])
Date: 2006-09-18 14:49:50


That's fine! I didn't expect it to be in the regex header dir...

OK, using the existing code would give something like the following:

template <class Char8Iterator = std::string::iterator>
typedef unicode_string<u32_to_u8_iterator<Char8Iterator>, u8_to_u32_iterator<Char8Iterator> > utf8_string;

template <class Char16Iterator = std::basic_string<boost::uint16_t>::iterator>
typedef unicode_string<u32_to_u16_iterator<Char16Iterator>, u16_to_u32_iterator<Char16Iterator> > utf16_string;

template <class Char32Iterator = std::basic_string<boost::uint32_t>::iterator>
typedef unicode_string<Char32Iterator, Char32Iterator> utf32_string;

This would store utf8 and utf16 strings internally in their raw format while still allowing access to the
32-bit utf32 values. The unicode_string class should implement most of the std::basic_string methods;
however, most of these methods would have linear complexity:

bool empty() // constant = O(1);
size_type size() // linear = O(s1.size());
append (const unicode_string & s2) // linear = O(s2.size());
append (uint32_t & uc) // constant = O(1);
insert (size_type pos, const unicode_string & s2) // linear = O(s1.size()+s2.size());
insert (size_type pos, uint32_t & uc) // linear = O(s1.size());
int compare (const unicode_string & s2) // linear = O(s1.size()+s2.size());
erase (size_type pos, size_type n) // linear = O(s1.size());
replace (size_type i, size_type n, const unicode_string & s2) // linear = O(s1.size()+s2.size());
substr (size_type i, size_type n) // linear = O(s1.size());
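
To illustrate where these bounds come from, here is a minimal sketch for the utf8 case. The helpers utf8_length and utf8_append are hypothetical names, not part of the existing Boost code, and the input is assumed to be well-formed UTF-8. Counting code points has to look at every raw byte, while appending one code point writes at most four bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: number of code points in a well-formed UTF-8 string.
// Every raw byte is inspected once, hence O(raw.size()).
std::size_t utf8_length(const std::string& raw) {
    std::size_t count = 0;
    for (unsigned char c : raw) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Hypothetical helper: append one code point encoded as UTF-8.
// At most four bytes are written, hence (amortized) O(1).
void utf8_append(std::string& raw, std::uint32_t cp) {
    // Only Unicode scalar values are accepted (no surrogates, nothing above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        raw += static_cast<char>(cp);
    } else if (cp < 0x800) {
        raw += static_cast<char>(0xC0 | (cp >> 6));
        raw += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        raw += static_cast<char>(0xE0 | (cp >> 12));
        raw += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        raw += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        raw += static_cast<char>(0xF0 | (cp >> 18));
        raw += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        raw += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        raw += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Operations that take a position (insert, erase, replace, substr) first have to walk the raw data in the same way to locate that position, which is why they end up linear in s1.size().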

All find methods would have the same complexity as the corresponding std::basic_string methods, because
they can be implemented to work directly on the raw data!
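
A rough sketch of this idea for utf8, assuming both strings are well-formed UTF-8: because a UTF-8 lead byte can never be mistaken for a continuation byte, a plain byte-wise search cannot produce a match that starts in the middle of a character. The function name utf8_find is hypothetical. Note that turning the resulting byte offset back into a code point index, as done here, costs an extra linear pass over the prefix.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: search for `needle` in `haystack` (both well-formed UTF-8)
// and return the index of the match counted in code points, or std::string::npos.
std::size_t utf8_find(const std::string& haystack, const std::string& needle) {
    // The search itself runs on the raw bytes with the ordinary std::string::find.
    const std::size_t byte_pos = haystack.find(needle);
    if (byte_pos == std::string::npos) return std::string::npos;

    // Convert the byte offset into a code point index by counting
    // the non-continuation bytes that precede the match.
    std::size_t index = 0;
    for (std::size_t i = 0; i < byte_pos; ++i) {
        if ((static_cast<unsigned char>(haystack[i]) & 0xC0) != 0x80) ++index;
    }
    return index;
}
```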

isalpha, isupper and the other functions can be defined on the utf32 values (or the wchar_t version can
be used on some platforms) in the boost namespace. In the same way, the Unicode Category Values can be
implemented for the utf32 values.

Additionally there could be iterators to support other encodings like latin1 (latin1_to_u32<> and u32_to_latin1<>).

Transformations could be done by simple assignment:

latin1_string l1 = "a simple test with äöü"; // given that latin1 is the system encoding
utf8_string u8 = l1;

This would convert a latin1-encoded string into a utf8-encoded string.
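
What such an assignment would have to do internally can be sketched as follows. This is only an illustration on plain std::string, with the hypothetical name latin1_to_utf8 standing in for the proposed conversion. It relies on the fact that every latin1 byte value is identical to the corresponding Unicode code point (U+0000 to U+00FF), so each byte becomes either one or two UTF-8 bytes.

```cpp
#include <string>

// Hypothetical stand-in for the proposed `utf8_string u8 = l1;` conversion.
// Each latin1 byte is its own code point, so values below 0x80 are copied
// unchanged and values from 0x80 to 0xFF become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```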

