Subject: [boost] Boost.Unicode (was Re: Boost.Locale)
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-12-15 08:15:46


On 15/12/2010 08:20, Matus Chochlik wrote:

> IMO a lot of people would find something like this extremely useful
> (even if not extremely efficient).
>
> std::string s = get_utf8_string();
> WhatEverWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(s).c_str(), ...);
>
> or
>
> std::wstring ws = get_string();
> AnotherWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(ws).c_str(), ...);

The interface is modeled after that of standard algorithms, and
therefore it takes an output iterator to write the output to, rather
than creating a container directly.

// ws is a std::string (utf-8) or std::wstring (utf-16 or utf-32).
std::basic_string<TCHAR> out;
utf_transcode<TCHAR>(ws, std::back_inserter(out));
AnotherWinapiFunc(..., out.c_str(), ...);

Assuming TCHAR is either char or wchar_t this should work out of the box.

Taking an output iterator is quite practical. For example, you can
easily do two passes: one to count how many characters you need, and
one to copy the data into storage of exactly that size.
Or you can just grow the container as you add elements, as
std::back_inserter does.

Something like

convert_to<std::basic_string<TCHAR>>(utf_transcode<TCHAR>(ws)).c_str()

would also work, but that's maybe a bit verbose.
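
To make the two-pass idea above concrete, here is a minimal sketch. It
does not use Boost.Unicode: encode_utf8, counting_iterator and the two
to_utf8_* helpers are hypothetical names invented for this example, and
the encoder assumes its input contains only valid code points. The only
point is that one algorithm taking an output iterator can both count
and copy.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical stand-in for a transcoder (NOT the Boost.Unicode API):
// encodes UTF-32 code points as UTF-8 and writes every byte through `out`.
// Assumes the input holds only valid code points (no validation is done).
template <typename OutputIt>
OutputIt encode_utf8(const std::u32string& in, OutputIt out) {
    for (char32_t cp : in) {
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical output iterator that discards what is written and only
// counts the number of writes.
struct counting_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t* count;

    counting_iterator& operator*() { return *this; }
    counting_iterator& operator++() { return *this; }
    counting_iterator operator++(int) { return *this; }
    counting_iterator& operator=(char) { ++*count; return *this; }
};

// Two passes: count first, then copy into storage of exactly that size.
std::string to_utf8_two_pass(const std::u32string& in) {
    std::size_t n = 0;
    encode_utf8(in, counting_iterator{&n});            // pass 1: count only
    std::string out(n, '\0');
    std::string::iterator end = encode_utf8(in, out.begin());  // pass 2: copy
    assert(end == out.end());  // pass 2 wrote exactly what pass 1 counted
    (void)end;
    return out;
}

// One pass: let the container grow as elements are appended.
std::string to_utf8_growing(const std::u32string& in) {
    std::string out;
    encode_utf8(in, std::back_inserter(out));
    return out;
}
```

Both helpers return the same bytes. The two-pass version allocates
once, at the price of walking the input twice.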

> Another thing is some kind of adaptor for std::(w)string providing begin()/end()
> functions returning an iterator traversing through the code points instead
> of utf-XY "chars". i.e. in C++0x:
>
> std::string s = get_utf8_string();
> auto as = adapt(s);
> auto i = as.begin(), e = as.end();
> while(i != e)
> {
> char32_t c = *i;

Replace adapt(s) with utf_decode(s).

> ...
> *i = transform(c);

No, you can't do that.
Data accessed like this is immutable.

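It's not impossible to make them mutable (though it's a bit complicated
in the code, since the range concepts don't support inserting/erasing
elements), but it's probably not a good idea: a code point can be
replaced by one whose encoding has a different length, which means
shifting the rest of the string, so each write would be O(n) worst case.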

If you really want to do that, you can already do it using i.base() and
next(i).base(), which gives you the range of the character in terms of
original std::string iterators, so you can use std::string::replace.
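
As a rough illustration of that approach, here is a self-contained
sketch. It is not Boost.Unicode code: utf8_cp_iterator and
replace_code_point are hypothetical names made up for this example, and
the iterator assumes the string is valid UTF-8. What it shows is how
the base() positions of two adjacent code-point iterators delimit the
bytes to hand to std::string::replace.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical read-only iterator over the code points of a std::string
// that is assumed to hold valid UTF-8 (NOT the Boost.Unicode iterator).
class utf8_cp_iterator {
public:
    explicit utf8_cp_iterator(std::string::const_iterator pos) : pos_(pos) {}

    // Decode the code point starting at the current byte position.
    char32_t operator*() const {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        std::size_t len = length(lead);
        char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[k]) & 0x3F);
        return cp;
    }

    // Advance by one code point, i.e. by 1 to 4 bytes.
    utf8_cp_iterator& operator++() {
        pos_ += static_cast<std::ptrdiff_t>(
            length(static_cast<unsigned char>(*pos_)));
        return *this;
    }

    // Position in terms of the underlying std::string iterators.
    std::string::const_iterator base() const { return pos_; }

private:
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }
    std::string::const_iterator pos_;
};

// Replace every occurrence of code point `from` with the UTF-8 sequence `to`.
// Offsets are used across iterations because replace() invalidates iterators.
void replace_code_point(std::string& s, char32_t from, const std::string& to) {
    std::size_t offset = 0;
    while (offset < s.size()) {
        utf8_cp_iterator i(s.cbegin() + static_cast<std::ptrdiff_t>(offset));
        utf8_cp_iterator next = i;
        ++next;
        // [i.base(), next.base()) is the byte range of this code point.
        std::size_t len = static_cast<std::size_t>(next.base() - i.base());
        assert(offset + len <= s.size());  // fails on truncated input
        if (*i == from) {
            s.replace(offset, len, to);    // O(n): shifts the tail of s
            offset += to.size();
        } else {
            offset += len;
        }
    }
}
```

Each replace call may move everything after the replaced code point,
which is the O(n) worst case mentioned above.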

> ++i;
> }
>
> I have just scrolled through the docs for Boost.Unicode some time ago
> so maybe it is already there and I've missed it. If so, links to some
> examples showing this would be appreciated.

Of course it's there; transcoding between UTF encodings is the most
basic feature.

