Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Robert Kawulak (robert.kawulak_at_[hidden])
Date: 2011-01-18 18:00:59


> From: Artyom
> Ok, let's think: what do you need iterators for? Accessing "characters"?
> If so, you are most likely doing something terribly wrong, as you ignore
> the fact that codepoint != character.
>
> I would say such iterator is wrong by design unless you develop
> a Unicode algorithm that relates to code point.

Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.

You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation, like:

- bitwise copy:
    std::copy(utf8_1.storage_begin(), utf8_1.storage_end(),
        utf8_2.storage_begin())
- check if utf32 is a substring of utf8, codepoint-wise:
    std::search(utf8.codepoint_begin(), utf8.codepoint_end(),
        utf32.codepoint_begin(), utf32.codepoint_end())
- character-wise copy ascii_t to utf16_t, considering the codepage of the ascii object:
    utf16_t utf16(ascii.character_begin(), ascii.character_end())
- count codepoints:
    std::distance(utf8.codepoint_begin(), utf8.codepoint_end())
- count characters:
    std::distance(utf8.character_begin(), utf8.character_end())
- get the 5th codepoint (std::advance needs an lvalue and returns void, so std::next fits better here):
    *std::next(utf8.codepoint_begin(), 4)
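
To make the codepoint level a bit more concrete, here is a minimal sketch of what such an iterator might look like over a plain std::string. Everything in it is an assumption for illustration only: the class codepoint_iterator, the free functions codepoint_begin/codepoint_end (standing in for the member functions above) and the helper contains_codepoints are hypothetical names, the input is assumed to be well-formed UTF-8, and no validation is attempted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical codepoint iterator over a std::string that is ASSUMED to hold
// well-formed UTF-8. It performs no validation.
class codepoint_iterator {
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;  // decoded by value, a simplification

    codepoint_iterator() : s_(nullptr), pos_(0) {}
    codepoint_iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}

    // Decode the codepoint that starts at the current storage position.
    char32_t operator*() const {
        assert(s_ && pos_ < s_->size());
        const unsigned char lead = unit(pos_);
        const std::size_t len = length(lead);
        assert(pos_ + len <= s_->size());
        // Strip the length marker bits from the lead unit.
        char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
        // Each continuation unit contributes its low 6 bits.
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (unit(pos_ + i) & 0x3F);
        return cp;
    }

    // Skip over all storage units of the current codepoint.
    codepoint_iterator& operator++() {
        pos_ = std::min(pos_ + length(unit(pos_)), s_->size());
        return *this;
    }
    codepoint_iterator operator++(int) {
        codepoint_iterator old(*this);
        ++*this;
        return old;
    }

    friend bool operator==(const codepoint_iterator& a, const codepoint_iterator& b) {
        return a.s_ == b.s_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const codepoint_iterator& a, const codepoint_iterator& b) {
        return !(a == b);
    }

private:
    unsigned char unit(std::size_t i) const {
        return static_cast<unsigned char>((*s_)[i]);
    }
    // Sequence length implied by the lead unit; unexpected units count as 1.
    static std::size_t length(unsigned char lead) {
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
        return 1;                           // ASCII, or a stray unit
    }

    const std::string* s_;
    std::size_t pos_;
};

// Hypothetical free-function stand-ins for utf8.codepoint_begin()/codepoint_end().
inline codepoint_iterator codepoint_begin(const std::string& s) {
    return codepoint_iterator(s, 0);
}
inline codepoint_iterator codepoint_end(const std::string& s) {
    return codepoint_iterator(s, s.size());
}

// Codepoint-wise substring check of a UTF-32 needle in a UTF-8 haystack.
inline bool contains_codepoints(const std::string& utf8, const std::u32string& utf32) {
    return utf32.empty() ||
           std::search(codepoint_begin(utf8), codepoint_end(utf8),
                       utf32.begin(), utf32.end()) != codepoint_end(utf8);
}
```

With this, std::distance(codepoint_begin(s), codepoint_end(s)) counts codepoints and *std::next(codepoint_begin(s), 4) reads the 5th one. A character iterator is a different matter: it would have to group codepoints into grapheme clusters (Unicode Standard Annex #29), which this sketch does not attempt.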

I don't know Unicode's quirks well enough to tell how useful this interface would be, but it seems interesting. What do you think?

Best regards,
Robert

