Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-26 09:21:39


On Wed, Jan 26, 2011 at 3:06 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
[snip/]
>
> Fine. If immutable strings with backward compatibility result in changing
> string.resize(91);
> to
> string = string.resize(91);

I don't see why this should be needed.

> I vote against immutability. Even if you throw compatibility away, no one
> has explained yet why immutable strings are better. For me it smells like
> "modern language here" influence.
>
[snip/]
>>
>> If you need just this, then why not use std::string as
>> it is now for my_string and use any of the Unicode libraries
>> around. What I would like is a string with which I *can* forget
>> that there ever was anything like other encodings than Unicode
>> except for those cases where it is completely impossible.
>>
>> And even in those cases, like when calling an OS API function
>> I don't want to specify exactly what encoding I want but just
>> to say: Give me a representation (or "view" if you like) of the string
>> that is in the "native" encoding of the currently selected locale
>> for the desired character type.
>>
>
> Let me try to explain myself in other words. I propose the iterator_range
> idea as the means by which you (we) achieve our goal. It's more like a C++08
> concept. By "my_string whose encoding is known" I meant that strings like
> u8string should map to string_ranges with typename encoding == utf_8 (for
> example). As a result you *won't* need to specify the exact encoding because
> it will be deduced from the context. The only place you will write the
> encoding explicitly is at the boundaries of your code and legacy APIs.
>
> Look at the code you provided:
>
>
>> Something like this:
>>
>> whatever_the_string_class_name_will_be cmd = init();
>> system(cmd.native<char>().c_str());
>> ShellExecute(..., cmd.native<TCHAR>().c_str(), ...);
>> ShellExecuteW(..., cmd.native<wchar_t>().c_str(), ...);
>> wxExecute(cmd.native<wxChar>());
>>
>> or
>>
>> whatever_the_string_class_name_will_be caption = get_non_ascii_string();
>> new wxFrame(parent, wxID_ANY, caption.native<wxChar>(), ...);
>>
>
> The ShellExecuteW, wxExecute and wxFrame calls are actually *more verbose
> than they have to be*. wxString is documented to be utf16 encoded, as is
> LPCWSTR on Windows. So, by providing a mapping from wxString to the
> string_range concept, you could write it as:
>
> wxExecute(cmd);  // creates utf16 wxString
> new wxFrame(parent, wxID_ANY, caption, ...); // creates utf16 wxString
>
> As a result *less* code will be affected when switching to utf8.

OK, if this is doable in the context of Boost, then you certainly
will not hear any complaining from me.
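
To make the two call styles above concrete, here is a minimal sketch, not a
proposed implementation. Everything in it is hypothetical: text_string stands
in for "whatever_the_string_class_name_will_be", fake_utf16_api stands in for
a UTF-16 API such as wxExecute, and the "native" encoding is simply assumed
to be UTF-8 for char and UTF-16 for char16_t instead of being taken from the
locale. The transcoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hand-written UTF-8 -> UTF-16 transcoder. Assumes well-formed input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx.
        const std::size_t n = b0 < 0x80 ? 1 : (b0 >> 5) == 0x6 ? 2
                            : (b0 >> 4) == 0xE ? 3 : 4;
        if (i + n > in.size()) break;  // truncated tail: stop instead of reading past the end
        char32_t cp = (n == 1) ? b0 : (b0 & (0xFF >> (n + 1)));
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += n;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical string class that stores UTF-8 internally.
class text_string {
public:
    explicit text_string(std::string utf8) : bytes_(std::move(utf8)) {}

    // Explicit style: cmd.native<CharT>()
    template <class CharT>
    std::basic_string<CharT> native() const;

    // Implicit style: the target type determines the encoding.
    operator std::u16string() const { return utf8_to_utf16(bytes_); }

private:
    std::string bytes_;
};

// Assumption: native char encoding is UTF-8, so this is a plain copy.
template <>
inline std::string text_string::native<char>() const { return bytes_; }

// Assumption: native char16_t encoding is UTF-16.
template <>
inline std::u16string text_string::native<char16_t>() const {
    return utf8_to_utf16(bytes_);
}

// Stand-in for a UTF-16 API; returns the number of UTF-16 code units it received.
inline std::size_t fake_utf16_api(const std::u16string& s) { return s.size(); }
```

With this, fake_utf16_api(cmd.native<char16_t>()) and fake_utf16_api(cmd)
do the same thing; the second form just lets the parameter type pick the
encoding.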

[snip/]
> This is what I meant.
>
>> // cp_begin returning a "code-point-iterator"
>> auto i = str.cp_begin(), e = str.cp_end();
>> if(i != e && *i == code_point(0x0123)) do_something();
>>
>> or even (if this is possible):
>>
>> // cr_begin returning a character iterator
>> auto i = str.cr_begin(), e = str.cr_end();
>> // if the first character is A with acute ...
>> if(i != e && *i == unicode_character({0x0041, 0x0301}))
>>    do_something();
>>
>
> I prefer:
> auto i = codepoints(str).begin(), e = codepoints(str).end();
> auto i = characters(str).begin(), e = characters(str).end();

I really don't insist on cr_begin, etc. to be member functions
(nor on calling them cr_begin, ..., for that matter).
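
For illustration, the free-function form could look roughly like the sketch
below. The names codepoint_iterator and codepoint_range are made up here, the
input is taken as a plain std::string that is simply assumed to hold
well-formed UTF-8, and there is no validation or error handling at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical forward iterator that decodes one UTF-8 sequence per step.
// Assumes well-formed UTF-8; malformed input is not handled.
class codepoint_iterator {
public:
    explicit codepoint_iterator(const char* p) : p_(p) {}

    char32_t operator*() const {
        const unsigned char b0 = static_cast<unsigned char>(*p_);
        const std::size_t n = length(b0);
        // Keep only the payload bits of the lead byte, then append 6 bits
        // from each continuation byte.
        char32_t cp = (n == 1) ? b0 : (b0 & (0xFF >> (n + 1)));
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[k]) & 0x3F);
        return cp;
    }

    codepoint_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }

    bool operator==(const codepoint_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const codepoint_iterator& o) const { return p_ != o.p_; }

private:
    // Sequence length from the lead byte.
    static std::size_t length(unsigned char b0) {
        if (b0 < 0x80) return 1;
        if ((b0 >> 5) == 0x6) return 2;
        if ((b0 >> 4) == 0xE) return 3;
        return 4;
    }

    const char* p_;
};

// Hypothetical non-owning range of code points over a byte sequence.
class codepoint_range {
public:
    codepoint_range(const char* first, const char* last)
        : first_(first), last_(last) {}
    codepoint_iterator begin() const { return codepoint_iterator(first_); }
    codepoint_iterator end() const { return codepoint_iterator(last_); }

private:
    const char* first_;
    const char* last_;
};

// Free function in the style of codepoints(str).begin() / .end().
inline codepoint_range codepoints(const std::string& s) {
    return codepoint_range(s.data(), s.data() + s.size());
}

// Small helper showing that the range also works with range-based for.
inline std::size_t count_codepoints(const std::string& s) {
    std::size_t n = 0;
    for (char32_t cp : codepoints(s)) {
        (void)cp;
        ++n;
    }
    return n;
}
```

The range does not own or copy the bytes, so the string has to outlive it.
A characters(str) range would need grapheme cluster rules on top of this
and is not attempted here.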

>
> So
> 1) we can extend the syntax uniformly to words, sentences etc...
> 2) str may be of any type that maps to the string_range concept, be it
> boost::string or (when a switch to utf8 occurs) std::string or a string
> literal.
>
> If str is not mapped to string_range then the programmer must specify the
> encoding explicitly.
> std::string str = "hi";
> const char* str2 = exception.what();
> auto i = codepoints(treat_as<utf_8>(str)).begin(); // no-copy, no-op, just a cast.
> auto i = codepoints(treat_as<utf_8>(str2)).begin(); // works
> auto i = codepoints(str).begin(); // error: string is of unknown encoding.
>                                   // Compiles in 20 years when everyone uses utf8.
>
> boost::string (whatever name) will be just an std::string mapped to
> string_range in utf_8 encoding.

If we can wrap the treat_as<utf_8> into something that
does not refer to any encoding whatsoever in cases
where you don't have to, then *thumbs up*.
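
One way such a wrapper might look, purely as a sketch: the encoding is named
once in a single typedef and everything else goes through a helper that does
not mention it. The names encoded_view, default_encoding and as_text are
invented for this example; only treat_as and utf_8 come from the discussion
above, and their definitions here are guesses at the intent.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

struct utf_8 {};  // encoding tag, carries no data

// Hypothetical non-owning view of bytes tagged with an encoding at compile time.
template <class Encoding>
struct encoded_view {
    typedef Encoding encoding;
    const char* first;
    const char* last;
};

// "No-copy, no-op, just a cast": only the static type changes.
template <class Encoding>
encoded_view<Encoding> treat_as(const std::string& s) {
    return encoded_view<Encoding>{s.data(), s.data() + s.size()};
}

// Assumption: the project picks its encoding in exactly one place.
typedef utf_8 default_encoding;

// Wrapper that lets calling code avoid naming any encoding.
inline encoded_view<default_encoding> as_text(const std::string& s) {
    return treat_as<default_encoding>(s);
}

// Example of a function that accepts only tagged views, so a bare
// std::string of unknown encoding is rejected at compile time.
template <class Encoding>
std::size_t byte_length(encoded_view<Encoding> v) {
    return static_cast<std::size_t>(v.last - v.first);
}
```

Here byte_length(as_text(str)) compiles while byte_length(str) does not,
and switching the project to another encoding would mean touching only the
typedef.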

OK

Matus


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk