Subject: Re: [boost] [string] proposal
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-01-26 09:06:26


On Wed, Jan 26, 2011 at 15:04, Matus Chochlik <chochlik_at_[hidden]> wrote:

> On Wed, Jan 26, 2011 at 12:42 PM, Yakov Galka <ybungalobill_at_[hidden]>
> wrote:
> > On Wed, Jan 26, 2011 at 11:54, Matus Chochlik <chochlik_at_[hidden]>
> wrote:
> [snip/]
> >>
> >> I'm fairly neutral on the immutability issue, I do not oppose it if
> >> someone shows why it is a superior design, provided it does not
> >> break everything horribly (from the backward compatibility perspective).
> >>
> >
> > Me too, but it definitely will break existing code:
> > string.resize(91);
>
> This is just one of the examples. The append/prepend/etc.
> are others. The question is: do we allow them for the
> sake of the backward compatibility and implement them
> by using the immutable-semantic. Even resize could be
> implemented this way. Another matter is whether it makes
> sense.
>

Fine. If immutable strings with backward compatibility result in changing
string.resize(91);
to
string = string.resize(91);
I vote against immutability. Even if you throw compatibility away, no one
has explained yet why immutable strings are better. To me it smells like
"modern language here" influence.
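
To make the breakage concrete, here is a minimal sketch of an immutable string whose resize returns a new value. The class immutable_string and its members are hypothetical names for illustration only, not an existing or proposed Boost type.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: the shared buffer is never modified
// after construction, so copies can share it safely.
class immutable_string {
public:
    immutable_string(std::string s = std::string())
        : data_(std::make_shared<std::string>(std::move(s))) {}

    // Non-mutating counterpart of std::string::resize:
    // *this is left untouched and the resized value is returned.
    immutable_string resize(std::size_t n, char fill = '\0') const {
        std::string copy(*data_);
        copy.resize(n, fill);
        return immutable_string(std::move(copy));
    }

    std::size_t size() const { return data_->size(); }
    const std::string& str() const { return *data_; }

private:
    std::shared_ptr<const std::string> data_;
};
```

With such a class, an old call like s.resize(91); still compiles but its result is discarded, so s keeps its old length. Existing code has to be rewritten as s = s.resize(91); to keep its behaviour.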

> [snip/]
> >
> > My point is that 'Unicode-functionality' should be separate from the
> string
> > implementation. This code
> > for(char32_t cp : codepoints(my_string));
> > should work with any type of my_string whose encoding is known.
>
> If you need just this, then why not use std::string as
> it is now for my_string and use any of the Unicode libraries
> around. What I would like is a string with which I *can* forget
> that there ever was anything like other encodings than Unicode
> except for those cases where it is completely impossible.
>
> And even in those cases, like when calling a OS API function
> I don't want to specify exactly what encoding I want but just
> to say: Give me a representation (or "view" if you like) of the string
> that is in the "native" encoding of the currently selected locale
> for the desired character type.
>

Let me try to explain myself in other words. I propose the iterator_range
idea as a means by which you (we) achieve our goal. It's more like a C++0x
concept. By "my_string whose encoding is known" I meant that strings like
u8string should map to string_ranges with typename encoding == utf_8 (for
example). As a result you *won't* need to specify the exact encoding because
it will be deduced from the context. The only place you will write the
encoding explicitly is at the boundaries of your code and legacy APIs.
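
A rough sketch of how the encoding could be deduced from the string type, assuming a traits class. The names encoding_of, encoding_name and u16string_like are made up for this illustration, and u8string is a stand-in struct, not a real library type.

```cpp
#include <string>

// Encoding tags (tag names follow the thread; the definitions are assumptions).
struct utf_8  { static const char* name() { return "utf-8"; } };
struct utf_16 { static const char* name() { return "utf-16"; } };

// Hypothetical string types whose encoding is part of the type.
struct u8string       { std::string bytes; };
struct u16string_like { std::u16string units; };

// Maps a string type to its encoding. The primary template is left
// undefined, so a type of unknown encoding (e.g. plain std::string)
// fails to compile instead of being guessed.
template <class String> struct encoding_of;
template <> struct encoding_of<u8string>       { typedef utf_8  type; };
template <> struct encoding_of<u16string_like> { typedef utf_16 type; };

// Generic code never names an encoding: it is deduced from the argument.
template <class String>
const char* encoding_name(const String&) {
    return encoding_of<String>::type::name();
}
```

Calling encoding_name with a u8string selects utf_8 without the caller writing it. Calling it with a std::string is a compile error because encoding_of<std::string> has no definition.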

Look at the code you provided:

> Something like this:
>
> whatever_the_string_class_name_will_be cmd = init();
> system(cmd.native<char>().c_str());
> ShellExecute(..., cmd.native<TCHAR>().c_str(), ...);
> ShellExecuteW(..., cmd.native<wchar_t>().c_str(), ...);
> wxExecute(cmd.native<wxChar>());
>
> or
>
> whatever_the_string_class_name_will_be caption = get_non_ascii_string();
> new wxFrame(parent, wxID_ANY, caption.native<wxChar>(), ...);
>

The ShellExecuteW, wxExecute and wxFrame calls are actually *more verbose
than they have to be*. wxString is documented to be utf16 encoded, as is
LPCWSTR on Windows. So, given a mapping from wxString to the
string_range concept, you could write it as:

wxExecute(cmd); // creates utf16 wxString
new wxFrame(parent, wxID_ANY, caption, ...); // creates utf16 wxString

As a result *less* code will be affected when switching to utf8.

> In many cases the above could be a no-op, depending on the
> *internal* encoding used by this string class. It could be
> UTF-8 by default and maybe UTF-16 on Windows.
>
> Specifying *exactly* (like with iso_8859_2_cp_tag, or utf32_cp_tag, ...)
> which encoding I want, should be done only when absolutely
> necessary and *not* every time when I want to do something
> with the string.
>
> Also, there should be iterators allowing you to do this, again
> without specifying what encoding you want exactly:
>

This is what I meant.

> // cp_begin returning a "code-point-iterator"
> auto i = str.cp_begin(), e = str.cp_end();
> if(i != e && *i == code_point(0x0123)) do_something();
>
> or even (if this is possible):
>
> // cr_begin returning a character iterator
> auto i = str.cr_begin(), e = str.cr_end();
> // if the first character is A with acute ...
> if(i != e && *i == unicode_character({0x0041, 0x0301}))
> do_something();
>

I prefer:
auto i = codepoints(str).begin(), e = codepoints(str).end();
auto i = characters(str).begin(), e = characters(str).end();

So
1) we can extend the syntax uniformly to words, sentences etc...
2) str may be of any type that maps to the string_range concept, whether it
is boost::string or (once a switch to utf8 occurs) std::string or a string
literal.
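
A minimal sketch of what a free codepoints() function could look like for UTF-8 data. It assumes the bytes are well-formed UTF-8 and does no validation. The class names codepoint_iterator and codepoint_range are hypothetical, and std::string is simply treated as UTF-8 here to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Forward iterator that decodes one code point per step.
// Assumption: the underlying bytes are well-formed UTF-8.
class codepoint_iterator {
public:
    explicit codepoint_iterator(const char* p) : p_(p) {}

    char32_t operator*() const {
        unsigned char b0 = static_cast<unsigned char>(p_[0]);
        std::size_t len = length(b0);
        if (len == 1) return b0;
        // Keep the payload bits of the lead byte, then append
        // 6 bits for every continuation byte.
        char32_t cp = b0 & (0x7F >> len);
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }
    codepoint_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    bool operator==(const codepoint_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const codepoint_iterator& o) const { return p_ != o.p_; }

private:
    // Sequence length derived from the lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }
    const char* p_;
};

// Non-owning range of code points over a byte sequence.
class codepoint_range {
public:
    codepoint_range(const char* b, const char* e) : b_(b), e_(e) {}
    codepoint_iterator begin() const { return codepoint_iterator(b_); }
    codepoint_iterator end() const { return codepoint_iterator(e_); }
private:
    const char* b_;
    const char* e_;
};

// The string must outlive the returned range.
inline codepoint_range codepoints(const std::string& s) {
    return codepoint_range(s.data(), s.data() + s.size());
}
```

Because codepoints() is a free function returning a range, it works in a range-based for loop, and a characters() or words() adapter could be added the same way without touching the string class.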

If str is not mapped to string_range then the programmer must specify the
encoding explicitly.
std::string str = "hi";
const char* str2 = exception.what();
auto i = codepoints(treat_as<utf_8>(str)).begin(); // no-copy, no-op, just a cast.
auto i = codepoints(treat_as<utf_8>(str2)).begin(); // works
auto i = codepoints(str).begin(); // error: string is of unknown encoding.
                                  // Compiles in 20 years when everyone uses utf8.

boost::string (whatever name) will be just an std::string mapped to
string_range in utf_8 encoding.
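
A sketch of how treat_as could attach an encoding tag without copying, under the assumption that string_range is a plain pointer pair. The member names first/last and the helper count_codepoints are invented for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

struct utf_8 {};  // encoding tag

// Non-owning view over code units, tagged with an encoding.
template <class Encoding>
struct string_range {
    typedef Encoding encoding;
    const char* first;
    const char* last;
};

// treat_as only attaches the tag; it neither copies nor converts bytes.
template <class Encoding>
string_range<Encoding> treat_as(const std::string& s) {
    string_range<Encoding> r = { s.data(), s.data() + s.size() };
    return r;
}

template <class Encoding>
string_range<Encoding> treat_as(const char* s) {
    string_range<Encoding> r = { s, s + std::strlen(s) };
    return r;
}

// Accepts only ranges tagged utf_8. A bare std::string does not
// convert to string_range<utf_8>, so passing one is a compile error.
inline std::size_t count_codepoints(string_range<utf_8> r) {
    std::size_t n = 0;
    for (const char* p = r.first; p != r.last; ++p)
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80)
            ++n;  // count every byte that is not a continuation byte
    return n;
}
```

The view returned by treat_as<utf_8>(str) points straight into str, so the string has to stay alive while the view is in use.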

[...]
>

-- 
Yakov
