Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-26 03:25:56


On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris
> <mikhailberis_at_[hidden]> wrote:
>>
>> Mostly I'm interested in seeing a string class that is:
>>
>> 1. Immutable. No if's or but's about it. I don't want a string to be
>> modifiable. Period. You can create it, and once it's created, that's
>> it.
>>
>> 2. Has real value semantics. This means, once you've copied it, that's
>> really copied. No funky copy-on-write reference-counting mumbo-jumbo.
>
> I also prefer nothing too fancy. But most of these things
> are implementation details, let us get the interface
> right first and focus on the optimizations afterwards.

Actually, it's not an implementation detail. Value semantics has
everything to do with the interface and not the implementation.

It's just that, at the time I was thinking about and writing this
reply, I really wanted something lightweight that allowed for
unbridled cross-thread access. My original assumption that
reference counting was a bad thing has since been corrected by others
in the ensuing threads.
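
To make the combination of immutability, value semantics and
cross-thread access concrete, here is a minimal sketch. The class name
immutable_string and all of its members are hypothetical and only for
illustration; nothing here is an existing Boost interface. It assumes
the buffer is shared through a reference-counted pointer to const
data, which is safe precisely because nobody can modify it:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type for illustration; not an existing Boost class.
class immutable_string {
public:
    immutable_string(const char* s)
        : data_(std::make_shared<std::string>(s)) {}
    explicit immutable_string(std::string s)
        : data_(std::make_shared<std::string>(std::move(s))) {}

    // Read-only observers; there are no mutating member functions.
    std::size_t size() const { return data_->size(); }
    const char* begin() const { return data_->data(); }
    const char* end() const { return data_->data() + data_->size(); }

    // True when both objects refer to the same underlying buffer.
    bool shares_buffer_with(const immutable_string& other) const {
        return data_ == other.data_;
    }

    friend bool operator==(const immutable_string& a,
                           const immutable_string& b) {
        return *a.data_ == *b.data_;
    }

private:
    // The characters are const after construction, so a copy can share
    // them and still behave like an independent value.
    std::shared_ptr<const std::string> data_;
};
```

Copying such an object only bumps a reference count, yet no copy can
ever observe a change made through another one, so separate copies
held by different threads can read the same buffer without locking.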

>>
>> 3. Has all the algorithms that apply to it defined externally.
>>
> [snip/]
>> Encoding is a matter of external interpretation and I think should not
>> be part of a string's interface. You can have wrappers that interpret
>> a string as a UTF-* string.
>
> I am all for a generalized-*string* class
> in the pedantic interpretation of the word
> i.e. a sequence of chars, char16_ts, bytes,
> octets, words, dwords, etc. without any enforced
> encoding for use-cases that call for it, but again,
>
> the reason why I participate in this whole discussion
> is because I think that C++ deserves also a class
> focused on the "everyday", *nice* and *convenient*
> handling of text, without having to worry about how
> do I need to "view" that raw-chunk-of-binary-data
> in this call to an OS API function and how
> do I have to "view" it in that other library call,
> explicitly specifying to which encoding I want
> to convert it using *ugly* :-) tag types, etc.
> (as much as this is possible).
>

But we already have these everyday nice and convenient text handling
algorithms in Boost.Algorithm's String_algo library.

As a matter of fact, *all* the implementations cited about dealing
with UTF-8 and UTF-16 have everything to do with wrapping raw data
into a view of it that (unfortunately) allows for mutating
transformations.

Note also that I wasn't even going into the generic point of strings
being a sequence of anything other than characters to be read. That's
a different topic that I don't want to get into at this time. But even
the pedantic definition of a string doesn't include mutability as an
intrinsic requirement.

> Another important concern for me is portability.
> I'd like (being very self-centered :-P) for example
> the following:
>
> boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) +
> code_point(0x0161/*s with caron*/);
> std::cout << s << std::endl;
>
> (everywhere where the terminal can handle it) to print:
> Matúš // hope your email client can handle that :)
>
> instead of:
> Mat$#@!%
> or completely upsetting the terminal.
>

A few things here:

1. This is totally fine with an immutable string implementation. I
don't see any mutations going on here.

2. A string class that "works correctly while immutable" allows for
dealing with arbitrary data interpreted as some chunk that is obtained
from a given source (as long as you know the length of the data, that
is).

3. String I/O can be defined independently of the string especially if
you're dealing with C++ streams. I don't see why the above would be a
problem with an immutable string implementation.

4. I don't see why a hypothetical boost::string implementation that is
immutable would have portability problems when it just deals with
immutable chunks of memory that can be viewed in a different manner
depending on the encoding you want, at the point where you need to
deal with a specific encoding.
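
As a rough illustration of "encoding as an external interpretation",
the sketch below decodes an immutable byte sequence as UTF-8 only at
the point where code points are needed. The function name decode_utf8
is made up for this example, and std::string merely stands in for the
hypothetical immutable string. It checks the byte structure but does
not reject overlong forms or surrogates, so treat it as a sketch and
not a conforming decoder:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical free function: interprets read-only bytes as UTF-8 and
// returns the decoded code points. The input is never modified.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c =
                static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the bytes "Mat\xC3\xBA\xC5\xA1" this yields five code points, the
last two being U+00FA and U+0161, while the stored data stays seven
untouched bytes.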

> Also, while I see that for example this
>> auto it = encoded<utf8_encoding>(original_string), end =
>> encoded<utf8_encoding>();
> is perfectly generic and well-designed
> for some use-cases the first reaction of
> the-average-joe-programmer-inside-me's
> when seeing it was, *yuck*. Sorry :-)
>

So you'd say yuck to any STL algorithm that dealt with iterators? Have
you used the Boost.Iterator library yet? Because then you'd be calling
all those chaining/wrapping operations "yucky" too. ;)

> Sometimes it is more important for the code
> and people writing/maintaining it to be nice
> and easy to understand than to be
> really-really-generic and smart.
> That said, it *is* perfectly valid if someone
> uses the generic version above. Let's do both.
>

But the problem there is that "nice" is really subjective. I
absolutely abhor code like this:

  boost::string s = "Foo";
  s.append("Bar").append("Baz");

when I can express it more succinctly, with fewer characters, like
this instead:

  boost::string s = "Foo" ^ "Bar" ^ "Baz";
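
A minimal sketch of how such a concatenation operator could be
written as a free function follows. The type istring and its members
are hypothetical stand-ins for the proposed boost::string. One caveat
that applies to any such design: C++ only considers an overloaded
operator when at least one operand has class or enumeration type, so
in this sketch the first operand has to be wrapped explicitly.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical immutable string; stands in for the proposed type.
class istring {
public:
    istring(const char* s) : data_(s) {}
    explicit istring(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }  // read-only access
private:
    std::string data_;
};

// Concatenation builds a new value and leaves both operands untouched.
inline istring operator^(const istring& lhs, const istring& rhs) {
    return istring(lhs.str() + rhs.str());
}

inline bool operator==(const istring& lhs, const istring& rhs) {
    return lhs.str() == rhs.str();
}
```

With these definitions, istring s = istring("Foo") ^ "Bar" ^ "Baz";
produces "FooBarBaz". Note that ^ binds more loosely than == and <<,
so expressions such as (a ^ b) == c need the parentheses.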

> The reason why I want to call it (std::)string
> is that many not-so-pedantic people would react
> to the question "What is your first thought when
> you hear 'string type'?" with "Some kind of type
> for handling text, eh?" and not with "Some kind
> of generalized sequence of elements without any
> intrinsic encoding having the following
> properties...". But if there is so much resistance
> to calling it that then I vote for (boost|std)::text
> (however this sounds a little awkward to me, I don't
> know why).
>

I think you're missing something here though.

The point of creating a new string implementation is so that you can
generalize a whole family of string-related algorithms around a
well-defined abstraction. In this case there's really no question that
a string of characters is used to represent "text" -- although it can
very well represent a lot of other things too. However you cut it,
though, the abstraction is borne out of algorithms that have something
to do with strings, like: concatenation, compression, ordering,
encoding, decoding, rendering, sub-string, parsing, lexical analysis,
search, etc.

These algorithms are applied to strings and there are a ton of
algorithms dealing with different kinds of strings.

Encoding (or interpreting) a string as UTF-8 is just one algorithm,
and it would be naive IMO to design a string implementation just
around the idea that every string needs an encoding defined, when the
algorithms that deal with strings are much more general in reality.
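
To illustrate what "algorithms defined externally" can look like in
practice, here is a small sketch of a substring search written as a
free function over iterator ranges. The name contains is made up for
this example; std::search, std::begin and std::end are the standard
library facilities. Because it only reads through iterators, it works
for any sequence type and needs no encoding and no mutability:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical free algorithm: true if `needle` occurs as a contiguous
// subsequence of `haystack`. Both arguments are only read.
// Pass containers, not string literals: a literal's array range would
// include the terminating '\0'.
template <typename Haystack, typename Needle>
bool contains(const Haystack& haystack, const Needle& needle) {
    return std::search(std::begin(haystack), std::end(haystack),
                       std::begin(needle), std::end(needle))
           != std::end(haystack);
}
```

The same function serves std::string, std::vector<char> or
std::u32string without being a member of any of them.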

> Let us keep the basic_string<CharT> as that
> generalized string (I never suggested to dump it,
> just that std::string would be an another type and
> not defined as typedef std::basic_string<char>).
>

Like I said though, I think we're talking at different levels.

I for one think that solving the std::string problem brings more to
the world than just solving the encoding problem. Bold statement I
know. ;)

Also, last time I checked, there are already a ton of Unicode-encoding
libraries out there; I don't see why there's a need for
yet-another-encoding-library for character strings. This is why I
like the way Boost.Locale is handling it: it conveys that the
library is about providing a common interface into which different
back-ends can be plugged. If Boost.Locale dealt with iterators, then I
think a string library that is better than std::string in more ways
than one would give us a good way of tackling the cross-platform
string encoding issue. But I stress again: I think C++ needs a string
implementation that is better than the standard one.

> Regarding #1 above and the following ...
>> x = "Hello,";
>> x = x ^ " World!";
>
> ... would you be against, if the interface in addition also
> included a few convenience/backward compatibility
> member functions like ...
>
> string& append(const string& s)
> {
>        *this = *this ^ s;
>        return *this;
> }
>
> string& prepend(const string& s)
> {
>        *this = s ^ *this;
>        return *this;
> }
>
> ... etc? For the same reasons as above: clarity,
> simplicity (it may not be obvious what a fancy
> operator expression does, it is more obvious
> when using names like append, prepend, ...) and
> people are used to that programming style.
>

I think this is a slippery slope though. If we make the boost::string
look like something that is mutable without it being really mutable,
then you have a disconnect between the interface and the semantics you
want to convey.

Having member functions like 'append' and 'prepend' makes you think
that you're modifying the string when in fact you're really building
another string. I've already pointed out that string construction can
very well be handled by the string streams so I don't think we want to
encourage people to think of strings as state-ful objects with mutable
semantics because that's not the original intention of the string.

The fact that users are building a new string instead of "modifying
an existing string" *should* be conveyed in the interface. This is
largely an issue of documentation though.

The short answer to your question would be "yes, I am opposed to
having member functions similar to what you have pointed out above".
:)
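
For what it's worth, here is a rough sketch of the "build, don't
modify" style using a string stream. The names istring and make_string
are hypothetical and only illustrate the idea; std::ostringstream is
the only real library facility used. All mutation happens inside the
builder, and what comes out is a finished, immutable value:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical immutable string; stands in for the proposed type.
class istring {
public:
    explicit istring(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }  // read-only access
private:
    std::string data_;
};

// Hypothetical builder: streams every part into a temporary buffer and
// returns a new immutable string. No existing string is modified.
template <typename... Parts>
istring make_string(const Parts&... parts) {
    std::ostringstream out;
    // C++11 pack expansion: stream each part in order.
    int expand[] = {0, ((out << parts), 0)...};
    (void)expand;
    return istring(out.str());
}
```

A call such as make_string("Hello,", " World", '!') reads as
construction of a new value, which is the semantics the interface is
meant to convey.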

> BR,

Thanks for taking the time and I hope this helps!

-- 
Dean Michael Berris
about.me/deanberris
