Subject: Re: [boost] [string] proposal
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-29 04:24:35


> From: Dean Michael Berris <mikhailberis_at_[hidden]>
>
> On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> >>
> >> > 1. "Contiguity"
> >> >
> >> > Continuity and c_str() is one of the most important
> >> > properties of C++ string (that is BTW required by C++0x)
> >>
> >> Eliminating c_str() doesn't mean there's no easy way to produce a
> >> contiguous NTBS.
> >>
> >
> > Yes, just it can't be really "char const *c_str() CONST" or would require
> > extra stuff like linearization.
> >
> > It would turn away 90% of users.
> >
>
> It might turn you away because you obviously love std::string.
> Generalizing is a different matter and is largely a hot-air blowing
> exercise that is futile for convincing anybody.
>

Let me put it more clearly:

1. All users who use C libraries and need c_str() at the boundaries.
   That is a huge number of users who need to communicate with
   modules that already work and are ready to use but are written in C.

   That is about half of the libraries out there, since C is the
   lowest-level API that allows easy bindings to all languages.

2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC, as they
   require conversion of "boost::what_ever_is_it_called" to QString,
   ustring, wxString, CString, and that is done via a C string.

3. All users who actually use operating system APIs that take C strings
   and require char *.

Plenty, isn't it?
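
To make the boundary case concrete, here is a minimal sketch of handing
a std::string to a C function. The helper name parse_long is
hypothetical; std::strtol is just one example of a C API that needs a
NUL-terminated const char *. Note that c_str() is called on a const
string and no linearization step is needed:

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from any library): parse a decimal number
// by passing the string's contiguous, NUL-terminated storage to C.
long parse_long(const std::string& s, long fallback) {
    const char* begin = s.c_str();   // works on a const string
    char* end = nullptr;
    errno = 0;
    const long value = std::strtol(begin, &end, 10);
    // Reject empty input, trailing garbage and out-of-range values.
    if (end == begin || *end != '\0' || errno == ERANGE) {
        return fallback;
    }
    return value;
}
```

With a non-contiguous string type the first line of the function body
would have to copy or flatten the data before the C call could be made.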

Please take a look at the common cases of string usage and you'll
see how often you really need a rope-like structure and how
often a normal string.

Don't forget that almost all string implementations in all languages
are contiguous single memory chunks.

> >> > 3. non-uniform-memory-architecture
> >> >
> >> > Give me a break... Who uses NUMA for string processing?!
> >>
> > 2. In such case it would be even better to have non-shared
> > strings
> >
>
> Weh?
>

Because of memory locality: think of part of a string that references
"other memory".

> >> > 4. About string builder. Most languages require is as they
> >> > don't have "reserve" also if you want efficient
> >> > string builder use std::ostream with nice stream buffer.
> >>
> >> There's nothing efficient about std::ostream, no matter what buffer
> >> you put on it.
> >>
> >
> > I beg your pardon? It is efficient as all functions
> > are as efficient as memcpy with exceptions of overflow/underflow
> > happens which require some virtual functions calls
> > which are pretty fast as well...
> >
> > Also 99% of issues are just solved with reserve.
> > (and I work with text parsing, combining and processing a lot)
> >
>
> And you obviously don't work with systems that have to do this
> multiple thousand times in one second to not know what the effects of
> NUMA are and why allocating a contiguous amount of memory is the
> performance killer that it is.
>

I know, but I never suggested that a streambuf should use a single
memory chunk.
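
As a rough sketch of what I mean (the class name chunked_buf and the
tiny default chunk size are made up for illustration; this is not an
existing library class), a stream buffer can keep its output in
fixed-size chunks and only produce a contiguous string when somebody
asks for one:

```cpp
#include <cstddef>
#include <memory>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical stream buffer that stores output in fixed-size chunks
// instead of one contiguous block. Assumes chunk_size > 0.
class chunked_buf : public std::streambuf {
public:
    explicit chunked_buf(std::size_t chunk_size = 16)
        : chunk_size_(chunk_size) {
        add_chunk();
    }

    std::size_t chunk_count() const { return chunks_.size(); }

    // Linearize on demand only, e.g. at a C API boundary.
    std::string str() const {
        std::string out;
        out.reserve(chunks_.size() * chunk_size_);
        // Every chunk except the last one is completely full.
        for (std::size_t i = 0; i + 1 < chunks_.size(); ++i) {
            out.append(chunks_[i].get(), chunk_size_);
        }
        out.append(pbase(), pptr());  // the partially filled last chunk
        return out;
    }

protected:
    // Called only when the current chunk is full: start a new chunk
    // rather than reallocating and copying the old data.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof())) {
            return traits_type::not_eof(ch);
        }
        add_chunk();
        *pptr() = traits_type::to_char_type(ch);
        pbump(1);
        return ch;
    }

private:
    void add_chunk() {
        chunks_.push_back(std::make_unique<char[]>(chunk_size_));
        char* p = chunks_.back().get();
        setp(p, p + chunk_size_);
    }

    std::size_t chunk_size_;
    std::vector<std::unique_ptr<char[]>> chunks_;
};
```

Wrapping it as "chunked_buf buf(4); std::ostream os(&buf);" and writing
to os fills one chunk after another; the virtual overflow() call
happens once per chunk, and everything in between is a plain copy into
the current chunk.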

> >> This, I believe, is a persistent misunderstanding. IIUC, Dean is only
> >> suggesting to avoid giving UTF-8 any special status in the string's
> >> interface. He's not arguing against using UTF-8 storage in the
> >> implementation.
> >>
> >
> > The entire "buzz" started with the fact that under windows
> > we have problems with string encoding not being UTF-8
> >
> > [snip]
> >
> > This article written from wrong understanding of real
> > problems - instead of solving a problem it suggests
> > some idea for some cases not looking to the problem
> > in hole.
> >
>
> [snip]
>
> The article was written from the understanding that the real problem
> stems from how std::string is broken. It already identifies why it's
> broken. It seems that you're just happy to attack people and the work
> they do more than you are interested in solving problems.
>
> If you disagree with what's being said argue on the merits of "why".
> Mud-slinging and sitting on a high horse and just saying "blech,
> you're wrong" is not helping solve any technical problems.
>

I'm sorry, but I think the much more real problems are these:

- My father-in-law can't use Thunderbird because he chose a non-ASCII
  user name and Thunderbird fails to open the profile. So he needs to
  create a new account, because half of the other programs are broken
  when Unicode paths are used!

- Acrobat Reader can't open files with Unicode file names that
  users have (at least that was the case the last time I tried it).

- You can't write cross-platform Unicode-aware code using
  simple std::string/char * or whatever encoding-unaware
  string there is.

What you are doing is a classic example of micro-optimization
concentrated on string storage.

Note: the strings in all the toolkits around (QString, ustring,
UnicodeString, wxString) don't do much beyond what std::string
does in terms of storage. They all use the same storage model;
some are immutable, which I can accept, but all of them:

1. Are Unicode aware
2. Use single memory chunk

There is a very good reason for this, but it seems
that you just don't get why all string designs are
built around this same principle.

Take a look at these fundamental operations you wrote about:

- Concatenation (generally ok)
- Substring - should be Unicode aware in most cases
- Filtration - should be Unicode and Locale aware
- Tokenization - should be Unicode and Locale aware
- Search/Pattern Matching - should be Unicode and Locale aware.
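
To illustrate just the substring point, here is a minimal sketch. It
assumes valid UTF-8 input; the helper names is_continuation and
utf8_prefix are hypothetical, and counting code points is still only an
approximation of what a user sees as a character. A byte-oriented
substr can cut a multi-byte sequence in half, while a code-point-aware
prefix cannot:

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: return the first n code points of s.
// Assumes s holds valid UTF-8; no validation is performed.
std::string utf8_prefix(const std::string& s, std::size_t n) {
    std::size_t i = 0;     // byte index
    std::size_t seen = 0;  // code points started so far
    while (i < s.size()) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == n) {
                break;  // the next code point would be one too many
            }
            ++seen;
        }
        ++i;
    }
    return s.substr(0, i);
}
```

For the five-character text "na\xC3\xAFve", s.substr(0, 3) returns
"na\xC3", which ends in half of a two-byte sequence, while
utf8_prefix(s, 3) returns "na\xC3\xAF".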

So please, if you don't understand why these operations
are fundamental and why the string should be tied to an
encoding, then you need to reread this thread.

A new string that does not solve any of the Unicode issues
has no place, and this is the *real* problem.

Please don't write theories about C++ strings if
you do not see what a string is: text, human-readable
text, which is much more complex than a set of
byte chunks.

Artyom

      

