Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-29 06:32:00


On Sat, Jan 29, 2011 at 5:24 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> -
>> From: Dean Michael Berris <mikhailberis_at_[hidden]>
>>
>> On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk_at_[hidden]> wrote:
>> >
>> > It would turn away 90% of users.
>> >
>>
>> It might turn you away because you obviously love std::string.
>> Generalizing is a different matter and is largely a hot-air blowing
>> exercise that is futile for convincing anybody.
>>
>
> I would say it more clear:
>
> 1. All users that use C libraries and need c_str() at boundaries
>   And this is a huge amount of users that need to communicate
>   with modules that are already working and ready but written in C.
>
>   And this is about of half of libraries there is C is the lowest
>   level API that allows easy bindings to all languages.
>

But c_str() doesn't have to be part of the string's interface.

> 2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC as they require
>   conversion of "boost::what_ever_is_it_called" to QString, ustring,
>   wxString, CString, and it is done via C string.
>

So, what was the point again?

> 3. All users who actually use Operating System API that uses C strings
>   and require char *.
>
> Plenty? Isn't it?
>

I know, so what is your point?

> Please take a look on frequent cases of string usage and you'll
> see how much do you indeed need rope like structure and how
> much normal string.
>
> Don't forget that almost all string implementations in all languages
> are continuous single memory chunks.
>

So what if all other string implementations in all languages are
contiguous? Does that mean that's the *only* way to do it?

Look, in my paper -- if you read and *understood* it -- I pointed out
that linearizing a string is an algorithm that deals with a string.
Much like how std::copy is an algorithm that is external to a
container, I see linearization as something not part of the string
interface. That was towards the end part. I never said that a string
shouldn't be linearizable.
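
To make the std::copy analogy concrete, here is a minimal sketch of
linearization as a free function over a segmented string. It is not taken
from the paper: `chunked_string` and `linearize` are hypothetical names, and
the chunk layout (a vector of shared, immutable std::string blocks) is only
an assumption chosen to keep the example short.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical segmented string: an ordered list of immutable, shared chunks.
class chunked_string {
public:
    using chunk = std::shared_ptr<const std::string>;

    chunked_string() = default;
    explicit chunked_string(std::string s) {
        if (!s.empty())
            chunks_.push_back(std::make_shared<const std::string>(std::move(s)));
    }

    const std::vector<chunk>& chunks() const { return chunks_; }

    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& c : chunks_) n += c->size();
        return n;
    }

    // Concatenation shares the existing chunks; no characters are copied.
    friend chunked_string operator+(const chunked_string& a,
                                    const chunked_string& b) {
        chunked_string r;
        r.chunks_ = a.chunks_;
        r.chunks_.insert(r.chunks_.end(), b.chunks_.begin(), b.chunks_.end());
        return r;
    }

private:
    std::vector<chunk> chunks_;
};

// Linearization lives outside the class, the way std::copy lives outside
// containers. Only callers that need a contiguous buffer pay for the copy.
std::string linearize(const chunked_string& s) {
    std::string out;
    out.reserve(s.size());
    for (const auto& c : s.chunks()) out += *c;
    return out;
}
```

Under these assumptions, a caller at a C boundary would write something like
`some_c_api(linearize(s).c_str())`, and code that never crosses such a
boundary never needs the contiguous copy.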

>> > 2. In such case it would be even better to have non-shared
>> >   strings
>> >
>>
>> Weh?
>>
>
> Because of memory locality, think of part of string references
> to "other memory"
>

Memory locality is handled by the cache. If you have a contiguous chunk
of 4 KB *that never ever changes*, then accessing that memory from all
the cores in a NUMA machine is largely a matter of each cache reading
part of it and making it available locally. Making copies of the string
is *unnecessarily wasteful*.
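
As a rough illustration of why copies are unnecessary when the data never
changes, here is a minimal sketch of an immutable string handle. The name
`shared_text` and its members are hypothetical, and reference counting via
std::shared_ptr is just one assumed way to share the buffer.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string handle. Copies and substrings refer to the
// same underlying buffer, so no character data is duplicated.
class shared_text {
public:
    explicit shared_text(std::string s)
        : buf_(std::make_shared<const std::string>(std::move(s))),
          off_(0), len_(buf_->size()) {}

    // A substring is a new (offset, length) window onto the same buffer.
    shared_text substr(std::size_t pos, std::size_t n) const {
        shared_text r(*this);
        if (pos > len_) pos = len_;
        if (n > len_ - pos) n = len_ - pos;
        r.off_ = off_ + pos;
        r.len_ = n;
        return r;
    }

    const char* data() const { return buf_->data() + off_; }
    std::size_t size() const { return len_; }

    // Number of handles currently sharing the buffer.
    long owners() const { return buf_.use_count(); }

private:
    std::shared_ptr<const std::string> buf_;
    std::size_t off_, len_;
};
```

Because the buffer is const and never reallocated, every handle can be read
concurrently without synchronizing on the character data itself.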

>> >
>> > I beg your pardon? It is efficient as all functions
>> > are as efficient as memcpy with exceptions of overflow/underflow
>> > happens which require some virtual functions calls
>> > which are pretty fast as well...
>> >
>> > Also 99% of issues are just solved with reserve.
>> > (and I work with text parsing, combining and processing a lot)
>> >
>>
>> And you obviously don't work with systems that have to do this
>> multiple thousand times in one second to not know what the effects of
>> NUMA are and why allocating a contiguous amount of memory is the
>> performance killer that it is.
>>
>
> I know, but I hadn't suggested that streambuf should use single memory
> chunk.
>

So then what's the point of making strings use a single contiguous
memory chunk if it's not necessary?

>> >
>> > This article written from wrong understanding of real
>> > problems - instead of solving a problem it suggests
>> > some idea for some cases not looking to the problem
>> > in hole.
>> >
>>
>> [snip]
>>
>> The article was written from the understanding that the real problem
>> stems from how std::string is broken. It already identifies why it's
>> broken. It seems that you're just happy to attack people and the work
>> they do more than you are interested in solving problems.
>>
>> If you disagree with what's being said argue on the merits of "why".
>> Mud-slinging and sitting on a high horse and just saying "blech,
>> you're wrong" is not helping solve any technical problems.
>>
>
>
> I'm sorry but I think that much more real problem is that:
>
> - My father in law can't use Thunderbird because he defined non-ascii
>  user name and Thunderbird fails to open the profile. So he needs to
>  create a new account because half of the other programs are broken
>  when Unicode paths are used!
>

So fix Thunderbird.

> - That acrobat reader can't open files with Unicode file names that
>  user have (at least it was last time I've tried it)
>

So go work for Adobe and fix Acrobat Reader.

> - That you can't write cross platform Unicode aware code using
>  simple std::string/char * or what even encoding non-aware
>  string there.
>

You can, there are already libraries for that sort of thing if you
insist on using std::string.

> What you are doing is classical example if micro-optimization
> that is concentrated on string storage.
>

No. If you really did read the document and understood it, I was
talking at a high level about how to fix the problem by going through
the rationale for immutable strings. The storage issue is a necessary
component of the implementation efficiency concerns.

If you design something without thinking about the efficiency of the
solution, then you're doing art not engineering. I'm not an artist and
I sure want to think I'm an engineer by training and by trade.

Notice that I haven't mentioned any micro-optimizations or
micro-benchmarks in the document either, so I don't know where you're
coming from when you say I'm micro-optimizing anything.

> Note: all strings in all toolkits around don't do much beyond
> what std::string does in terms of storage (QString, ustring,
> UnicodeString, wxString) all use same storage model, some
> immutable that I can accept it but all:
>
> 1. Are Unicode aware
> 2. Use single memory chunk
>
> There is a very good reason for this but it seems
> that you just to get it why all strings designs
> around this same principle.
>

I do know why it's designed the same way: because that's the naive
thing to do. Someone thought "oh well, we can malloc a chunk of memory
and put a \0 at the end and call that a string". That worked up to a
certain level, and then when people started to look at a better way of
doing things, they saw that this isn't enough. Notice how the
applications you mention use a segmented data structure when dealing
with things like edit buffers or similar things. Strings are largely
reserved for "short" data and anytime you need anything "longer" you'd
use something else -- the question I'm trying to address is why can't
you use one data structure that will be efficient for both cases?
That's the point of the title which aims to come up with a singular
way of explaining how strings should be so that they're suitable for
short and long "strings of characters".

> Take a look on these fundamental operations you had written:
>
> - Concatenation (generally ok)
> - Substring - should be Unicode aware in most of cases
> - Filtration - should be Unicode and Locale aware
> - Tokenization - should be Unicode and Locale aware
> - Search/Pattern Matching - should be Unicode and Locale aware.
>
> So please if you don't understand why these fundamental
> operations and why the string should relate to encoding
> then you need to reread this thread.
>

But these are *algorithms* that should be aware of the encoding, *not
the string*. If you don't understand that point then you need to read
the document *again*.

The point is that if you view a string a given way, then that's largely
an implementation of the view. I haven't gotten to the explanation of the
view but I hinted that interpretation is a matter of composition. So
if you "wrap" a string and say that it should be interpreted one way,
then that's the whole point of enforcing an encoding on the view of
the string. The string itself doesn't *have* to be encoded a certain
way.
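
A minimal sketch of that kind of wrapping, under stated assumptions: the
string is just bytes, and a separate view decides to interpret them as
UTF-8. The names `byte_string` and `utf8_view` are hypothetical, and the
decoder assumes well-formed input; a real one would have to validate.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The string itself: plain bytes with no encoding attached.
using byte_string = std::string;

// Hypothetical view that interprets the underlying bytes as UTF-8.
// The view does not own the string; the string must outlive the view.
class utf8_view {
public:
    explicit utf8_view(const byte_string& s) : s_(s) {}

    // Decode every code point. Assumes well-formed UTF-8 (no validation).
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < s_.size()) {
            unsigned char b = static_cast<unsigned char>(s_[i]);
            // Sequence length is determined by the lead byte.
            std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
            // Each continuation byte contributes six more bits.
            for (std::size_t k = 1; k < len && i + k < s_.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s_[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

    // Length in code points, as opposed to byte_string::size() in bytes.
    std::size_t length() const { return code_points().size(); }

private:
    const byte_string& s_;
};
```

With this split, `s.size()` answers "how many bytes" and
`utf8_view(s).length()` answers "how many code points", and a different
view could interpret the same bytes another way without touching the
string type.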

Now if strings are just values then the encoding in which they come
is largely a matter of implementation. Think of an int -- you don't
really know if it's big- or little-endian -- or a float -- whether it's
IEEE xxx or yyy. All that matters though is how the operations on
these strings are defined. The point of the abstraction is that you
have one way of dealing with the string as a value and write
algorithms around that abstraction.

> New string that does not solve any of Unicode issues
> has no place - and this is *real* problem.
>

You missed the point. It's not the string you want, it's the view of
the string you want when you're talking about encoding.

> Please don't write theories on C++ string if
> you do not see what string is - text human
> readable text that is much more complex
> then set of byte chunks.
>

See, I defined what a string is in that document. If you don't agree
with that definition then I can't help you. As much as humans want to
think that computers see the world the same way, unfortunately that's
not the case.

A string is a data structure. How you view a string in a given
encoding is a matter of algorithm. If you don't see that then I'm
sorry for you.

-- 
Dean Michael Berris
about.me/deanberris
