
Subject: Re: [boost] [string] proposal
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-30 03:27:18


> From: Dean Michael Berris <mikhailberis_at_[hidden]>
>
> On Sat, Jan 29, 2011 at 11:25 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> >> From: Dean Michael Berris <mikhailberis_at_[hidden]>
> >> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> >>
> >> No, it's not obvious. Here's why:
> >>
> >> fd = creat(file.c_str(), 0666);
> >>
> >> What does c_str() here imply? It implies that there's a buffer
> >> somewhere that is a `char const *` which is either created and
> >> returned and then held internally by whatever 'file' is.
> >
> > It implies that const file owns const buffer that holds null terminated
> > string that can be passed to "char const *" API.
> >
>
> Yes, which is the problem in the first place. Every instance of string
> would then need to have that same buffer even if that string is just a
> temporary or worse just a copy.
>

That would not happen if the string holds its data as a single linear chunk :-)
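
To illustrate the point, here is a minimal sketch. The chunked_string
type and demo_contiguous_vs_chunked() are hypothetical names made up
for this example (they are not from any proposal in this thread), and
the equality of c_str() and data() assumes C++11 or later:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical chunked string: the text is kept as separate pieces.
struct chunked_string {
    std::vector<std::string> chunks;

    // A "char const *" API needs one contiguous, NUL-terminated buffer,
    // so the pieces have to be copied together first.
    std::string linearize() const {
        std::string out;
        for (const std::string& c : chunks) out += c;
        return out;
    }
};

inline void demo_contiguous_vs_chunked() {
    const std::string flat = "/tmp/file.txt";
    // Contiguous storage: c_str() exposes the buffer the string already owns.
    assert(flat.c_str() == flat.data());

    chunked_string pieces;
    pieces.chunks.push_back("/tmp/");
    pieces.chunks.push_back("file");
    pieces.chunks.push_back(".txt");
    // Chunked storage: an extra allocation and copy before the C API call.
    const std::string joined = pieces.linearize();
    assert(std::strcmp(joined.c_str(), flat.c_str()) == 0);
}
```

With the contiguous string no second buffer exists at all; with the
chunked one the linear buffer only exists when somebody asks for it.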

> >> Now let's say
> >> what if file changed in a different thread, what happens to the buffer
> >> pointed to by the pointer returned by c_str()? Explain to me that
> >> because *it is possible and it can happen in real code*.
> >
> > I'm sorry but string as anything else has value semantics
> > that is:
> >
> > - safe for "const" access from multiple threads
> > - safe for mutable access from single thread
> >
> > I don't see why string should be different from
> > any other value type like "int" because following
> >
> > x+=y + y
> >
> > is not safe for integer as well.
> >
> > The code I showed is **perfectly** safe with a
> > string that has value semantics (which std::string has)
> >
>
> Now the point I was making was, in the case of a string that is
> immutable, you don't worry about the string changing *ever*, don't
> need a contiguous buffer for something that isn't explicitly required.
> It's value semantics *plus* immutability.
>

So either you should forbid

    your_string const &operator=(your_string const &)

which makes it even more useless (IMHO),

or your statement is wrong, because assignment
can happen, for example from another thread:

   str = str + " suffix"

And this would change the string at run time.

Your statement is false unless I am missing
something, or you want to put a mutex or some
other atomic variable inside the string.

I don't see any reason not to treat
a string like any other value.
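
A small sketch of what I mean. The immutable_string class below is a
hypothetical stand-in invented for this example (it is not the type
under discussion): its buffer is never modified after construction,
yet the variable itself can still be rebound by assignment:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: the character buffer never changes
// after construction, but the handle is still copyable and assignable.
class immutable_string {
public:
    explicit immutable_string(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    const std::string& str() const { return *data_; }

    // Concatenation builds a new buffer; neither operand is modified.
    friend immutable_string operator+(const immutable_string& a,
                                      const std::string& b) {
        return immutable_string(a.str() + b);
    }

private:
    std::shared_ptr<const std::string> data_;
};

// Returns true if the value seen through `str` changed after assignment.
inline bool assignment_changes_value() {
    immutable_string str("name");
    const immutable_string snapshot = str;  // shares the same buffer
    str = str + " suffix";                  // rebinds str to a new buffer
    assert(snapshot.str() == "name");       // the old buffer is untouched
    return str.str() == "name suffix";
}
```

So the buffer is immutable, but any code that reads the variable str
while another thread assigns to it still needs synchronization, exactly
as it would for an int.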

> >
> > I think we both and 95% of C++ programmers that use STL
> > know what is the semantics of std::string::c_str()
> >
> >> I like this better:
> >>
> >> char * filename = (char *)malloc(256 * sizeof(char)); // I know I
> >> want 255 characters max, plus the terminating NUL
> >> if (filename == NULL) {
> >> // deal with the error here
> >> }
> >> linearize(substr(file, 0, 255), filename);
> >> fd = creat(filename, 0666);
> >>
> >
> > Sorry? Is this better than:
> >
> > fd=create(filename.substr(0,256).c_str(),O_EXCL...)
> >
> > Which by the way is 100% thread safe as well (but still may throw).
> > Even though I can't see any reason to cut 256 bytes before create
> >
>
> No, not better because std::string's substr() will return a temporary,
> which means it will be a copy -- meaning another allocation and a call
> to memcpy(...). You don't get the benefit of COW on this one because
> you need to cut the string down to a "maximum size".
>

Actually, a small note: you will have to copy anyway, because the C API
expects a NUL-terminated string, so you can't avoid a memory
copy if the byte following the prefix is not NUL.

So there is no difference there.
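
Roughly, both variants end up doing the same work. The helper names
below (prefix_via_substr, prefix_via_buffer, demo_prefix_copy) are made
up for this sketch, and std::vector stands in for the hand-managed
malloc buffer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Variant A: substr() returns a temporary that owns a NUL-terminated
// copy of the first n characters.
inline std::string prefix_via_substr(const std::string& s, std::size_t n) {
    return s.substr(0, n);
}

// Variant B: copy the first n characters into a separate buffer and
// leave room for the terminator.
inline std::vector<char> prefix_via_buffer(const std::string& s, std::size_t n) {
    const std::size_t len = std::min(n, s.size());
    std::vector<char> buf(len + 1);           // value-initialized, so all '\0'
    std::memcpy(buf.data(), s.data(), len);   // the copy you cannot avoid
    return buf;
}

inline void demo_prefix_copy() {
    const std::string file = "/very/long/path/name";
    const std::string a = prefix_via_substr(file, 5);
    const std::vector<char> b = prefix_via_buffer(file, 5);
    // Both hold their own NUL-terminated copy of "/very".
    assert(std::strcmp(a.c_str(), "/very") == 0);
    assert(std::strcmp(b.data(), "/very") == 0);
    // Neither result aliases the original buffer.
    assert(a.c_str() != file.c_str());
}
```

Either way there is one allocation and one copy before the pointer can
be handed to the C API.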

> I don't know if you know, but them C APIs from OSes have a defined
> maximum on the lengths of filenames and things like that...
>

Not exactly: the maximal length is something much more complicated,
determined at run time and OS dependent, but that is another story.

> >> And then the problem is not addressed then of the unnecessary contiguous
> >>buffer.
> >>
> >
> > It is a good idea to have some non-linear data storage, but
> > it should be used only in very specific cases.
> >
> > Also, what is a really large string for you, one that would have
> > a performance advantage from not being stored linearly?
> >
> > Talk to me in numbers?
> >
>
> Say on a machine+OS combo that has 4kb pages, a large string would be
> something that spans more than one memory page -- i.e. >4kb.
>
> Now a "short" string is one that can fit within a page. For
> concatenated strings, it's fine to compact/copy the substrings into a
> growable/shrinkable block. The overhead for a short string would be
> constant, as the concatenation tree would be a pointer and a length
> with a reference count integer.

Actually, I meant benchmarks: how long should a string be before it
becomes more efficient to store it as chunks?

So you suggest 4k, which:

1. Covers almost all full file path names on every operating system
2. Covers almost all messages in text dialogs
3. Covers all possible user data, etc.

So basically a single memory chunk is efficient for 99% of use
cases.

Now let's take it to the extreme:

- Longest Wikipedia article: 388k = ~ 97 pages
- Book: War and Peace: text size... 3.1MB = 800 pages.

How frequent is this case? Very rare.
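
The number of 4k pages that a text of a given size spans can be
estimated with a sketch like the one below. The function names are
made up for this example, the sizes are read as KiB, and 3.1MB is
rounded to 3174 KiB, which is my own assumption:

```cpp
#include <cassert>
#include <cstddef>

// Number of pages needed to hold `bytes` bytes, rounded up.
constexpr std::size_t pages_spanned(std::size_t bytes,
                                    std::size_t page_size = 4096) {
    return (bytes + page_size - 1) / page_size;
}

inline void demo_pages() {
    assert(pages_spanned(255) == 1);           // a typical file path
    assert(pages_spanned(4096) == 1);          // exactly one full page
    assert(pages_spanned(4097) == 2);          // one byte over
    assert(pages_spanned(388 * 1024) == 97);   // 388 KiB article
    assert(pages_spanned(3174 * 1024) == 794); // about 3.1 MiB of text
}
```

Anything up to 4096 bytes stays within one page; only texts of the
second kind ever span hundreds of them.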

See, text is basically something quite short;
long books were successfully written in the days
when 640K was more than enough. So the real
benefits of a non-linear data structure are rare;
on the other hand, it adds too much complexity
for everyday use.

So basically:

1. 99% of use cases fit in a single page.
2. Very few cases would actually make use of a multi-page
   architecture.
3. The extreme cases that may benefit from this data
   structure are very rare and probably should use their
   own structures.

Just to clear things up:

1. I do think that what you suggest is an interesting
   and fine data structure.
2. I do think that in certain cases it would be very
   useful.
3. I do think that you have good experience with situations
   where such a structure may be very useful.
4. I know your work (netlib) and I really appreciate
   what you do.

However, I still think:

1. Such a structure does not give much benefit in
   mainstream cases for a string data structure in its
   text meaning (which is what a string is for 99% of programmers).

   I mean non-linear memory is not really that
   useful for the common use case.

2. I think you should take a look at the major string
   use cases to decide what is better for the
   "next C++ string" if you want to develop one.

Also, as you probably know, I'm the author of several
projects, most of which are strongly tied to text
processing and handling:

1. Boost.Locale - strongly text and Unicode
   oriented.

2. CppCMS - a C++ web framework that deals with
   strings and networking in most of its code.

   That, by the way, was the reason for Boost.Locale
   to be developed.

3. BidiTeX - bidirectional support for LaTeX/Hebrew,
   which mostly deals with text.

So I do have some basic view of what the use
cases of strings are. I don't say I know them all,
but I have developed some "feeling" for what text
processing needs, and in my opinion you
are just missing the major use case.

I think I'll stop trying to convince you
that it is the wrong way to look at strings,
simply because I think that users will
know what to pick in real applications.

Best Regards and Good Luck,
  Artyom

      

