Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-29 11:03:28


On Sat, Jan 29, 2011 at 11:25 PM, Artyom <artyomtnk_at_[hidden]> wrote:
>> From: Dean Michael Berris <mikhailberis_at_[hidden]>
>> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk_at_[hidden]>  wrote:
>>
>> No, it's not obvious. Here's why:
>>
>>   fd =  creat(file.c_str(), 0666);
>>
>> What does c_str() here imply? It implies that  there's a buffer
>> somewhere that is a `char const *` which is either created  and
>> returned and then held internally by whatever 'file' is.
>
> It implies that const file owns const buffer that holds null terminated
> string that can be passed to "char const *" API.
>

Yes, which is the problem in the first place. Every instance of string
would then need to have that same buffer even if that string is just a
temporary or worse just a copy.

>
>> Now let's  say
>> what if file changed in a different thread, what happens to the  buffer
>> pointed to by the pointer returned by c_str()? Explain to me  that
>> because *it is possible and it can happen in real code*.
>
> I'm sorry but string as anything else has value semantics
> that is:
>
>   - safe for "const" access from multiple threads
>   - safe for mutable access from single thread
>
> I don't see why string should be different from
> any other value type like "int" because following
>
>  x+=y + y
>
> is not safe for integer as well.
>
> The code I showed is **perfectly** safe when
> string has value semantics (which std::string has)
>

Now the point I was making was that with an immutable string you never
have to worry about the string changing, and you don't need a
contiguous buffer unless something explicitly requires one.
It's value semantics *plus* immutability.

std::string *is* mutable, which means the pointer returned by a call
to c_str() may already be invalid by the time the C API dereferences
it: the string may have been modified in the meantime, so a later
call to c_str() could return a different pointer.
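
A minimal single-threaded sketch of that invalidation, assuming nothing
beyond standard std::string (the function name is made up for
illustration). It records the buffer address, grows the string past its
capacity, and checks whether the buffer moved. With a second thread doing
the mutation the same thing can happen, with a data race on top.

```cpp
#include <cstdint>
#include <string>

// Returns true if growing `s` moved its character buffer, i.e. a pointer
// obtained from c_str() before the growth would now dangle.
bool buffer_moved_after_growth(std::string s) {
    // Keep the old address as an integer so the stale pointer is never used.
    const auto before = reinterpret_cast<std::uintptr_t>(s.c_str());
    // Append more characters than the current capacity can hold,
    // which forces the string to switch to a larger buffer.
    s.append(s.capacity() + 1, 'x');
    const auto after = reinterpret_cast<std::uintptr_t>(s.c_str());
    return before != after;
}
```

Any `char const *` handed to a C API before the append would be pointing
at the old buffer.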

>>
>>    fd = creat(linearize(file), 0666); // rely on ADL
>>
>> This is also bad  because linearize would be allocating the buffer for
>> me which I might not be  able to control the size of or know whether it
>> will be a given length --  worse it might even throw.
>
> Exactly!
>
>   c_str() never throws as it is "const" member function in
>   C++ string semantics.
>
>   So for example this code is fine with const c_str()
>
>   bool create_two_lock_files(string const &f1,string const &f2)
>   {
>     int fd1=creat(f1.c_str(),O_EXCL ...);
>     if(fd1==-1)
>         return false;
>     int fd2=creat(f2.c_str(),O_EXCL ...);
>     if(fd2==-1) {
>         unlink(f1.c_str());
>         close(fd1);
>         return false;
>     }
>     close(fd1);
>     close(fd2);
>     return true;
>   }
>
> It would not work with all the linearize stuff because
> it would not be exception safe and would require
> me to create a temporary variable to store f1 linearized.
>

But see, the point I was making (which apparently you missed, or I
was unclear about) is that

  f1.c_str()

returns a pointer that is valid at the time of the call. By the time
the C API actually reads through that pointer, f1 could have been
modified underneath it and the buffer could have moved. That can't
happen if f1 is a temporary, but note that the same function can be
called with a reference to a non-const `std::string`. As in:

  std::string f1 = "foo", f2 = "bar";
  // start threads that may deal with f1 as an lvalue reference
  // f1.c_str() may be invalidated after c_str() is called.
  create_two_lock_files(f1, f2);

In a world where you couldn't change a string, this use case would be
simpler and every requirement to linearize a string would be explicit
-- so you would know exactly when you need to have the data linearized.
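
To make "explicit linearization" concrete, here is a rough sketch. The
signature is an assumption, not an existing Boost API, and a
std::vector<std::string> merely stands in for whatever segmented
representation the string would really use. The caller owns the buffer
and its size, and the function neither allocates nor throws.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical: copies the chunks of a segmented string into `out`,
// truncating to fit, and always NUL-terminates when out_size > 0.
// Returns the number of characters written, excluding the terminator.
std::size_t linearize(const std::vector<std::string>& chunks,
                      char* out, std::size_t out_size) noexcept {
    if (out_size == 0) return 0;
    std::size_t written = 0;
    for (const std::string& chunk : chunks) {
        // Space left, keeping one byte in reserve for the terminator.
        const std::size_t room = out_size - 1 - written;
        const std::size_t n = chunk.size() < room ? chunk.size() : room;
        std::memcpy(out + written, chunk.data(), n);
        written += n;
        if (written == out_size - 1) break;  // buffer is full
    }
    out[written] = '\0';
    return written;
}
```

A caller would declare something like `char name[256];`, call
`linearize(chunks, name, sizeof name)`, and only then hand `name` to the
C API.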

Also, when copying immutable strings around, the cost would potentially
be the same as copying a shared_ptr<> -- which is incrementing a
reference count and copying a pointer. It doesn't need to copy the
contents anymore for safety because *it will never change*.
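
As a sketch of that copy cost (the class name and members are
hypothetical, just one way such a type could look): the characters live
behind a shared_ptr to const data, so a copy shares the buffer instead
of duplicating it.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: the characters are shared, never modified.
class immutable_string {
public:
    explicit immutable_string(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // The implicit copy constructor copies one pointer and bumps one
    // reference count; no character data is copied.

    const char* data() const noexcept { return data_->data(); }
    std::size_t size() const noexcept { return data_->size(); }
    long owners() const noexcept { return data_.use_count(); }

private:
    std::shared_ptr<const std::string> data_;  // const: no mutation path
};
```

After `immutable_string b = a;` both objects report the same `data()`
pointer.
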

>> It also may mean
>> that the buffer is static, or I  have to manage the pointer being
>> returned.
>
> I think we both and 95% of C++ programmers that use STL
> know what is the semantics of std::string::c_str()
>
>> I like this better:
>>
>>    // I know I want 255 characters max, plus the terminating NUL
>>    char * filename = (char *)malloc(256 * sizeof(char));
>>   if (filename == NULL) {
>>     // deal with  the error here
>>   }
>>   linearize(substr(file, 0, 255),  filename);
>>   fd = creat(filename, 0666);
>>
>
> Sorry? Is this better than:
>
>    fd=creat(filename.substr(0,256).c_str(),O_EXCL...)
>
> Which by the way is 100% thread safe as well (but still may throw).
> Even though I can't see any reason to cut 256 bytes before creat
>

No, not better because std::string's substr() will return a temporary,
which means it will be a copy -- meaning another allocation and a call
to memcpy(...). You don't get the benefit of COW on this one because
you need to cut the string down to a "maximum size".

In case you weren't aware, the C APIs that OSes provide have defined
maximum lengths for filenames and things like that...

>
>> It's explicit and I see  all the memory that I need. I can even imagine
>> a simple class that can do  this. Oh wait, here it is:
>>
>>   std::string filename = substr(file, 0,  255);
>>   fd = creat(filename.c_str(), 0666)
>
> You don't need to create an explicit temporary filename;
> C++ keeps it alive until creat has completed.

I don't need to but I *can* which makes it explicit and clear that:

1) I am linearizing an immutable string.
2) I do need a linearized buffer.

>>
>> All I have to do is  to create a conversion operator to std::string and
>> I'll be fine. Or,  actually:
>>
>>   fd = creat(static_cast<std::string>(substr(file,  0, 255)).c_str(), 0666);
>>
>> Now here file could be a view, could be a raw  underlying string.
>>
>
> As above. Throws and is extremely verbose.
>

This would throw because of... an out of memory exception? At that
point then all bets are off.

And verbosity is exactly what you want when making explicit
conversions and explicit operations.

>>
>> And then the problem is not addressed then of the  unnecessary contiguous
>>buffer.
>>
>
> It is a good idea to have some non-linear data storage, but
> it should be used in very specific cases.
>
> Also, what is a really large string for you, one that would get a
> performance advantage from not being stored linearly?
>
> Talk to me in numbers?
>

Say on a machine+OS combo that has 4 KB pages, a large string would be
something that spans more than one memory page -- i.e. >4 KB.

Now a "short" string is one that can fit within a page. For
concatenated strings, it's fine to compact/copy the substrings into a
growable/shrinkable block. The overhead for a short string would be
constant, as the concatenation tree would be just a pointer and a
length, plus a reference count integer.

If your string (whether long or short) is copied around, then
reference counts will kick in on copy constructors and destructors --
the way shared_ptr does it. This means there are no extra memory
allocations (aside from the actual string object, which would be a
pointer's width).

Now your long strings become interesting. For a string that spans more
than one page, you have the overhead of potentially a page's worth of
the concatenation tree data structure. If a tree node contains three
pointers (block pointer, pointer to left node, pointer to right node),
a reference count (inner nodes can be referred to by other
higher-level concatenation nodes), and a length, then doing some math
on it (12 bytes + 4 bytes + 2 bytes = 18 bytes per node) you have
roughly 1 page of overhead for every 220-page-long string, if you pack
those 220 nodes into a single page (which is really an array-based
tree).
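
The arithmetic can be checked at compile time. The field sizes below are
assumptions read off the 12 + 4 + 2 figure (4-byte pointers, a 4-byte
reference count, a 2-byte length, tightly packed with no padding); a
real layout on another platform would differ.

```cpp
#include <cstddef>

// Assumed sizes, inferred from the 12 + 4 + 2 breakdown in the text.
constexpr std::size_t pointer_bytes  = 4;     // 32-bit pointers (assumption)
constexpr std::size_t refcount_bytes = 4;
constexpr std::size_t length_bytes   = 2;
constexpr std::size_t page_bytes     = 4096;  // 4 KB page

// Three pointers (block, left, right) plus refcount plus length.
constexpr std::size_t node_bytes =
    3 * pointer_bytes + refcount_bytes + length_bytes;

// Whole nodes that fit in one page when packed back to back.
constexpr std::size_t nodes_per_page = page_bytes / node_bytes;

static_assert(node_bytes == 18, "matches the 18-byte figure");
```

Under these assumptions slightly more than 220 nodes fit in a page, so
the "one page per 220 pages" estimate is on the conservative side.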

What you get in return is: constant-time dereference bidirectional
iterators, logarithmic time 'char' random access, cheap copies, and
thread safety.
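
A rough sketch of such a concatenation tree, with all names
hypothetical. Random access walks from the root down to a leaf, so it is
logarithmic only while the tree stays balanced; this sketch does no
rebalancing and leaves out the iterators.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable concatenation-tree node: either a leaf holding a
// block of characters, or an inner node joining two subtrees.
struct rope_node {
    std::shared_ptr<const rope_node> left, right;  // both null for a leaf
    std::string block;                             // used only by leaves
    std::size_t length;                            // characters in this subtree
};

using rope = std::shared_ptr<const rope_node>;

rope make_leaf(std::string text) {
    auto node = std::make_shared<rope_node>();
    node->length = text.size();
    node->block = std::move(text);
    return node;
}

// Concatenation shares both operands instead of copying their characters.
rope concat(rope a, rope b) {
    auto node = std::make_shared<rope_node>();
    node->length = a->length + b->length;
    node->left = std::move(a);
    node->right = std::move(b);
    return node;
}

// Descends to the leaf containing `index`. Precondition: index < r->length.
char char_at(const rope& r, std::size_t index) {
    const rope_node* node = r.get();
    while (node->left) {  // inner nodes always have both children
        if (index < node->left->length) {
            node = node->left.get();
        } else {
            index -= node->left->length;
            node = node->right.get();
        }
    }
    return node->block[index];
}
```

Because nodes are never modified after construction, copies of a `rope`
can be read from several threads without locking.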

>> >
>> > What you are suggesting has nothing to do with text,
>> > and I don't understand how you fail to see this.
>> >
>>
>> I don't know if you're not a native English speaker or  whether you
>> just really think strings are just for text.
>
> There was another great answer to this by David Bergman,
> which you have probably already read.
>
> I can only say +1 to his answer; it couldn't be put better.
>

Okay.

-- 
Dean Michael Berris
about.me/deanberris
