
From: John Maddock (john_at_[hidden])
Date: 2005-07-14 07:32:15


Apologies, I'm still catching up with this thread.

>>!!! "provided that this would not be an empty string" !!!

Correct.

>> How about this string: "/abc/abc".
>> Would this result in "", "abc", "abc"?
> Yes
>
>> Yet "abc/abc/" would result in "abc", "abc"?
> Yes
>
>> That seems terribly unbalanced to me, and this is not the behavior I
>> would expect.
> Yes, you may have a point here.
>
> Or is it somewhat modeled after the C++ initializer syntax:
> { a, b, } is the same as { a, b }
> but { , a, b } isn't the same ...
>
> Maybe John can commence?

The original rationale was "do the same thing as Perl", for example:

perl -e "print join(':', split(/;/, '')) .\"\\n\". join(':', split(/;/,
';')) .\"\\n\". join(':', split(/;/, '1;2')) .\"\\n\". join(':', split(/;/,
'1;2;')) .\"\\n\". join(':', split(/;/, ';1;2;'))"

Outputs (after two empty lines for the first two cases, '' and ';'):

1:2
1:2
:1:2

Note that there are no trailing blank fields; the Perl manual says:

" split /PATTERN/,EXPR,LIMIT
       split /PATTERN/,EXPR
       split /PATTERN/
       split Splits a string into a list of strings and returns that list.
               By default, empty leading fields are preserved, and empty
               trailing ones are deleted."

It also kind of makes sense to me: if you want to split on a delimiter, then
a trailing delimiter does not normally mean you want a trailing blank field:
indeed trailing delimiters are quite commonly used (think C++ array syntax
as one example).
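
For anyone who wants to see the same asymmetry from C++, here is a minimal sketch. It uses std::sregex_token_iterator purely as a stand-in for boost::regex_token_iterator, on the assumption that the two agree for these inputs; the helper name split_fields is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on `delim` using the "-1" submatch
// mode, which yields the text between delimiter matches.
// std::sregex_token_iterator is used as a stand-in for
// boost::regex_token_iterator (assumed to behave alike for these inputs).
std::vector<std::string> split_fields(const std::string& text,
                                      const std::regex& delim)
{
    std::vector<std::string> fields;
    std::sregex_token_iterator it(text.begin(), text.end(), delim, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        fields.push_back(it->str());  // a leading empty field is kept
    return fields;                    // a trailing empty field is not produced
}
```

With a delimiter of "/", this should give "", "abc", "abc" for "/abc/abc" but only "abc", "abc" for "abc/abc/", matching the cases quoted above.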

I believe in Perl you can get the empty trailing field if you specify an
arbitrarily large argument as the split field limit.

As far as Boost.Regex is concerned, regex_token_iterator could be used to
get either behaviour given either definition as the starting point with
equivalent ease:

Given

typedef boost::regex_token_iterator< ...args... > iterator_type;
iterator_type i( ...args... );

Then given the current behaviour (stripping a trailing empty field not
followed by a delimiter):

We know that a trailing field has been stripped if:

(boost::next(i) == iterator_type()) && (i->second != end_of_string_sequence)
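
Spelled out as a function, that check might look like the sketch below. It again uses std::sregex_token_iterator as a stand-in for the Boost iterator (an assumption), and the name trailing_field_stripped is hypothetical. The test is applied to the last token produced, by looking one step ahead with a copy of the iterator.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if the tokenizer dropped an empty
// trailing field, i.e. the last token it produced ends before the end
// of the input (so a delimiter followed it).
bool trailing_field_stripped(const std::string& text, const std::regex& delim)
{
    typedef std::sregex_token_iterator iterator_type;
    const iterator_type end;
    for (iterator_type i(text.begin(), text.end(), delim, -1); i != end; ++i) {
        iterator_type next = i;  // look ahead without disturbing i
        ++next;
        if (next == end)         // i refers to the last token
            return i->second != text.end();
    }
    return false;                // no tokens were produced at all
}
```

For "1;2;" the last token is "2", which ends at the final delimiter rather than at the end of the string, so the function reports a stripped field; for "1;2" it does not.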

Alternatively, if trailing empty fields were to be preserved, then we could
spot them when they happen with:

(boost::next(i) == iterator_type()) && (i->first == i->second)
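
To illustrate going in that direction, here is a sketch of a hand-written splitter that preserves the trailing field, plus the one-line step that recovers the stripping behaviour from it. Neither function is part of Boost.Regex; both names are made up for this example, and std::sregex_iterator is used in place of the Boost equivalent.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical splitter that always emits the text after the last
// delimiter, even when that text is empty.
std::vector<std::string> split_keep_trailing(const std::string& text,
                                             const std::regex& delim)
{
    std::vector<std::string> fields;
    std::string::const_iterator field_start = text.begin();
    for (std::sregex_iterator m(text.begin(), text.end(), delim), end;
         m != end; ++m) {
        fields.push_back(std::string(field_start, (*m)[0].first));
        field_start = (*m)[0].second;  // next field starts after the delimiter
    }
    fields.push_back(std::string(field_start, text.end()));  // may be empty
    return fields;
}

// Drop a single empty trailing field, if there is one (the
// first == second test above, applied to the last field).
std::vector<std::string> strip_trailing_empty(std::vector<std::string> fields)
{
    if (!fields.empty() && fields.back().empty())
        fields.pop_back();
    return fields;
}
```

Note that this removes only one trailing empty field, whereas Perl's split deletes all of them.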

So for me, the question is which behaviour is more commonly required?

At present I can't think of any real world use cases where a trailing empty
field would be important, so here's the challenge: can anyone think of a
file format, or transmission format or command line syntax or whatever where
the trailing field is actually required? Real world cases only please, but
first two data points:

CSV files and the Unicode Character Database don't require the output of
trailing blanks (and parsing of the latter would certainly break if they
were considered).

One more very unscientific data point: historically this has always been the
behaviour of regex_split (now deprecated), and its replacement
regex_token_iterator, and no one has ever complained: until now that is! :-)

Still sticking to my guns for now.... John.

