Subject: Re: [boost] [Tokenizer]Usage and documentation
From: Yechezkel Mett (ymett.on.boost_at_[hidden])
Date: 2011-02-10 04:40:59
On Thu, Feb 10, 2011 at 9:32 AM, Max <more4less_at_[hidden]> wrote:
[Stephan T. Lavavej <stl_at_[hidden]> wrote:]
>> [Max]
>> > The part I could not interpret is:
>> > ^|[\s,]
>> > And
>> > $|[\s,]
>>
>> The docs say:
>>
>> > A '^' character shall match the start of a line.
>> > A '$' character shall match the end of a line.
>
> Yes, I'm aware of this. But even with this in mind, I cannot interpret
> "^|[\s,]" and "$|[\s,]".
> For the former, I know '|' means alternation, but how can it be after '^'?
> For the latter, how can "|[\s,]" be expected after the end of a line (and
> the same confusion as above)?
^|[\s,]
means _either_ the beginning of the line _or_ a space or comma. In
other words the field starts either at the beginning of the line or
after a space or comma.
Likewise
$|[\s,]
means _either_ the end of the line _or_ a space or comma, so the field
ends either at the end of the line or before a space or comma.
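To make the alternation concrete, here is a minimal sketch using std::regex
(ECMAScript syntax). The pattern is only an illustration built from the two
fragments above, not the pattern from earlier in the thread; the helper name
fields is made up, and the trailing fragment is wrapped in a lookahead so
that a delimiter shared by two neighbouring fields is not consumed twice.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): returns every field of `line`,
// where a field is a run of characters that are neither whitespace nor commas.
std::vector<std::string> fields(const std::string& line) {
    // (?:^|[\s,])   the field starts at the beginning of the line
    //               or right after a space or comma
    // ([^\s,]+)     the field itself
    // (?=$|[\s,])   the field ends at the end of the line
    //               or right before a space or comma (not consumed)
    static const std::regex field_re(R"((?:^|[\s,])([^\s,]+)(?=$|[\s,]))");

    std::vector<std::string> result;
    for (std::sregex_iterator it(line.begin(), line.end(), field_re), end;
         it != end; ++it) {
        result.push_back((*it)[1].str());
    }
    return result;
}
```

With this sketch, fields("a,b c") gives the three fields a, b and c: the
first is found through the ^ branch, the last through the $ branch, and the
others through the [\s,] branches.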
> One more question - with your code, any empty 'token' between two contiguous
> ',' is ignored, what if someday I'd like to pick them up?
To pick up the empty tokens as well, add alternatives with an empty capture group:
"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$
I'm presuming an empty line should count as no tokens; if you don't
mind an empty line being one token it can be simplified to
"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)
Not really that much simpler.
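For comparison, here is a rough sketch that runs both patterns through
std::regex (ECMAScript syntax) rather than Boost.Regex. The names kStrict,
kSimple and tokenize are made up for this example. It assumes the token is
whichever capture group took part in the match, so an empty group () counts
as an empty token.

```cpp
#include <regex>
#include <string>
#include <vector>

// The two patterns from this message, as raw string literals.
// kStrict: an empty line yields no tokens.
// kSimple: an empty line yields one empty token.
const char* const kStrict =
    R"re("([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$)re";
const char* const kSimple =
    R"re("([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,))re";

// Hypothetical helper: returns the tokens of one line, empty ones included.
std::vector<std::string> tokenize(const std::string& line, const char* pattern) {
    const std::regex re(pattern);
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        // Exactly one alternative matched; take the group that participated.
        for (std::size_t g = 1; g < it->size(); ++g) {
            if ((*it)[g].matched) {
                tokens.push_back((*it)[g].str());
                break;
            }
        }
    }
    return tokens;
}
```

In this sketch tokenize("a,,b", kStrict) gives a, an empty token, and b, and
the two patterns differ on an empty line as described above. One thing to
watch for: each empty-field alternative consumes the comma on both sides, so
as far as I can tell a run such as a,,,b yields only one empty token here.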
Yechezkel Mett