Subject: Re: [Boost-users] Regex - Reading CSV files like MS Excel does
From: Larry (lknain_at_[hidden])
Date: 2008-11-14 16:15:11


I experimented with Boost.Tokenizer at one time, and the escaped tokenizer
worked after a fashion. If you don't have null elements
(e.g., ...,item1,,item3,...) it is not too bad. However, if you define

CharTokens cs(boost::keep_empty_tokens);
EscapedTokenizer et(buffer,cs);

you get the empty tokens too. The downside, as I recall, is that you also
get the separators, but you can skip over them as you iterate over the
result. I think that if you iterate with EscapedTokenizer::iterator you will
still get the surrounding quotes on any item that is quoted.

Another, more heavyweight, scheme is to use Spirit. I found an early version
of a test that I think worked, although I would be hard-pressed at the moment
to explain it.

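    // example string for parsing
    string buffer("\"string\",\"string with an embedded \\\"\",123,0.123,,2");
    rule<> list_csv, list_csv_item;
    vector<string> vec_item; // one entry per matched item
    vector<string> vec_list; // filled by the action attached to the whole list
    // an item is either a quoted string (C-style escapes) or a number
    list_csv_item =
        confix_p('\"', *c_escape_ch_p, '\"')
             | longest_d[real_p | int_p]
             ;
    // items are optional (!) so that empty fields such as ",," are accepted
    list_csv = list_p(
              !list_csv_item[append(vec_item)],
              ','
              )[append(vec_list)]
              ;
    // parse() takes a C string (or an iterator pair), not a std::string
    parse_info<> result = parse(buffer.c_str(),list_csv);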
    if (result.hit) {
        // Got something
        if (result.full) {
              // Complete
        } else {
             // Not quite everything
        }
        // iterate through vec_list or vec_item for data
    } else {
        // Didn't parse
    }

With very simple CSV files you could use split() from the String Algorithms
library (string_algo).

     vector<string> sv;

     split(sv,buffer,is_any_of(","));

You end up with a vector of the elements, and a null item shows up as an
empty string in the vector. As I recall, embedded separators won't work in
this scheme.

There may be easier schemes. I don't recall seeing a regex scheme. I had the
impression that Spirit uses the Tokenizers under the covers, though I have
not verified that, and I think it can use regex in some cases.
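
One such scheme is a small hand-written scanner that uses neither Tokenizer
nor Spirit. The following is only a sketch under stated assumptions: the
function name split_csv_line is made up, each record is a single line, a
quoted field starts with the quote right after the comma, and a quote inside
a quoted field is written as two quotes, which is how Excel saves it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split one CSV record into fields.
//  - a field may be wrapped in double quotes
//  - inside quotes a comma is data and "" stands for one literal quote
//  - empty fields are kept
// Assumption: the record is a single line (no quoted line breaks).
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;

    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    field += '"';       // doubled quote: one literal quote
                    ++i;                // skip the second quote
                } else {
                    in_quotes = false;  // closing quote
                }
            } else {
                field += c;             // commas are plain data in here
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote
        } else if (c == ',') {
            fields.push_back(field);    // separator ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);            // last field, possibly empty
    return fields;
}
```

Given the line 1999,"Doe, Jane ""Happy Gurl""",45 this returns three fields,
the second being Doe, Jane "Happy Gurl" with its inner quotes intact. Input
that leaves the inner quotes undoubled, or puts a space before the opening
quote, is not handled the way Excel handles it.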

Larry
----- Original Message -----
From: "Jeff Dunlap" <jeff_j_dunlap_at_[hidden]>
To: <boost-users_at_[hidden]>
Sent: Friday, November 14, 2008 2:34 PM
Subject: Re: [Boost-users] Regex - Reading CSV files like MS Excel does

> I've been looking at those googled expressions and they don't seem to
> handle the CSV file the way they should.
>
> I'll continue experimenting or use Boost.Tokenizer for this purpose as
> suggested by Eric Malenfant.
>
> Thanks
>
>
> "Jeff Flinn" <TriumphSprint2000_at_[hidden]> wrote in message
> news:gfk7o3$3mc$1_at_ger.gmane.org...
>> Jeff Dunlap wrote:
>>> The expression regex e(",") handles files in a clean format such as:
>>>
>>> field 1,field 2,field 3
>>>
>>>
>>>
>>> I would like to read some CSV files where there may be fields that
>>> contain a comma (enclosed in ""):
>>>
>>> 1999,Smith, Mike, "Smith, Mike", 55
>>> 1999,Doe, Jane, "Doe, Jane", 45
>>>
>>>
>>>
>>> And if possible, I would like to handle commas and quotes within the
>>> field:
>>>
>>> 1999,Doe, Jane, "Doe, Jane "Happy Gurl"", 45
>>
>> googling regex csv yields several hits the first of which is:
>>
>> http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
>>
>> Jeff