Subject: Re: [boost] [Tokenizer]Usage and documentation
From: Stephan T. Lavavej (stl_at_[hidden])
Date: 2011-02-09 09:46:03


[Max]
> I'm using boost::tokenizer to do some simple parsing of a data file in a format specified by the following rules:
> - One record of several fields in a single line
> - Adjacent data fields in a record separated by whitespace characters (space or tab), with or without ","
> - String without space(s), with or without quotation marks
> - String with space(s), with quotation marks
> One example of a 4-field-per-record file is like:
> "string 2" 3 4 5 4.3
> "String", 2, 3.04 4 3
> AnyOtherText, 2, 3.04 4 3

> I've indeed tried with boost.Regex, on a slightly different path though - I
> was using boost::regex_search instead.

Never call regex_search() in a loop by incrementing iterators - doing so can trigger infinite loops and incorrect results. Read Pete Becker's TR1 book for the gory details (consider what happens with zero-length matches, for example). Always use regex_iterator or regex_token_iterator instead.

> But I still cannot understand it, after reading through
> http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
> The part I could not interpret is:
> ^|[\s,]
> And
> $|[\s,]

The docs say:

> A '^' character shall match the start of a line.
> A '$' character shall match the end of a line.

It depends on how strict you want to be (see the unusual examples below, especially involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings. If I'm wrong about that, I'd love to find out.) I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

C:\Temp>type meow.cpp
#include <iostream>
#include <ostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

int main() {
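    // Alternative 1: a quoted string, capturing the text between the quotes.
    // Alternative 2: a run of characters that are not whitespace, commas, or quotes.
    const string reg("\"([^\"]*)\"|([^\\s,\"]+)");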
    const regex r(reg);

    cout << "r: " << reg << endl << endl;

    for (string s; getline(cin, s); ) {
        if (s == "bye") {
            break;
        }

        vector<string> v;

        for (sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
            const smatch& m = *i;

            // Exactly one of the two capture groups participates in each match.
            v.push_back(m[1].matched ? m[1] : m[2]);
        }

        for (vector<string>::const_iterator i = v.begin(); i != v.end(); ++i) {
            cout << "[" << *i << "]";
        }

        cout << endl << endl;
    }
}

C:\Temp>cl /EHsc /nologo /W4 meow.cpp
meow.cpp

C:\Temp>meow
r: "([^"]*)"|([^\s,"]+)

"string 2" 3 4 5 4.3
[string 2][3][4][5][4.3]

"String", 2, 3.04 4 3
[String][2][3.04][4][3]

AnyOtherText, 2, 3.04 4 3
[AnyOtherText][2][3.04][4][3]

commas,without,spaces,"and","cute fluffy kittens"
[commas][without][spaces][and][cute fluffy kittens]

  leading whitespace and (invisible) trailing whitespace
[leading][whitespace][and][(invisible)][trailing][whitespace]

empty "" quotes
[empty][][quotes]

really"bizarre"strings"like""this"
[really][bizarre][strings][like][this]

empty,,,fields, , , like this
[empty][fields][like][this]

bye

C:\Temp>

Stephan T. Lavavej
Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer

