Subject: Re: [Boost-users] Get All Words Offset in String using Boost regex
From: Anthony Foiani (tkil_at_[hidden])
Date: 2012-01-05 21:25:21


S Nagre <snagre.mumbai_at_[hidden]> writes:

> std::string escapeChar = "\\" ;
> std::string bChar = "b";
> std::string dotChar = ".";
>
> std::string findWordInStr = escapeChar + bChar + dotChar +
> escapeChar + bChar;

This ends up with the expression "\\b.\\b", which will only ever match
a single character with word break on either side (so, in your
example, it should match all and only the spaces):

> "Hello World and Google"
        ^     ^   ^

Closer would be "\\b.+?\\b", but that would still match on your spaces:

> "Hello World and Google"
   ^    ^^    ^^  ^^
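To check the two claims above, here is a minimal sketch that lists every match with its offset. It assumes std::regex from C++11 in place of boost::regex, so that it builds without Boost; the helper name find_all is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): collect the
// (offset, text) pair of every non-overlapping match of `pattern`.
// Assumption: std::regex (ECMAScript grammar) stands in for boost::regex.
std::vector<std::pair<int, std::string>>
find_all(const std::string& pattern, const std::string& text)
{
    std::vector<std::pair<int, std::string>> hits;
    const std::regex re(pattern);
    // With a regex iterator, position() is measured from text.begin().
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        hits.emplace_back(static_cast<int>(it->position()), it->str());
    }
    return hits;
}
```

Calling find_all("\\b.\\b", "Hello World and Google") should report only the three spaces, while find_all("\\b.+?\\b", ...) should alternate between words and the single spaces that separate them.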

If you really want words, you are best off deciding what constitutes a
word, and then writing the regex for exactly that purpose. There is
the built-in "\\w" character class, but only you can decide whether
things like apostrophes and hyphens break words. (And that's just in
English; I have no idea what constitutes a word break in most other
languages!) For English, I'd consider something like "[\\w'-]+"
(which should be: all word chars, plus apostrophes, plus hyphens).
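As a rough illustration of how much the choice of character class matters, the sketch below splits one sentence with a caller-supplied pattern. It again assumes std::regex rather than boost::regex, and the function name words is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every substring matched by `pattern`.
// Assumption: std::regex (ECMAScript grammar) stands in for boost::regex.
std::vector<std::string>
words(const std::string& text, const std::string& pattern = "[\\w'-]+")
{
    std::vector<std::string> out;
    const std::regex re(pattern);
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        out.push_back(it->str());
    }
    return out;
}
```

On "don't stop the well-known bug", the default pattern keeps "don't" and "well-known" whole, whereas plain "\\w+" splits them into "don", "t", "well" and "known".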

And from a personal taste point of view, I'd likely write it exactly
that way. (I do sometimes decompose my regexes, but only if they have
repeated subsections that could better be described as a variable
name.)

You also had a small logic error, when you wrote this:

           OffSetMap[foundPos] = foundLen;

"foundPos" is relative to the start of the last search, not to the
start of the whole string.
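A small sketch of the difference, assuming std::regex in place of boost::regex (both report position() relative to the first iterator passed to regex_search). The struct Offsets and the function word_offsets are names invented for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical container: what position() reports for each match, next
// to the same match's offset from the start of the whole string.
struct Offsets
{
    std::vector<int> relative;
    std::vector<int> absolute;
};

// Assumption: std::regex stands in for boost::regex here.
Offsets word_offsets(const std::string& text)
{
    Offsets out;
    const std::regex re("[\\w'-]+");
    std::smatch what;
    std::string::const_iterator start = text.begin();
    const std::string::const_iterator end = text.end();

    while (std::regex_search(start, end, what, re))
    {
        // Relative to `start`, the beginning of this particular search.
        const int pos = static_cast<int>(what.position());
        out.relative.push_back(pos);
        // Add the distance already consumed to get the absolute offset.
        out.absolute.push_back(static_cast<int>(start - text.begin()) + pos);
        start = what[0].second;  // resume just past this match
    }
    return out;
}
```

For "Hello World and Google" the relative positions come out as 0, 1, 1, 1, which is why storing them directly as map keys overwrites entries; the absolute offsets are 0, 6, 12, 16.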

Here's my version:

| #include <iostream>
| #include <map>
| #include <string>
|
| #include <boost/foreach.hpp>
| #include <boost/regex.hpp>
|
| typedef int int32;
|
| // Maps each word's offset in the whole string to its length.
| typedef std::map< int32, int32 > offset_map_t;
|
| void create_offset_map( const std::string & str,
|                         offset_map_t & offset_map )
| {
|     std::cout << "searching '" << str << "'" << std::endl;
|
|     boost::regex re( "[\\w'-]+" );
|
|     boost::smatch what;
|
|     std::string::const_iterator start = str.begin();
|     std::string::const_iterator end   = str.end();
|
|     while ( boost::regex_search( start, end, what, re ) )
|     {
|         // position() is relative to "start", not to str.begin().
|         int32 pos = what.position();
|         int32 len = what.length();
|
|         std::cout << "  found '" << what.str( 0 ) << "'"
|                   << " at pos=" << pos << ", len=" << len << std::endl;
|
|         // Move to the match, record its absolute offset, then skip it.
|         start += pos;
|         offset_map[ start - str.begin() ] = len;
|         start += len;
|     }
|
|     BOOST_FOREACH( const offset_map_t::value_type & p, offset_map )
|         std::cout << "  ( " << p.first << ", "
|                   << p.second << " )" << std::endl;
| }
|
| int main( int argc, char * argv [] )
| {
|     // Treat each command-line argument as a string to search.
|     for ( int i = 1; i < argc; ++i )
|     {
|         offset_map_t my_map;
|         create_offset_map( argv[i], my_map );
|     }
|     return 0;
| }

Hope this helps.

Best Regards,
Anthony Foiani

