Subject: Re: [Boost-users] regex iterator question
From: Mike Marchywka (marchywka_at_[hidden])
Date: 2009-03-30 13:04:06


----------------------------------------
> To: boost-users_at_[hidden]
> Date: Mon, 30 Mar 2009 12:48:48 -0400
> From: dfs_at_[hidden]
> Subject: Re: [Boost-users] regex iterator question
>
>
> "Robert Ramey" writes:
>>Thinking about it, this problem must come up very often. How is it usually
>>addressed? There must be a simple bridge across this. In a pinch, I'll
>>just have to load the whole file into some sort of collection, but I prefer the
>>ultimate unlimited file size solution.
>

I always like unbounded solutions too, but
streaming a file past a regex matcher is
likely to be slow and, as others pointed out,
not even feasible in the general case. You may
still want to think about specialized cases
that benefit from whatever restricted set of regexes
you have.
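
As one illustration of such a specialized case: if none of your expressions
can ever match across a newline, the file can be streamed one line at a time
and memory use is bounded by the longest line. This is only a sketch under
that assumption. The helper name stream_matches is made up here, and std::regex
is used so the snippet stands alone (boost::regex and boost::sregex_iterator
offer the equivalent calls).

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every match of `re` in `in`, reading one
// line at a time. Only valid when no pattern can span a newline.
std::vector<std::string> stream_matches(std::istream& in, const std::regex& re) {
    std::vector<std::string> hits;
    std::string line;
    while (std::getline(in, line)) {
        // Scan just the current line; earlier lines are already discarded.
        const std::sregex_iterator end;
        for (std::sregex_iterator it(line.begin(), line.end(), re); it != end; ++it) {
            hits.push_back(it->str());
        }
    }
    return hits;
}
```

Any std::istream works as the source, so the same function can be fed a
std::ifstream for a real file or a std::istringstream for a quick check.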

You would have to think about "strategies" or similar notions
that look at the problem and pick an approach
or a specific implementation based on its parameters.
I originally came here to run thousands of regex
queries on megabyte-sized strings. I used Boost and
Greta for testing, but quickly found
ways to compile query/sample vectors and implement
restricted searches once I found that all of the not-so-regular
expressions fit a given constraint, or that I could even do
simple things like sorts to preserve locality later.
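
A very small sketch of what picking a strategy could look like, not what I
actually ran: if a query turns out to contain no regex metacharacters, it is
just a literal and a plain substring search will do; otherwise fall back to
the regex engine. The names is_plain_literal and find_first are invented for
this example, the metacharacter list assumes ECMAScript-style syntax, and
std::regex stands in for boost::regex.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical check: true when the pattern has no ECMAScript
// metacharacters, so it can only match itself literally.
bool is_plain_literal(const std::string& pattern) {
    return pattern.find_first_of("\\^$.|?*+()[]{}") == std::string::npos;
}

// Hypothetical dispatcher: returns the offset of the first match in
// `text`, or std::string::npos when there is none.
std::size_t find_first(const std::string& text, const std::string& pattern) {
    if (is_plain_literal(pattern)) {
        // Restricted case: a substring search, no regex machinery needed.
        return text.find(pattern);
    }
    // General case: compile the pattern and let the regex engine search.
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        return static_cast<std::size_t>(m.position(0));
    }
    return std::string::npos;
}
```

A real version would compile each pattern once and cache it rather than
rebuilding it per call, and would probably recognize more constraints than
"is a literal".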

There are a lot of potential performance limitations depending
on the specific task parameters and machine. But, yes,
it would be nice if someone had a general "strategy"
library. LOL.

> In the worst case, if you're using Perl-style expressions (or any style
> that isn't strictly "regular" and requires backtracking; lookahead
> assertions are a common culprit), the entire input may have to be
> consumed and buffered even if the expression ultimately matches only a
> few characters (see "On the Use of Regular Expressions for Searching Text",
> Clarke and Cormack, ACM Transactions on Programming Languages and Systems,
> Vol 19, No. 3, pp 413-426.). Therefore, if you're dealing with small
> files, you may as well buffer the entire file in a char array and use
> regex_token_iterator. If you're dealing with large files,
> memory map the file instead.
>
> daniel
>
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users
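
For the small-file case daniel describes, a minimal sketch might look like
the following. The helper names slurp and match_tokens are made up for this
example, and std::regex is used so it stands alone (boost::sregex_token_iterator
is the Boost equivalent). For large files the same iteration could run over
the char range of a memory-mapped file, for example from
boost::iostreams::mapped_file_source, using the char-pointer iterator variant;
that part is not shown.

```cpp
#include <fstream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: read the whole file into memory.
// Returns an empty string if the file cannot be opened.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: return every substring of `buffer` matched by `re`.
// Submatch index 0 selects the whole match for each token.
std::vector<std::string> match_tokens(const std::string& buffer, const std::regex& re) {
    std::vector<std::string> tokens;
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(buffer.begin(), buffer.end(), re, 0); it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Typical use would be match_tokens(slurp("input.txt"), std::regex("[a-z]+")),
where the file name is of course just a placeholder.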


