
Subject: Re: [boost] tokenizer for overlapping delimiters?
From: Ganesh Prasad (sir.gnsp_at_[hidden])
Date: 2015-04-23 08:17:41

Hi Jorge,
Can you please provide some ideas about the use cases? I think I see your
point, but I cannot visualize its usefulness in the construction of compilers
and similar tools. Also, should generic tokenizers not take RegEx patterns?
It would be great if you could elaborate in greater detail.

Best Wishes
Ganesh Prasad

On 23 April 2015 at 16:00, Jorge Cardoso Leitão <jorgecarleitao_at_[hidden]>
wrote:

> Dear boost devs,
> I'm writing here because I've been coding a tokenizer that I believe, given
> its generality, could be an addition to Boost. I'm asking for a judgement
> on whether the idea has room in boost or not.
> The interface I'm proposing is a function of two arguments, a string (call
> it text) and a set of strings (call it key-terms), that returns a
> vector<string> (call it result) which fulfils 4 constraints:
> 1. a string in the result can only be either a key-term or a string
> between two key-terms;
> 2. the concatenation of the result is always the original text;
> 3. a key-term containing other key-terms has priority over the latter;
> 4. a key-term overlapping other has priority based on its position in
> the text
> A tokenizer that divides a string by delimiters is a special case of this
> interface in which the key-terms are the delimiters. This case is covered
> by Boost.Tokenizer, where the key-terms are *non-overlapping*. The critical
> addition here is the ability to deal with *overlapping key-terms*.
> A common use case of overlapping key-terms is when you have key-terms that
> you want to consider as single tokens but that overlap with common
> delimiters. A practical example:
> tokenize a string (with words separated by spaces) and guarantee that both
> `"United States of America"` and `"United States of Brazil"` are
> interpreted as single tokens.
> The non-triviality appears because such a feat requires storing which
> key-terms are consuming a previous sub-string, and how to backtrack when a
> match fails. (E.g. "United States of " is common to both terms above, but
> once the next letter appears, one or both can be discarded as potential
> matches.)
> Some examples in pseudo-code (see how they fulfil constraints 1-4)
> tokenize("the end", {}) --> {"the end"}
> tokenize("the end", {" "}) --> {"the", " ", "end"}
> tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
> tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
> tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}
> As a proof of concept, I've implemented such interface and respective test
> cases, which you can find in
> Any change is possible to accommodate Boost standards: this can be
> generalized to arbitrary sequences and arbitrary types, made to use
> iterators, documented, better tested, etc.
> But before anything else, I would like to ask for an opinion on whether
> this is sufficiently general and useful to be considered for Boost.
> Thank you for your time,
> Best regards,
> Jorge
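For readers following along, here is a minimal sketch of the interface Jorge
describes. The container choices (std::set, std::vector) and the greedy
"longest match at the earliest position" strategy are my assumptions, not
necessarily how his proof of concept is organized; it reproduces the
pseudo-code examples and constraints 1-4 above.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Sketch of the proposed interface (assumed containers; his proof of
// concept may differ). Scans left to right, emitting key-terms and the
// plain runs between them.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& key_terms) {
    std::vector<std::string> result;
    std::size_t pos = 0;    // current scan position
    std::size_t plain = 0;  // start of pending text that matched no key-term
    while (pos < text.size()) {
        // Constraint 3: among key-terms matching at `pos`, the longest
        // wins (a key-term containing another is necessarily longer).
        std::size_t best = 0;
        for (const std::string& k : key_terms) {
            if (k.size() > best && text.compare(pos, k.size(), k) == 0)
                best = k.size();
        }
        if (best > 0) {
            // Constraint 4: the match starting earliest wins, which the
            // left-to-right scan gives us for free.
            if (plain < pos)  // constraint 1: flush the plain run first
                result.push_back(text.substr(plain, pos - plain));
            result.push_back(text.substr(pos, best));
            pos += best;
            plain = pos;
        } else {
            ++pos;  // no key-term starts here; extend the plain run
        }
    }
    if (plain < pos)
        result.push_back(text.substr(plain, pos - plain));
    return result;  // constraint 2: concatenation reproduces `text`
}
```

With a small `main`, the five pseudo-code examples above all pass as
assertions against this sketch, e.g. `tokenize("foo-bar", {"foo",
"foo-bar"})` yields `{"foo-bar"}` and `tokenize("the end", {"the e", " ",
"end"})` yields `{"the e", "nd"}`.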
