Subject: Re: [boost] tokenizer for overlapping delimiters?
From: Ganesh Prasad (sir.gnsp_at_[hidden])
Date: 2015-04-23 08:17:41


Hi Jorge,
Can you please provide some ideas about the use cases? I think I see your
point, but I cannot visualize its usefulness in the construction of compilers
and similar tools. Also, shouldn't generic tokenizers take regex patterns?
It would be great if you could elaborate in greater detail.

Best Wishes
Ganesh Prasad

On 23 April 2015 at 16:00, Jorge Cardoso Leitão <jorgecarleitao_at_[hidden]>
wrote:

> Dear boost devs,
>
> I'm writing here because I've been coding a tokenizer that I believe, given
> its generality, could be an addition to Boost. I'm asking for a judgement
> on whether the idea has room in boost or not.
>
> The interface I'm proposing is a function of two arguments, a string (call
> it text) and a set of strings (call it key-terms), that returns a
> vector<string> (call it result) which fulfils 4 constraints:
>
> 1. a string in the result can only be either a key-term or a string
> between two key-terms;
> 2. the concatenation of the result is always the original text;
> 3. a key-term containing other key-terms has priority over the contained
> ones;
> 4. a key-term overlapping another has priority based on its position in
> the text.
>
> A tokenizer that divides a string by delimiters is a subset of this
> interface whose key-terms are the delimiters. This is considered in
> Boost.Tokenizer, where the key-terms are *non-overlapping*. The critical
> addition here is the ability to deal with *overlapping key-terms*.
>
> A common use case of overlapping key-terms is when you have key-terms that
> you want to consider as single tokens but they overlap with common
> delimiters. A practical example:
>
> tokenize a string (with words separated by spaces) and guarantee that both
> `"United States of America"` and `"United States of Brazil"` are
> interpreted as single tokens.
>
>
> The non-triviality appears because such a feat requires storing which
> key-terms are consuming a previous sub-string, and backtracking when a
> match fails. (E.g. "United States of " is common to both terms above, but
> once the next letter appears, one or both candidates can be discarded as
> potential matches.)
>
> Some examples in pseudo-code (note how they fulfil constraints 1-4):
>
> tokenize("the end", {}) --> {"the end"}
> tokenize("the end", {" "}) --> {"the", " ", "end"}
> tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
> tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
> tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}
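
[To make the proposed semantics concrete, here is a naive C++ sketch of the interface above. It is not the implementation from the linked repository: it simply rescans every key-term at every position (so it is quadratic where the proposal's bookkeeping would avoid that), but it reproduces the five examples and the priority rules.]

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Greedy sketch of the proposed interface: at each position, prefer the
// longest key-term matching there (constraint 3); an earlier match wins
// over a later overlapping one (constraint 4). Text between key-terms is
// emitted as its own token (constraints 1 and 2).
std::vector<std::string> tokenize(const std::string& text,
                                  const std::vector<std::string>& terms) {
    std::vector<std::string> result;
    std::string pending;  // text accumulated between key-term matches
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t best = 0;  // length of the longest key-term matching at i
        for (const auto& t : terms)
            if (t.size() > best && text.compare(i, t.size(), t) == 0)
                best = t.size();
        if (best > 0) {
            if (!pending.empty()) { result.push_back(pending); pending.clear(); }
            result.push_back(text.substr(i, best));
            i += best;
        } else {
            pending += text[i++];  // no key-term starts here; keep scanning
        }
    }
    if (!pending.empty()) result.push_back(pending);
    return result;
}
```

For instance, `tokenize("foo-bar", {"foo", "foo-bar"})` yields `{"foo-bar"}` because the longer term wins at position 0, and `tokenize("the end", {"the e", " ", "end"})` yields `{"the e", "nd"}` because the earliest match consumes the overlap.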
>
> As a proof of concept, I've implemented such interface and respective test
> cases, which you can find in https://github.com/jorgecarleitao/tokenize.
> Any change is possible to accommodate Boost standards: this can be
> generalized to arbitrary sequences and arbitrary types, converted to use
> iterators, documented, better tested, etc.
>
> But before anything else, I would like to ask for an opinion on whether
> this is sufficiently general and useful to be considered for Boost.
>
> Thank you for your time,
> Best regards,
> Jorge
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk