Subject: [boost] tokenizer for overlapping delimiters?
From: Jorge Cardoso Leitão (jorgecarleitao_at_[hidden])
Date: 2015-04-23 06:30:53


Dear boost devs,

I'm writing here because I've been coding a tokenizer that, given its
generality, I believe could be an addition to Boost. I'm asking for a
judgement on whether the idea has a place in Boost or not.

The interface I'm proposing is a function of two arguments, a string (call
it text) and a set of strings (call it key-terms), that returns a
vector<string> (call it result) which fulfils 4 constraints (a sketch of
the signature follows the list):

   1. a string in the result can only be either a key-term or a string
   between two key-terms;
   2. the concatenation of the result is always the original text;
   3. a key-term that contains other key-terms takes priority over them;
   4. a key-term that overlaps another takes priority based on its
   position in the text (the earlier occurrence wins).
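
In C++ terms, a minimal sketch of the signature could look as follows (the
concrete types are only an assumption here; std::string and std::set stand
in for whatever Boost conventions would dictate):

std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& key_terms);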

A tokenizer that divides a string by delimiters is a special case of this
interface, in which the key-terms are the delimiters. This case is covered
by Boost.Tokenizer, where the key-terms are *non-overlapping*. The critical
addition here is the ability to deal with *overlapping key-terms*.

A common use case for overlapping key-terms is when you have key-terms that
you want to treat as single tokens, but which overlap with common
delimiters. A practical example:

tokenize a string (with words separated by spaces) and guarantee that both
`"United States of America"` and `"United States of Brazil"` are each
interpreted as a single token.
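
For instance, in the same pseudo-code used further below (the input
sentence is a made-up example):

tokenize("United States of America and United States of Brazil",
         {" ", "United States of America", "United States of Brazil"})
--> {"United States of America", " ", "and", " ",
     "United States of Brazil"}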

The non-triviality arises because this feat requires tracking which
key-terms partially match the preceding sub-string, and backtracking when a
match fails. (E.g. "United States of " is a common prefix of both terms
above, but once the next letter appears, one or both can be discarded as
potential matches.)

Some examples in pseudo-code (see how they fulfil constraints 1-4):

tokenize("the end", {}) --> {"the end"}
tokenize("the end", {" "}) --> {"the", " ", "end"}
tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}

As a proof of concept, I've implemented this interface and the respective
test cases, which you can find at https://github.com/jorgecarleitao/tokenize.
I'm open to any change needed to meet Boost standards: it can be
generalized to arbitrary sequences and arbitrary element types, converted
to use iterators, documented, tested more thoroughly, etc.

But before anything else, I would like to ask for opinions on whether this
is sufficiently general and useful to be considered for Boost.

Thank you for your time,
Best regards,
Jorge

