Subject: Re: [boost] tokenizer for overlapping delimiters?
From: Jorge Cardoso Leitão (jorgecarleitao_at_[hidden])
Date: 2015-04-25 06:24:50


Hi Ganesh,

I have two main use cases in mind:

1. text interpretation, which is what I've been using it for. The
motivation is that some composite words define terms that the user may want
to use as tokens. At some level of the interpretation, the analyser must
treat composite words as single tokens, which is exactly what this
interface does.

2. search indexes: traditionally, search indexes (e.g. Xapian, Sphinx)
consider single words to be independent tokens and store their relative
position in the text to allow exact matches of composite words (e.g. they
allow queries like "the PHRASE end" to ask for the two words to appear
together). Such a query is naturally less efficient than having the
composite word indexed: firstly because it requires storing the positional
information, and secondly because it requires more than a single lookup
into the inverted index. With this interface, composite words are indexed
without indexing their individual words, which is useful when the
individual words are not informative for a search but the composite word
is (a small sketch of the difference follows).
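
(A minimal sketch of that difference, with hypothetical names and not any
particular engine's API:)

#include <map>
#include <string>
#include <utility>
#include <vector>

using DocId = int;

// Composite key-terms indexed as single tokens: a phrase query such as
// "united states of america" becomes a single lookup.
std::map<std::string, std::vector<DocId>> composite_index;

// Word-level tokens: the index must also keep positions and intersect the
// posting lists of "united", "states", "of" and "america" to answer the
// same query.
std::map<std::string, std::vector<std::pair<DocId, int>>> positional_index;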

I agree that regexes are important here. I'm not sure regexes can fulfil
the 4 constraints above, as there may be no single way to define the best
match for two key-terms when the regexes are greedy. E.g. should
tokenise("foo bar", {"fo. ba", "fo.*"}) return {"foo ba", "r"} or
{"foo bar"}? I could try to formalize the constraints and the algorithm to
include regexes.
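
(To make the ambiguity concrete, a minimal sketch using std::regex; it only
shows standard leftmost, greedy matching and is not part of the proposal:)

#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string text = "foo bar";
    std::smatch m;

    // "fo. ba" matches the prefix "foo ba", leaving "r" as a remainder...
    std::regex_search(text, m, std::regex("fo. ba"));
    std::cout << m.str() << "\n";   // prints: foo ba

    // ...while the greedy "fo.*" swallows the whole string.
    std::regex_search(text, m, std::regex("fo.*"));
    std::cout << m.str() << "\n";   // prints: foo bar
}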

Regards,
Jorge

On Thu, Apr 23, 2015 at 2:17 PM, Ganesh Prasad <sir.gnsp_at_[hidden]> wrote:

> Hi Jorge,
> Can you please provide some ideas about the use-cases? I seem to get your
> point, but I cannot visualize its usefulness in the construction of
> compilers and similar things. Also, should not generic tokenizers take
> RegEx patterns?
> It would be great if you could elaborate in greater detail.
>
> Best Wishes
> Ganesh Prasad
>
> On 23 April 2015 at 16:00, Jorge Cardoso Leitão <jorgecarleitao_at_[hidden]>
> wrote:
>
> > Dear boost devs,
> >
> > I'm writing here because I've been coding a tokenizer that I believe,
> > given its generality, could be an addition to Boost. I'm asking for a
> > judgement on whether the idea has room in Boost or not.
> >
> > The interface I'm proposing is a function of two arguments, a string
> > (call it text) and a set of strings (call it key-terms), that returns a
> > vector<string> (call it result) which fulfils 4 constraints:
> >
> > 1. a string in the result can only be either a key-term or a string
> > between two key-terms;
> > 2. the concatenation of the result is always the original text;
> > 3. a key-term containing other key-terms has priority over the latter;
> > 4. a key-term overlapping another has priority based on its position in
> > the text.
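> >
> > (A minimal sketch of such a signature in C++; the names and exact types
> > are illustrative, not the proof-of-concept's actual API:)
> >
> > std::vector<std::string> tokenize(const std::string& text,
> >                                   const std::set<std::string>& key_terms);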
> >
> > A tokenizer that divides a string by delimiters is a subset of this
> > interface whose key-terms are the delimiters. This is considered in
> > Boost.Tokenizer, where the key-terms are *non-overlapping*. The critical
> > addition here is the ability to deal with *overlapping key-terms*.
> >
> > A common use case of overlapping key-terms is when you have key-terms
> > that you want to consider as single tokens but they overlap with common
> > delimiters. A practical example:
> >
> > tokenize a string (with words separated by spaces) and guarantee that
> > both `"United States of America"` and `"United States of Brazil"` are
> > interpreted as single tokens.
> >
> >
> > The non-triviality appears because such a feat requires storing which
> > key-terms match a previous sub-string, and how to backtrack when a match
> > fails (e.g. "United States of " is common to both terms above, but once
> > the next letter appears, either one or both can be discarded as
> > potential matches).
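> >
> > (A minimal sketch of only that pruning step, not the full algorithm and
> > not the proof-of-concept's code: candidates stay alive while they agree
> > with the text and are discarded at the first disagreeing character.)
> >
> > #include <cstddef>
> > #include <set>
> > #include <string>
> > #include <vector>
> >
> > // Returns the key-terms that fully match text starting at `pos`; the
> > // caller still has to pick a winner (constraints 3-4) or move on if
> > // none survives.
> > std::vector<std::string> matches_at(const std::string& text,
> >                                     std::size_t pos,
> >                                     const std::set<std::string>& key_terms) {
> >     std::vector<std::string> alive(key_terms.begin(), key_terms.end());
> >     std::vector<std::string> complete;
> >     for (std::size_t i = 0; !alive.empty(); ++i) {
> >         std::vector<std::string> next;
> >         for (const auto& term : alive) {
> >             if (i == term.size())
> >                 complete.push_back(term);  // fully matched the text
> >             else if (pos + i < text.size() && text[pos + i] == term[i])
> >                 next.push_back(term);      // still a candidate
> >             // otherwise the term is discarded, e.g. "United States of
> >             // America" drops out once a 'B' follows the common prefix
> >         }
> >         alive.swap(next);
> >     }
> >     return complete;
> > }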
> >
> > Some examples in pseudo-code (see how they fulfil constraints 1-4):
> >
> > tokenize("the end", {}) --> {"the end"}
> > tokenize("the end", {" "}) --> {"the", " ", "end"}
> > tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
> > tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
> > tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}
> >
> > As a proof of concept, I've implemented such an interface and the
> > respective test cases, which you can find at
> > https://github.com/jorgecarleitao/tokenize.
> > Any change is possible to accommodate Boost standards: this can be
> > generalized to arbitrary sequences and arbitrary types, made to use
> > iterators, documented, better tested, etc.
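> >
> > (One hedged possibility for that generalization, purely illustrative and
> > not the current code, would be an iterator-based template where each
> > result element is a [begin, end) sub-range of the input:)
> >
> > template <typename ForwardIt, typename KeyTermRange>
> > std::vector<std::pair<ForwardIt, ForwardIt>>
> > tokenize(ForwardIt first, ForwardIt last, const KeyTermRange& key_terms);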
> >
> > But before anything else, I would like to ask for an opinion on whether
> > this is sufficiently general and useful to be considered for Boost.
> >
> > Thank you for your time,
> > Best regards,
> > Jorge

