Subject: Re: [boost] [gsoc]built-in support for dictionary words
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-03-28 11:52:26


kannan venkat wrote:
> I plan to provide
> *support for list of all meaningful words .
> *efficient methods to search if a string is a valid english word.
> *advanced search options for checking if a given string is a substring
> of any valid english word in an efficient manner.
> *methods to check if any anagram of a given string is a valid english word.
>
> could you tell me if this will be a useful contribution & give suggestions
> if there can be any other useful feature related to this..

I do not think it's very interesting if you limit that to English. This
should work with all languages.

Now, with any language, representing and iterating over characters,
words and sentences, as well as comparison and collation, are all
non-trivial operations.
You would, however, need almost all of them to do what you want.

Handling natural language is considerably more complicated than handling
bytes.

Thankfully, the Unicode standard defines representation and a lot of
operations. The funny thing is that some languages, such as Thai,
actually require a dictionary to tell words apart from each other, since
there are no explicit word boundaries (alternatively, it can be done
using machine learning algorithms to detect word-like constructs; there
are quite a few research papers on that topic).
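
To make the dictionary approach concrete, here is a minimal sketch of
greedy longest-match segmentation. The function name segment is
hypothetical and is not part of Boost or ICU. It assumes the text is
already decoded to code points (std::u32string) and ignores grapheme
clusters and normalization, which a real implementation would have to
handle. Greedy matching is only a simple heuristic and can pick the
wrong split when words overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not an existing Boost or ICU function.
// Splits `text` into words by greedy longest match against `dict`.
// A code point that starts no dictionary word becomes a one-element token.
std::vector<std::u32string> segment(
    const std::u32string& text,
    const std::unordered_set<std::u32string>& dict)
{
    // Length of the longest dictionary word bounds every lookup.
    std::size_t longest = 0;
    for (const auto& w : dict) longest = std::max(longest, w.size());

    std::vector<std::u32string> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(longest, text.size() - pos);
        // Try the longest candidate first, shrinking until a word matches.
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) --len;
        if (len == 0) len = 1;  // empty dictionary: emit single code points
        out.push_back(text.substr(pos, len));
        pos += len;
    }
    return out;
}
```

With a dictionary containing "the", "there", "is" and "cat", the input
"thereisacat" comes out as "there", "is", "a", "cat": the longer match
"there" wins over "the", and the unknown "a" falls through as a single
code point.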

There has been a lot of demand for some Unicode library within Boost,
but those demands were never met.
ICU from IBM is a popular Unicode library, though, and several libraries
within Boost use it.

I would suggest you either base your work on ICU, or reimplement
the parts of Unicode that you need.

As for features, I suggest you define a dictionary format that allows
concise definitions of words. Since in English, for example, most words
are actually some base word with prefixes and suffixes, you could simply
tell the dictionary to allow various combinations.
It may be a bad idea or not, I don't know, but I've found several times
that spell-checkers recognized some word but rejected the same word
with a valid suffix added to it.
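
As a rough illustration of what I mean, here is a minimal sketch of such
a dictionary. The type AffixDictionary and its members are hypothetical
names, and the rule it applies (a base word with at most one prefix and
at most one suffix, no spelling changes at the joint) is an assumption
made to keep the example short.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical format: a set of base words plus lists of allowed affixes.
struct AffixDictionary {
    std::unordered_set<std::string> stems;
    std::vector<std::string> prefixes;
    std::vector<std::string> suffixes;

    // True if `word` is a stem, optionally with one prefix and/or one suffix.
    bool accepts(const std::string& word) const {
        // Candidates: the word itself, and the word minus each matching prefix.
        std::vector<std::string> candidates{word};
        for (const auto& p : prefixes)
            if (word.size() > p.size() && word.compare(0, p.size(), p) == 0)
                candidates.push_back(word.substr(p.size()));

        for (const auto& c : candidates) {
            if (stems.count(c)) return true;
            // Otherwise try removing one suffix and looking up the remainder.
            for (const auto& s : suffixes)
                if (c.size() > s.size() &&
                    c.compare(c.size() - s.size(), s.size(), s) == 0 &&
                    stems.count(c.substr(0, c.size() - s.size())))
                    return true;
        }
        return false;
    }
};
```

With stems {"read", "do", "happy"}, prefixes {"un", "re"} and suffixes
{"ing", "able", "s"}, this accepts "rereading" and "unreadable" from
three short lists. The downside is that it also accepts combinations
nobody uses, such as "unreads", so a real format would probably need to
say which affixes each base word allows.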

Your project actually gave me an idea: being a student myself, I will
propose a Unicode string project.
Thank you ;)

