Boost Users :

Subject: Re: [Boost-users] [rfc] a library for gesture recognition, speech recognition, and synthesis
From: Stjepan Rajko (stjepan.rajko_at_[hidden])
Date: 2009-10-26 18:21:31


On Fri, Oct 23, 2009 at 10:18 AM, Roland Bock <rbock_at_[hidden]> wrote:

> Stjepan Rajko wrote:
>
>> On Thu, Oct 22, 2009 at 2:04 AM, Roland Bock <rbock_at_[hidden]> wrote:
>>
>> Stjepan Rajko wrote:
>>
>> Hello,
>>
>> For the past few years I've been working on the AME Patterns
>> library - a generic library for modeling, recognition and
>> synthesis of sequential patterns.
>> ...
>>
>>
>> Support for Hidden Markov models would be of interest to me. I hope
>> to start some part-of-speech-tagging and similar analysis in about
>> half a year.
>>
>>
>> That's a neat problem that I haven't tried yet. I downloaded the Brown
>> corpus and will try to get some results on part-of-speech-tagging.
>>
>> Thanks for your interest,
>>
>> Stjepan
>>
>
> Wow! Keep me posted, please :-)
>
>
OK, I just completed a small experiment on the 9 texts of the Brown Corpus
categorized as "humor". I used 6 of the texts for training, and 3 for
testing.

I created one submodel per tag (see
http://kh.aksis.uib.no/icame/manuals/brown/INDEX.HTM#bc6), trained each one on
the training data, and then connected the submodels into a larger model whose
transitions were also trained on the training data.
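
For readers who want to see the general shape of such a setup, here is a minimal sketch in Python. It is not the AME Patterns API and not necessarily how the experiment above was built: it assumes each per-tag "submodel" is just a word-frequency table, that transitions between tags are simple bigram counts, and that decoding uses the Viterbi algorithm. The names `train`, `log_prob`, `viterbi`, and `START`, the smoothing constant, and the choice to restart the chain after an unseen word are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the start of a sentence


def train(tagged_sentences):
    """Count tag-to-tag transitions and per-tag word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of words seen with it
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    return transitions, emissions


def log_prob(counter, key, smoothing=0.01):
    """Crudely smoothed log relative frequency of key within counter."""
    total = sum(counter.values())
    return math.log((counter[key] + smoothing)
                    / (total + smoothing * (len(counter) + 1)))


def viterbi(words, transitions, emissions):
    """Return the most likely tag per word; None for words never seen in training."""
    best = {START: (0.0, [])}  # last tag -> (log score, tag path so far)
    for word in words:
        candidates = [t for t in emissions if emissions[t][word] > 0]
        if not candidates:
            # Assumption: leave unseen words untagged and restart the chain.
            score, path = max(best.values(), key=lambda sp: sp[0])
            best = {START: (score, path + [None])}
            continue
        step = {}
        for tag in candidates:
            # Pick the best previous tag for reaching this tag.
            score, path = max(
                ((s + log_prob(transitions[prev], tag), p)
                 for prev, (s, p) in best.items()),
                key=lambda sp: sp[0],
            )
            step[tag] = (score + log_prob(emissions[tag], word), path + [tag])
        best = step
    return max(best.values(), key=lambda sp: sp[0])[1]


# Toy training data (made up for illustration, not taken from the Brown Corpus).
toy = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("bark", "NN"), ("peels", "VBZ")],
    [("dogs", "NNS"), ("bark", "VB")],
]
transitions, emissions = train(toy)
print(viterbi(["the", "bark"], transitions, emissions))
print(viterbi(["dogs", "bark"], transitions, emissions))
print(viterbi(["the", "cat", "barks"], transitions, emissions))
```

In the toy data, "bark" is ambiguous between NN and VB, and the trained transitions are what disambiguate it; "cat" never occurs in training, so it is left untagged, which mirrors the "not tagged" category in the results below.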

Here are the results:

Out of 7159 tagged parts of speech (words, symbols, etc.) present in the 3
test texts:
5190 were tagged correctly
300 were tagged incorrectly
1669 were not tagged, because the word or symbol did not appear (at least
not verbatim) in the training data.

So, if you only consider the 7159 - 1669 = 5490 parts that could possibly be
tagged given what the training data covers, you get a 5190/5490 = 94.5%
success rate.
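
The arithmetic behind these figures can be checked with a few lines of Python. The counts are the ones reported above; the variable names, and the two extra ratios (coverage, and accuracy with untagged parts counted as misses), are added here only for illustration.

```python
total = 7159      # tagged parts in the 3 test texts
correct = 5190    # tagged correctly
incorrect = 300   # tagged incorrectly
untagged = 1669   # word or symbol not seen in the training data

# The three categories should account for every part.
assert correct + incorrect + untagged == total

covered = total - untagged                # parts the model could attempt
accuracy_on_covered = correct / covered   # the success rate quoted above
coverage = covered / total                # share of parts seen in training
accuracy_overall = correct / total        # untagged parts counted as misses

print(covered)                            # 5490
print(f"{accuracy_on_covered:.1%}")       # 94.5%
print(f"{coverage:.1%}")                  # 76.7%
print(f"{accuracy_overall:.1%}")          # 72.5%
```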

With a larger training set, the number of untagged parts should go down.
Also, I'm sure there are domain-specific tricks for improving the results.

BTW, 95% of the work to get this done was putting together the code that
reads the corpus, since I already have generic code that does this kind of
experiment.
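
As a rough idea of what the corpus-reading side can involve, here is a small Python sketch. It assumes the tagged text is stored as whitespace-separated `word/tag` tokens, which is how some distributions of the tagged Brown Corpus are laid out; the files used in this experiment may be formatted differently, and `parse_tagged_line` is a hypothetical helper, not part of any library.

```python
def parse_tagged_line(line):
    """Split one line of 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash, so words that contain '/' stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # assumption: skip tokens that carry no tag
        pairs.append((word, tag))
    return pairs


print(parse_tagged_line("The/at jury/nn said/vbd ./."))
```

Lists of pairs like these, one list per sentence, are the kind of input a per-tag training step would consume.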

> I hope to join in a few months...
>
>
Great! I hope to have things cleaned up and better documented by then.

Best,

Stjepan
