Boost logo

Boost Users :

From: Lynn Allan (l_d_allan_at_[hidden])
Date: 2006-04-07 02:14:02


"John Maddock" <john_at_[hidden]> wrote in message
news:010801c6595d$74466980$901c0952_at_fuji...
>> In the snippet below, it works ok to loop thru m[num].matched, but I
>> was wondering if there was a more direct way of getting the
>> information about which match was "hit". (The real application is
>> going to have about 70 possible subexpressions to match against,
>> instead of just four.)
>
> 70!!! It wasn't really designed that kind of useage, and there isn't a
> simple way to get say the lowest index sub-expression that was matched:
> even
> with a full knowledge of the implementation it's not clear to me that it's
> possible to do better than an O(N) algorithm.

Thanks for the reply.

It's actually a lot more than 70 total sub-expressions, but 66 "groups" ....
the application is recognizing Bible verse references and emitting a code.

Genesis 1:1 becomes <xb=1.1.1>Genesis 1:1</xb> and
Gen 1 becomes <xb=1.1.1>Gen 1</xb>
The chapter number needs to be specified, but the verse number is optional
(after the colon).

Here is the actual MOAR (mother of all regexes):

((Gen|Ge|Genesis)|
(Exo|Ex|Exodus|Exod)|
(Le|Leviticus|Lev)|
(Nu|Num|Numbers)|
(De|Deuteronomy|Dt|Deut)|
(Jos|Josh|Joshua)|
(Jdg|Judges|Judg)|
(Ru|Ruth)|
(1 Sa|1Sa|1 Samuel|1 Sam|1Sam|1 Kingdoms|I Samuel|First Samuel|1st Samuel)|
(2 Sa|2Sa|2 Samuel|2 Sam|2Sam|2 Kingdoms|II Samuel|Second Samuel|2nd
Samuel)|
(1 Ki|1Ki|1 Kings|1 Kgs|1 Kin|1 Ki|1Kings|3 Kingdoms|I Kings|First Kings|1st
Kings)|
(2 Ki|2Ki|2 Kings|2 Kgs|2 Kin|2 Kgs|II Kings|IIKings|2Kings|Second Kings|2nd
Kings|4 Kingdoms)|
(1 Ch|1Ch|1 Chronicles|1 Chr|1 Chron|1Chron|I Chronicles|IChronicles|First
Chronicles|1st Chronicles)|
(2 Ch|2Ch|2 Chronicles|2 Chr|2 Chron|2Chron|II
Chronicles|IIChronicles|Second Chronicles|2nd Chronicles)|
(Ezr|Ezra)|
(Ne|Neh|Nehemiah)|
(Es|Esther|Esth)|
(Jb|Job)|
(Ps|Psalm|Psa|Psalms)|
(Pr|Proverbs|Pro|Prov)|
(Ec|Ecc|Ecclesiastes|Eccl|Eccles|Ecclesiastics)|
(Cant|Song|Song of Songs|Song of Solomon|Song of Sol)|
(Is|Isaiah|Isa)|
(Je|Jeremiah|Jer)|
(La|Lamentations|Lam)|
(Ez|Ezekiel|Ezk|Ezek)|
(Da|Daniel|Dan)|
(Ho|Hosea|Hs|Hos)|
(Jl|Joel)|
(Am|Amos)|
(Ob|Obadiah|Obad)|
(Jnh|Jonah)|
(Mi|Micah|Mic)|
(Na|Nahum|Nah)|
(Hab|Habakkuk)|
(Zep|Zephaniah|Zeph)|
(Hag|Haggai)|
(Zech|Zec|Zechariah)|
(Mal|Malachi)|
(Mat|Matthew|Matt|Mt)|
(Mar|Mark|Mk)|
(Lk|Luke|Luk)|
(Jn|John|Joh)|
(Ac|Acts|Act)|
(Rm|Romans|Rom|Roman)|
(1Cor|1 Co|1Co|1 Corinthians|1 Corin|1Corin|1 Cor|I Corinthians|I.
Corinthians|First Corinthians|1st Corinthians)|
(2Cor|2 Co|2Co|2 Corinthians|2 Corin|2Corin|2 Cor|II Corinthians|II.
Corinthians|Second Corinthians|2nd Corinthians)|
(Ga|Galatians|Gal)|
(Ep|Ephesians|Eph)|
(Phil|Philippians|Php|Philipians|Phillipians)|
(Col|Colossians)|
(1 Th|1Th|1 Thessalonians|1Thessalonians|1 Thess|1 Thes|1Thess|1Thes|I
Thessalonians|First Thessalonians|1st Thessalonians)|
(2 Th|2Th|2 Thessalonians|2Thessalonians|2 Thess|2 Thes|2Thess|2Thes|II
Thessalonians|Second Thessalonians|2nd Thessalonians)|
(1 Ti|1Ti|1 Timothy|1Timothy|1Tim|1 Tim|I Timothy|First Timothy|1st
Timothy)|
(2 Ti|2Ti|2 Timothy|2Timothy|2Tim|2 Tim|II Timothy|Second Timothy|2nd
Timothy)|
(Ti|Titus|Tit)|
(PHILEMON|Phile|Philemon|Phlm|Philm)|
(Hb|Hebrews|Heb|Hebrew)|
(Ja|James|Jas|Jms)|
(1 Pe|1Pe|1 Peter|1 Pet|I Peter|IPeter|First Peter|1st Peter)|
(2 Pe|2Pe|2 Peter|2 Pet|II Peter|IIPeter|Second Peter|2nd Peter)|
(1 Jn|1Jn|1 John|1 Joh|1Joh|1John|I John|IJohn|First John|1st John)|
(2 Jn|2Jn|2 John|2 Joh|2Joh|2John|II John|IIJohn|Second John|2nd John)|
(3 Jn|3Jn|3 John|3 Joh|3Joh|3John|III John|IIIJohn|Third John|3rd John)|
(Jde|Jude)|
(Apoc|Revelation|Rev|Revel|Apocrypha|Revelation of John|Rev of John|Rev. of
John))
\.? ?([0-9]{1,3})(:([0-9]{1,3}))?

> You could I suppose group the sub-expressions together so you could use
> something more like a binary search to find the one that matched:
>
> ((a|b|c)|(d|e|f))
>
> Then
>
> if(my_match[1].matched)
> {
> // must be a, b or c
> }
> else
> {
> // must be d e or f.
> }

My impression is that I'm already doing something like you suggest. The code
doesn't need to know which of the three variants of Genesis was matched,
just that it was from the Genesis "group". The adaption of the "Captures"
code works well. Thanks for providing it.

Obviously easier said than done, but if the library can turn the appropriate
m[index].matched to "true", then perhaps there could be another variable in
the class to reflect the lowest index of the first group that is true. I
realize this gets more complicated because there can be multiple groups.

> Sorry I can't be more helpful, but it's tempting to suggest that you use
> the
> spirit parser libary rather than regexes.

I'll take a look at it .... I'm really impressed with your boost::regex
library, and am inclined to continue using it. Thanks for making it
available.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net