From: Preston A. Elder (prez_at_[hidden])
Date: 2005-04-26 11:42:00


Hey all,

I've been working on something to reverse boost::format, mainly using
regex.

So say you're working with IRC, and to create the server line, you use a
boost::format of:
SERVER %1% %2% :%3%

The expanded string might be:
SERVER test.neuromancy.net 1 :My test IRC server

To un-do it you would need a regex of:
SERVER ([-.[:alnum:]]+) ([[:digit:]]+) :(.*)
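
As a rough sketch of that round trip, the pattern above can be applied
directly to the expanded line. This uses std::regex in place of
Boost.Regex purely to avoid an external dependency, and the helper name
parse_server_line is hypothetical, not something from Mantra.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: recovers the three arguments of
// "SERVER %1% %2% :%3%" from an expanded line.
// Returns an empty vector if the line does not match.
std::vector<std::string> parse_server_line(const std::string& line)
{
    // Same pattern as in the message. std::regex's default grammar
    // accepts POSIX character classes inside bracket expressions.
    static const std::regex re(
        "SERVER ([-.[:alnum:]]+) ([[:digit:]]+) :(.*)");

    std::vector<std::string> args;
    std::smatch m;
    if (std::regex_match(line, m, re)) {
        assert(m.size() == 4); // whole match plus three capture groups
        for (std::size_t i = 1; i < m.size(); ++i)
            args.push_back(m[i].str());
    }
    return args;
}
```

Calling it on the example line above should yield the server name, the
hop count and the description as three separate strings.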

My unformatter (right now, part of my Mantra project) takes the format
string and the expanded version of that string, and populates a map with
the value of each formatting token. This only works if your formatting
tokens are numbered.

It does handle formatting args (e.g. %1$3.2f% or %1|-20s|%); however, it
does not validate them, it just accounts for them in the regex. It also
lets you specify your own regexes for reading particular arguments, so
for example you could do:
mantra::unformat fmt;
fmt.ElementRegex(1, "[-.[:alnum:]]+");
fmt.ElementRegex(2, "[[:digit:]]+");

A 'default' unformatting regex may also be specified, either as a
constructor argument or using DefaultRegex (it defaults to ".*"). I've
also made a convenient way to do all of the above more or less inline,
inspired by boost::format, with:
(mantra::unformat(".*") % "[-.[:alnum:]]+" % "[[:digit:]]+")
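
One way such a chaining interface could be built is sketched below. This
is not the Mantra class: unformat_chain and regex_for are hypothetical
names, and the only behaviour assumed is that each use of % assigns a
regex to the next argument number, with the constructor argument as the
fallback.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the chaining interface; not mantra::unformat.
class unformat_chain
{
public:
    explicit unformat_chain(const std::string& default_regex = ".*")
        : default_regex_(default_regex), next_(1) {}

    // Each use of % assigns a regex to the next argument (1, 2, ...).
    unformat_chain& operator%(const std::string& element_regex)
    {
        element_regex_[next_++] = element_regex;
        return *this;
    }

    // Regex for argument n, falling back to the default regex.
    const std::string& regex_for(int n) const
    {
        assert(n >= 1); // arguments are numbered from 1, as in %1%
        std::map<int, std::string>::const_iterator it =
            element_regex_.find(n);
        return it != element_regex_.end() ? it->second : default_regex_;
    }

private:
    std::string default_regex_;
    std::map<int, std::string> element_regex_;
    int next_;
};
```

With this sketch, the inline expression above would leave argument 1 and
2 with their own regexes and any further argument with ".*".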

My reason for posting this is that I'd like someone else to tell me
whether there is a better way to do all of this than the one I am using,
or a better regex than the one I'm using to pull out the formatting
strings.

You can see the source at (search for basic_unformat):
http://www.neuromancy.net/viewcvs/Mantra-I/include/mantra/core/algorithms.h?root=mantra&rev=1.12&view=auto

You'll notice I use convert_string<C> a lot. This just converts between
character types: basic_unformat, like basic_format or basic_string, can
be passed an arbitrary character type (though it usually gets char or
wchar_t), so I need to be able to convert my const char * regex strings
into that character type for it to still work with wchar_t.
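
For illustration only, a conversion of that kind might look like the
following. widen_ascii is a hypothetical name and is not the actual
convert_string<C>; it assumes the input is plain ASCII, which holds for
the built-in regex strings here.

```cpp
#include <cassert>
#include <string>

// Hypothetical analogue of convert_string<C>: copies a narrow string
// literal into a std::basic_string<C>. Only valid for ASCII input.
template <typename C>
std::basic_string<C> widen_ascii(const char* s)
{
    std::basic_string<C> out;
    for (; *s != '\0'; ++s) {
        assert(static_cast<unsigned char>(*s) < 0x80); // ASCII only
        out.push_back(static_cast<C>(*s));
    }
    return out;
}
```

For example, widen_ascii<wchar_t>("[[:digit:]]+") gives a std::wstring
that can be handed to a wchar_t regex.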

The way it basically works now is in 5 steps.
1) It searches the format string for all the 'extended' %N% tags and
goes through and remembers the order they are in.
2) It replaces the 'extended' %N% tags with --<<N>>-- tags (the "--<<"
and ">>--" parts can be defined by the user if they conflict with
something in the input).
3) It escapes all regex-specific characters in the format string so that
they are treated literally when that string is later used as a regex.
4) It replaces all the --<<N>>-- tags with either the user-defined
element regex, or the 'default' regex.
5) It evaluates the new format string (now a regex) against the
'expanded' string, and pulls out each element (using the order it
remembered from before). This step also checks that every instance of
the same element has the same value (i.e. %1% should be the same every
time it is used).
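
The five steps above can be sketched roughly as follows. This is not the
Mantra source: it uses std::regex rather than Boost.Regex, handles only
plain %N% tags (no formatting args), hard-codes the "--<<" and ">>--"
markers, and assumes the element regexes contain no capture groups of
their own. The name unformat_sketch and its signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns false if 'expanded' does not match, or if a repeated %N% tag
// captured two different values.
bool unformat_sketch(const std::string& format,
                     const std::string& expanded,
                     const std::map<int, std::string>& element_regex,
                     std::map<int, std::string>& out,
                     const std::string& default_regex = ".*")
{
    static const std::regex tag("%([0-9]+)%");
    static const std::regex placeholder("--<<([0-9]+)>>--");

    // Step 1: remember the order in which the %N% tags appear.
    std::vector<int> order;
    for (std::sregex_iterator it(format.begin(), format.end(), tag), end;
         it != end; ++it)
        order.push_back(std::stoi((*it)[1].str()));

    // Step 2: swap %N% for --<<N>>-- so escaping leaves the tags alone.
    const std::string tagged =
        std::regex_replace(format, tag, "--<<$1>>--");

    // Step 3: escape regex metacharacters in the literal text.
    const std::string specials = ".^$|()[]{}*+?\\";
    std::string escaped;
    for (char c : tagged) {
        if (specials.find(c) != std::string::npos)
            escaped += '\\';
        escaped += c;
    }

    // Step 4: replace each placeholder with a capture group holding the
    // element regex for that tag, or the default regex.
    std::string re_text;
    std::size_t last = 0;
    for (std::sregex_iterator it(escaped.cbegin(), escaped.cend(),
                                 placeholder), end; it != end; ++it) {
        const std::size_t pos = static_cast<std::size_t>(it->position(0));
        const std::size_t len = static_cast<std::size_t>(it->length(0));
        auto found = element_regex.find(std::stoi((*it)[1].str()));
        re_text.append(escaped, last, pos - last);
        re_text += "(" + (found != element_regex.end() ? found->second
                                                       : default_regex) + ")";
        last = pos + len;
    }
    re_text.append(escaped, last, std::string::npos);

    // Step 5: match, then map each capture back to its tag number.
    std::smatch m;
    if (!std::regex_match(expanded, m, std::regex(re_text)))
        return false;
    assert(m.size() == order.size() + 1); // no extra capture groups

    std::map<int, std::string> result;
    for (std::size_t i = 0; i < order.size(); ++i) {
        const std::string value = m[i + 1].str();
        auto ins = result.insert(std::make_pair(order[i], value));
        if (!ins.second && ins.first->second != value)
            return false; // the same %N% expanded to different text
    }
    out.swap(result);
    return true;
}
```

The escaping in step 3 is done with a plain character loop rather than a
regex replace, which keeps the set of escaped characters easy to audit.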

This is a little involved, but using regex, it's still quite quick.

As always, comments, corrections, suggestions, etc. are appreciated, and
in this case, solicited :)

If you want to yoink the code for this for your own purposes, go ahead.

PreZ :)

