From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-23 17:47:06
Jeremy Maitin-Shepard wrote:
> Andrey Semashev <andysem_at_[hidden]> writes:
>> Jeremy Maitin-Shepard wrote:
>>> Andrey Semashev <andysem_at_[hidden]> writes:
>> There may be different parsing techniques, depending on the text format.
>> Sometimes character iteration alone is sufficient, as in forward
>> sequential parsing. Nothing prevents non-sequential parsing, though,
>> for example when there is a table of contents with offsets, or when
>> each field to be parsed is preceded by its length.
> Such a format would likely then not really be text, since it would
> contain embedded offsets (which might likely not be text).
Why not? See GCC symbol mangling, for example.
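
To illustrate the length-prefixed case: in the Itanium C++ ABI mangling
that GCC uses, foo::bar is encoded as _ZN3foo3barE, where each identifier
is preceded by its decimal length. The following is only a minimal sketch
of reading such fields, not GCC's demangler. The function name
parse_length_prefixed is hypothetical, and the caller is assumed to have
already stripped the leading "_ZN".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a run of length-prefixed fields such as
// "3foo3bar" into {"foo", "bar"}. Parsing stops at the first character
// that does not start a decimal length (e.g. the closing 'E').
// Returns an empty vector if a declared length runs past the input.
// Simplification: no overflow check on very long digit runs.
std::vector<std::string> parse_length_prefixed(const std::string& text)
{
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[pos])))
    {
        // Read the decimal length that precedes the field.
        std::size_t len = 0;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos])))
        {
            len = len * 10 + static_cast<std::size_t>(text[pos] - '0');
            ++pos;
        }
        if (len > text.size() - pos)
            return std::vector<std::string>(); // malformed input

        // The length tells us where the field ends, so we can take it
        // (or skip it) without inspecting its characters one by one.
        fields.push_back(text.substr(pos, len));
        pos += len;
    }
    return fields;
}
```

Called on "3foo3barE", this yields the two fields "foo" and "bar" and
stops at the 'E'. The input is plain text throughout, yet the parser
jumps by offsets rather than scanning sequentially.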
>> If all standard algorithms and classes assume that the text being parsed
>> is in Unicode, they cannot take advantage of more efficient
>> processing. The std::string, regex, and stream classes will always
>> have to treat the text as Unicode.
> Well, since std::string and boost::regex already exist and do not assume
> Unicode (or even necessarily support it very well; I've seen some
> references to boost::regex providing Unicode support, but I haven't
> looked into it), that is not likely to occur.
Actually, std::string (or basic_string) does not support Unicode, since
it operates on a per-value_type basis. IOW, it won't recognize multi-unit
code sequences. The same goes for streams. As for Boost.Regex, it has such
support, but it is optional (i.e. it also allows processing fixed-width,
1-octet strings). And I believe that is the way to go in other components.
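
To make the per-value_type point concrete, here is a small sketch that
assumes the string holds well-formed UTF-8. The helper name
count_code_points is hypothetical and not part of the standard library
or Boost. It shows that std::string::size() counts char elements, while
counting characters requires interpreting the code sequences yourself.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8 and does
// no validation.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) // not a continuation byte
            ++n;
    }
    return n;
}
```

For the UTF-8 encoding of "naïve" (the bytes "na" "\xC3\xAF" "ve"),
size() reports 6 elements, whereas count_code_points reports 5 code
points.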