
From: Arno Schoedl (schoedl_at_[hidden])
Date: 2024-02-23 04:15:59


That sounds good.

In our codebase (and many others?), for interoperability with APIs, char is used for UTF-8 and wchar_t for UTF-16 (on Windows). I don’t want to reinterpret_cast the input when calling the parser so it knows I am giving it Unicode. Do I have to?
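
For concreteness, a sketch of the friction I mean (get_text_from_api is a made-up stand-in for any char-based API that yields UTF-8):

    #include <string>
    #include <string_view>

    std::string get_text_from_api(); // hypothetical: returns UTF-8 typed as plain char

    void example() {
        std::string utf8_text = get_text_from_api();

        // To tell the parser this is Unicode, must I really write this?
        std::u8string_view as_u8(
            reinterpret_cast<char8_t const *>(utf8_text.data()),
            utf8_text.size());
    }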

P.S. I made my case against introducing char8_t into our codebase here: https://www.think-cell.com/en/career/devblog/char8_t-was-a-bad-idea


--
Dr. Arno Schödl
CTO
schoedl_at_[hidden] | +49 30 6664731-0

We are looking for C++ Developers: https://www.think-cell.com/developers

think-cell Software GmbH (https://www.think-cell.com)
Leipziger Str. 51, 10117 Berlin, Germany
Main phone +49 30 6664731-0 | US toll-free +1 800 891 8091

Amtsgericht Berlin-Charlottenburg HRB 180042
Directors: Christoph Hobo, Dr. Arno Schödl

Please refer to our privacy policy (https://www.think-cell.com/privacy) on how we protect your personal data.

On Feb 23, 2024, at 00:50, Zach Laine via Boost <boost_at_[hidden]> wrote:

On Thu, Feb 22, 2024 at 1:36 PM Arno Schoedl via Boost
<boost_at_[hidden]> wrote:

Since the review of Boost.Parser is currently underway, I wanted to share an insight we had at think-cell when working with Boost.Spirit. We have standardized on it for years for all custom parsing needs. Most parsers are small, but some are larger, like Excel expressions.
Of course, our input is mostly Unicode, either UTF-8 or UTF-16. Matching Unicode is complex. Comparison by code point is usually not the right thing. Instead, we must normalize, for which we even have various choices of what to accept as equal:
https://en.wikipedia.org/wiki/Unicode_equivalence
Case-insensitive matching is more complex still, slow, and even language-dependent.
Input is often not guaranteed to be valid Unicode. For example, file names on Windows are sequences of 16-bit units, allowing unmatched surrogates; the same goes for input from Win32 edit boxes and file content.
But we realized that for almost all grammars we have, all this complexity does not matter. The reserved symbols of most grammars (JSON, XML, C++, URLs, etc.) are pure ASCII. Semantically relevant strings are ASCII as well ("EXCEL.EXE"). ASCII can be correctly and quickly matched on a per-code-unit basis. Case-insensitive matching for ASCII is simple and fast. User-defined strings, such as JSON string values, may contain Unicode, but then they usually do not affect parsing decisions. The user may want Unicode validation for these strings, but this can be done by the leaf parser for those strings rather than for the whole input.
Since so much matching is against ASCII, we found it useful to have compile-time known ASCII literals in the parser library. With them, the same grammar can be used for all input encodings. When parsing user-defined strings, they will have the encoding of the input, but that’s fine. Any encoding conversion can be dealt with separately from the parser.
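
To illustrate the point, a minimal sketch (match_ascii_icase is a made-up name, not from any library):

    #include <cctype>
    #include <cstddef>
    #include <string_view>
    #include <type_traits>

    // Case-insensitively matches an ASCII literal against the front of any
    // code-unit sequence (char, wchar_t, char16_t, ...). ASCII encodes
    // identically in UTF-8, UTF-16, and UTF-32, so no decoding or
    // normalization is needed.
    template <typename CodeUnit>
    bool match_ascii_icase(std::basic_string_view<CodeUnit> input,
                           std::string_view ascii_literal) {
        if (input.size() < ascii_literal.size()) return false;
        for (std::size_t i = 0; i != ascii_literal.size(); ++i) {
            auto const u = static_cast<std::make_unsigned_t<CodeUnit>>(input[i]);
            if (u > 127) return false; // a non-ASCII code unit can never match
            if (std::tolower(static_cast<int>(u)) !=
                std::tolower(static_cast<unsigned char>(ascii_literal[i])))
                return false;
        }
        return true;
    }

The same call works unchanged for char input, wchar_t input on Windows, and charN_t input.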
Finally, we may want to parse more than just strings. Parsing binary files, or sequences of DNA, should be possible and efficient.
Thus I recommend separating Unicode processing from the parser library. The parser library operates on an abstract stream of symbols. For Unicode text these would be code units. It provides the composite parsers such as sequences with and without backtracking, alternatives, Kleene star etc., and leaves the interpretation of the symbols entirely to the leaf parsers, which may or may not care about Unicode.
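
A sketch of that layering (all names made up): the combinator layer consumes opaque symbols, and only the leaf predicate assigns them meaning, so the same machinery handles UTF-8 code units, bytes of a binary file, or DNA bases:

    #include <optional>
    #include <span>

    // Leaf parser: consume one symbol iff the predicate accepts it.
    // The combinator layer never interprets Symbol; only Pred does.
    template <typename Symbol, typename Pred>
    std::optional<Symbol> match_symbol(std::span<Symbol const> & input, Pred pred) {
        if (input.empty() || !pred(input.front())) return std::nullopt;
        Symbol s = input.front();
        input = input.subspan(1); // consume one symbol
        return s;
    }

    // The same leaf mechanism serves text and non-text alphabets:
    inline bool is_dna_base(char c) { return c == 'A' || c == 'C' || c == 'G' || c == 'T'; }
    inline bool is_utf16_digit(char16_t c) { return u'0' <= c && c <= u'9'; }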

I think all of the above sounds right to me too. This last part is a
decent description of how Boost.Parser actually does Unicode handling.
If the input is a range of char, no Unicode processing is used
anywhere. If it is char{8,16,32}_t, then the parse is in "Unicode
mode," and the parsers for which that would make a difference act
accordingly. The user can specify their parsers in ASCII, UTF-8,
UTF-16, or UTF-32, and the right thing will happen. So, if you put a
char_('a') in your parser, that's ASCII, and just works, Unicode or
not, because ASCII is a subset of Unicode. If you put char_(U'X') for
some Unicode character 'X', that also works, whether in Unicode mode
or not. If the input is ASCII, char_(U'X') just won't match any of
the chars being parsed. If the input is Unicode, whether a match
happens might require a transcoding operation (UTF-N -> UTF-M). You
never pay for what you don't use.
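
For concreteness, roughly what that looks like at a call site (pieced together from the Boost.Parser docs; treat the details as illustrative rather than verbatim):

    #include <boost/parser/parser.hpp>
    namespace bp = boost::parser;

    int main() {
        // Range of char: no Unicode handling engaged anywhere.
        auto r1 = bp::parse("a", bp::char_('a'));    // matches

        // Range of char8_t: "Unicode mode"; the ASCII parser still just
        // works, because ASCII is a subset of Unicode.
        auto r2 = bp::parse(u8"a", bp::char_('a'));  // matches

        // Non-ASCII code point in the parser: the comparison may transcode
        // (UTF-N -> UTF-M), and only when actually needed.
        auto r3 = bp::parse(u8"é", bp::char_(U'é')); // matches
    }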

The only time this fails is if you use some non-Unicode-interoperable
encoding (say, EBCDIC) in your char_ parser, and then your input is
some charN_t. Then Boost.Parser will try to compare Unicode to
EBCDIC, which just won't work. I decided a long time ago I don't care
about such encodings. The user can still use them, as long as the
input is a range of chars in encoding E, and all the parsers'
characters and strings are also in E.

Zach
