From: Daniela Engert (dani_at_[hidden])
Date: 2024-02-23 08:04:09


On 23.02.2024 at 05:15, Arno Schoedl via Boost wrote:
> That sounds good.
>
> In our codebase (and many others?), for interoperability with APIs, char is used for UTF-8 and wchar_t for UTF-16 (on Windows).

I think this is a practical choice and it works well. We do the same,
with the additional advantage of targeting Windows only. So *everything*
is Unicode (with the respective code units), and in the 7-bit 'char'
range, where ASCII and Unicode code points match value-wise, the
distinction doesn't really matter and everything is easy and agnostic
of encoding.
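
For illustration (a minimal sketch; nothing library-specific): in the
7-bit range the code unit values coincide across all of C++'s character
types, which is exactly what makes encoding-agnostic matching possible:

    int main() {
        // 'A' has the value 0x41 in ASCII and in every UTF encoding,
        // so matching it works the same on any code unit type.
        static_assert('A' == 0x41);
        static_assert(u8'A' == 0x41);  // UTF-8 code unit (char8_t in C++20)
        static_assert(u'A' == 0x41);   // UTF-16 code unit (char16_t)
        static_assert(U'A' == 0x41);   // UTF-32 code unit (char32_t)
    }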

Thanks,
  Dani

> I don’t want to reinterpret_cast the input when calling the parser so it knows I am giving it Unicode. Do I have to?
>
> P.S. I made my case against introducing char8_t into our codebase here: https://www.think-cell.com/en/career/devblog/char8_t-was-a-bad-idea
>
>
> --
> Dr. Arno Schödl
> CTO
> schoedl_at_[hidden]<mailto:schoedl_at_[hidden]> | +49 30 6664731-0
>
> We are looking for C++ Developers: https://www.think-cell.com/developers
>
> think-cell Software GmbH (Web site<https://www.think-cell.com>)
> Leipziger Str. 51, 10117 Berlin, Germany
> Main phone +49 30 6664731-0 | US toll-free +1 800 891 8091
>
> Amtsgericht Berlin-Charlottenburg HRB 180042
> Directors: Christoph Hobo, Dr. Arno Schödl
>
> Please refer to our privacy policy<https://www.think-cell.com/privacy> on how we protect your personal data.
>
> On Feb 23, 2024, at 00:50, Zach Laine via Boost <boost_at_[hidden]> wrote:
>
> On Thu, Feb 22, 2024 at 1:36 PM Arno Schoedl via Boost
> <boost_at_[hidden]> wrote:
>
> Since the reviews of Boost.Parser are currently on, I wanted to share an insight we had at think-cell when working with Boost.Spirit. We have standardized on it for years for all custom parsing needs. Most parsers are small, but some are larger, like Excel expressions.
> Of course, our input is mostly Unicode, either UTF-8 or UTF-16. Matching Unicode is complex. Comparison by code point is usually not the right thing. Instead, we must normalize, for which we even have various choices of what to accept as equal:
> https://en.wikipedia.org/wiki/Unicode_equivalence
> Case-insensitive matching is even more complex, slow, and even language-dependent.
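>
> To make that concrete (a minimal sketch using only the standard library): "é" can be encoded as the single code point U+00E9 or as U+0065 followed by the combining acute U+0301, so comparing canonically equivalent strings code point by code point fails:
>
>     #include <cassert>
>     #include <string>
>
>     int main() {
>         std::u32string precomposed = U"\u00E9";  // é as one code point
>         std::u32string decomposed = U"e\u0301";  // 'e' + combining acute accent
>         assert(precomposed != decomposed);  // identical to a reader, unequal by code point
>     }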
> Input is often not guaranteed to be valid Unicode. For example, file names on Windows are sequences of 16-bit units that allow unmatched surrogates; the same goes for input from Win32 edit boxes and file content.
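>
> A minimal sketch of such ill-formed input (0xD800 is a high surrogate; well-formed UTF-16 requires it to be followed by a low surrogate in the 0xDC00-0xDFFF range):
>
>     #include <string>
>
>     int main() {
>         // Legal as a Windows file name, but not valid UTF-16:
>         std::u16string lone_surrogate = { char16_t(0xD800), u'a' };
>     }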
> But we realized that for almost all grammars we have, all this complexity does not matter. The reserved symbols of most grammars (JSON, XML, C++, URLs, etc.) are pure ASCII. Semantically relevant strings are ASCII as well ("EXCEL.EXE"). ASCII can be correctly and quickly matched on a per-code-unit basis. Case-insensitive matching for ASCII is simple and fast. User-defined strings, such as JSON string values, may contain Unicode, but then they usually do not affect parsing decisions. The user may want Unicode validation for these strings, but this can be done by the leaf parser for these strings, rather than for the whole input.
> Since so much matching is against ASCII, we found it useful to have compile-time known ASCII literals in the parser library. With them, the same grammar can be used for all input encodings. When parsing user-defined strings, they will have the encoding of the input, but that’s fine. Any encoding conversion can be dealt with separately from the parser.
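>
> A sketch of what per-code-unit ASCII matching can look like (ascii_iequal is a hypothetical helper, not from any particular library); since ASCII case folding is a fixed A-Z/a-z mapping, the same template works for char, char16_t, and char32_t input alike:
>
>     template <typename CharT>
>     constexpr bool ascii_iequal(CharT a, CharT b) {
>         // Fold A-Z onto a-z; all other code units compare exactly.
>         auto fold = [](CharT c) {
>             return (c >= CharT('A') && c <= CharT('Z'))
>                 ? CharT(c + (CharT('a') - CharT('A')))
>                 : c;
>         };
>         return fold(a) == fold(b);
>     }
>
>     static_assert(ascii_iequal('E', 'e'));
>     static_assert(ascii_iequal(u'X', u'x'));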
> Finally, we may want to parse more than just strings. Parsing binary files, or sequences of DNA, should be possible and efficient.
> Thus I recommend separating Unicode processing from the parser library. The parser library operates on an abstract stream of symbols. For Unicode text these would be code units. It provides the composite parsers such as sequences with and without backtracking, alternatives, Kleene star etc., and leaves the interpretation of the symbols entirely to the leaf parsers, which may or may not care about Unicode.
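>
> A rough sketch of that separation (hypothetical interface, just to make the idea concrete): the combinator layer only needs equality-comparable symbols, so the same machinery can consume UTF code units, raw bytes, or DNA bases:
>
>     #include <optional>
>     #include <span>
>
>     // A leaf parser: match one expected symbol and advance. It knows
>     // nothing about encodings; Symbol may be char, char16_t, std::byte, ...
>     template <typename Symbol>
>     std::optional<Symbol> match_symbol(std::span<Symbol const>& input,
>                                        Symbol expected) {
>         if (!input.empty() && input.front() == expected) {
>             Symbol matched = input.front();
>             input = input.subspan(1);
>             return matched;
>         }
>         return std::nullopt;
>     }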
>
> I think all of the above sounds right to me too. This last part is a
> decent description of how Boost.Parser actually does Unicode handling.
> If the input is a range of char, nothing Unicode is used,
> anywhere. If it is char{8,16,32}_t, then the parse is in "Unicode
> mode," and the parsers for which that would make a difference act
> accordingly. The user can specify their parsers in ASCII, UTF-8,
> UTF-16, or UTF-32, and the right thing will happen. So, if you put a
> char_('a') in your parser, that's ASCII, and just works, Unicode or
> not, because ASCII is a subset of Unicode. If you put char_(U'X') for
> some Unicode character 'X', that also works, whether in Unicode mode
> or not. If the input is ASCII, char_(U'X') just won't match any of
> the chars being parsed. If the input is Unicode, whether a match
> happens might require a transcoding operation (UTF-N -> UTF-M). You
> never pay for what you don't use.
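>
> In code, the pattern described above looks roughly like this (a minimal sketch; exact result types depend on the Boost.Parser API under review):
>
>     #include <boost/parser/parser.hpp>
>     #include <string>
>
>     namespace bp = boost::parser;
>
>     int main() {
>         std::string ascii_input = "X";     // range of char: no Unicode handling
>         std::u8string utf8_input = u8"X";  // range of char8_t: "Unicode mode"
>
>         // char_(U'X') is usable against both inputs; transcoding happens
>         // only where a comparison actually requires it.
>         auto r1 = bp::parse(ascii_input, bp::char_(U'X'));
>         auto r2 = bp::parse(utf8_input, bp::char_(U'X'));
>     }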
>
> The only time this fails is if you use some non-Unicode-interoperable
> encoding (say, EBCDIC) in your char_ parser, and then your input is
> some charN_t. Then Boost.Parser will try to compare Unicode to
> EBCDIC, which just won't work. I decided a long time ago I don't care
> about such encodings. The user can still use them, as long as the
> input is a range of chars in encoding E, and all the parsers'
> characters and strings are also in E.
>
> Zach
>

