Boost :

From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2024-04-25 15:22:40


On 4/25/24 17:53, Peter Dimov via Boost wrote:
> Rob Boehne wrote:
>> * At the moment wide strings are processed by the name generators
>> by converting every wchar_t to 32 bit, then hashing the bytes, zeroes
>> and all. This doesn't strike me as correct. I think that the string should
>> be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16
>> and 32 bit wchar_t assumed UTF-32.)
>>
>> To my thinking – a string should just be treated as binary data and it should
>> not have its encoding changed – this should also make less work.
>
> This behavior makes name UUIDs produced by e.g. "www.example.org"
> and L"www.example.org" different, which is unlikely to be what one wants
> in practice, and is against the recommendation of RFC 4122, which says
>
> o Convert the name to a canonical sequence of octets (as defined by
> the standards or conventions of its name space); put the name
> space ID in network byte order.
>
> I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as
> the "canonical sequence of octets" for U"A".

Perhaps we should simply assume that whatever form of the string the
user provides to the generator is the "canonical" form. That is, if the
user wants "www.example.org" and L"www.example.org" to produce the same
UUID, it is the user's responsibility to convert those strings to the
same representation before passing them to the generator.
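
To make the difference concrete, here is a rough sketch of what user-side
conversion could look like, next to the per-character widening described in
the quoted text. Both to_utf8 and widen_to_bytes are hypothetical helpers
written for illustration, not Boost.UUID API, and the little-endian byte
order in widen_to_bytes is an assumption taken from the 0x41 0x00 0x00 0x00
example above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.UUID): encode UTF-32 code points
// as UTF-8 octets. Precondition: every element is a valid Unicode scalar value.
std::string to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: the "widen every character to 32 bits and hash the
// bytes" form, assuming little-endian byte order. Zero bytes are kept.
std::vector<std::uint8_t> widen_to_bytes(const std::u32string& in)
{
    std::vector<std::uint8_t> out;
    for (char32_t cp : in) {
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<std::uint8_t>((cp >> shift) & 0xFF));
        }
    }
    return out;
}
```

With these, to_utf8(U"www.example.org") yields the same octets as the narrow
literal "www.example.org", so a generator that hashes the bytes as given
would see identical input for both. widen_to_bytes(U"A") yields four bytes,
which is why the narrow and wide names currently hash differently.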

I think that in some regions Unicode might not be the first encoding of
choice, and there are also incorrectly encoded strings that cannot be
converted to UTF-8. I don't think that Boost.UUID should deal with those
issues.
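
As an illustration of the "cannot be converted" case, here is a sketch of a
UTF-16 to UTF-8 conversion that has to report failure on unpaired surrogates.
utf16_to_utf8 is a hypothetical name, not a Boost.UUID function. Returning
std::nullopt is only one possible policy, and choosing one is exactly the
kind of decision I would rather leave to the caller.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not part of Boost.UUID): convert UTF-16 to UTF-8.
// Returns std::nullopt if the input contains an unpaired surrogate.
std::optional<std::string> utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: a low surrogate must follow immediately.
            if (i + 1 >= in.size()) return std::nullopt;
            char32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return std::nullopt;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i; // consume the low surrogate as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            return std::nullopt;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A string holding a single 0xD800 code unit has no UTF-8 form, so the function
returns an empty optional. A library that converts on the fly would need to
pick some behavior for that input, whereas hashing the bytes as given avoids
the question.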

