Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-06-14 10:41:07


> Actually I think you provided me a good direction I hadn't considered before.
>
> RtlUTF8ToUnicodeN and its counterpart in the other direction do
> something very simple:
>
> They substitute invalid code points/encodings with U+FFFD REPLACEMENT
> CHARACTER, which is the standard Unicode way to say "I failed to
> convert this".
>
> It is similar to the current ANSI/wide conversions producing ? instead.
>
> That looks like a better approach than failing to convert the
> entire string altogether.
>
> If you pass an invalid string, the conversion will succeed, but you'll
> get special characters (usually displayed as � in UIs) that
> tell you something was wrong.

Some additional information about how the NT kernel treats the FFFD
character might be useful to you.

Quite a lot of kernel APIs, e.g. the filesystem, treat UNICODE_STRING as
just a bunch of bytes. You can supply any path with any characters at
all, including embedded zero (NUL) characters. The only character you
can't use is the backslash, as that is the path separator. This matches
how most Unix filesystems work, and indeed at the NT kernel level path
comparisons are case sensitive as well as byte sensitive (unless you ask
otherwise) because the comparison is just a memcmp(). You can have real
fun here creating lots of paths that Win32 cannot parse at all: files
totally inaccessible to any Win32 API, with some really random Win32
error codes being returned.
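
To make the "just a memcmp()" point concrete, here is a minimal C++
sketch. It is not kernel code: CountedString, make_counted, bytes_equal
and valid_component are hypothetical names invented for illustration,
and the struct only loosely mirrors the length-plus-buffer idea behind
UNICODE_STRING.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a counted string: a byte length plus a
// pointer to UTF-16 code units. No terminator is implied, so embedded
// zero code units are ordinary content.
struct CountedString {
    const char16_t* buffer;
    std::size_t length_bytes;
};

// Build a view over an existing std::u16string (which must outlive it).
inline CountedString make_counted(const std::u16string& s) {
    return CountedString{s.data(), s.size() * sizeof(char16_t)};
}

// Byte-exact equality: no case folding, no normalisation, no validation.
inline bool bytes_equal(const CountedString& a, const CountedString& b) {
    return a.length_bytes == b.length_bytes &&
           std::memcmp(a.buffer, b.buffer, a.length_bytes) == 0;
}

// The only rule described above: a single path component may contain
// anything except the backslash separator.
inline bool valid_component(const CountedString& s) {
    assert(s.length_bytes % sizeof(char16_t) == 0);
    const std::size_t units = s.length_bytes / sizeof(char16_t);
    for (std::size_t i = 0; i < units; ++i) {
        if (s.buffer[i] == u'\\') return false;
    }
    return true;
}
```

Under these assumptions, u"File" and u"file" compare as different
names, and a component containing an embedded zero or U+FFFD passes
valid_component() just like any other.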

Other NT kernel APIs will refuse strings containing U+FFFD or illegal
UTF-16 characters, and where they do, it's generally because accepting
them would be an obvious security risk. But those are definitely a
minority.

If the Win32 layer doesn't get in the way, doing as RtlUTF8ToUnicodeN()
does should be safe insofar as the kernel team considers it safe. They
have placed appropriate checks on the appropriate kernel APIs. But in
the end, it's down to the programmer to correctly validate and check all
untrusted input. Always validating is an unnecessary expense for the
many end users whose input is already trusted.
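
For illustration, the replace-rather-than-fail approach could be
sketched in portable C++ roughly as follows. This is not how
RtlUTF8ToUnicodeN() is implemented. utf8_to_utf16_lossy and
append_utf16 are hypothetical helpers, and the policy of emitting one
U+FFFD per rejected byte is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char16_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Append one Unicode scalar value as UTF-16 (surrogate pair if needed).
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: convert UTF-8 to UTF-16 without ever failing.
// Assumed policy: a byte that does not begin a valid sequence yields
// one U+FFFD, and decoding resumes at the next byte.
inline std::u16string utf8_to_utf16_lossy(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t len = 0;  // 0 means "not a valid lead byte"
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, out-of-range values and surrogates.
        if (ok && (cp < min_cp || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (ok) { append_utf16(out, cp); i += len; }
        else    { out.push_back(kReplacement); i += 1; }
    }
    return out;
}
```

With this sketch, valid input converts unchanged, while something
like "a\xFFz" converts to 'a', U+FFFD, 'z' instead of the whole
conversion failing. Real implementations may differ in how many
replacement characters they emit for a given bad sequence.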

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
