- Subject: C strings, NULs (not NULLs)... and "modified UTF-8"
- From: sur-behoffski <sur_behoffski@...>
- Date: Fri, 8 Jul 2016 19:17:37 +0930
(Replying to the digest, so apologies for lack of threading.)
UTF-8 has the nice property that the NUL (zero) octet (usually a
byte or char, although some DSPs have 32-bit chars...) never
occurs inside a valid multi-octet sequence. It appears only as the
ASCII character '\0' (U+0000), where it can happily serve as a
terminator.
There is an alternative way of encoding zero (U+0000), which
takes two octets and avoids using a zero octet: 0xc0 0x80. Sadly,
Unicode prohibits alternate (longer than necessary, or "overlong")
encodings, so this is an invalid sequence in standard UTF-8.
For most code points, the alternate-encoding prohibition is a very
welcome property, as it makes input validation easier.
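
As a rough illustration of the two paragraphs above, here is a
minimal C sketch of a two-octet decoder in a lenient and a strict
flavour. The function names are my own invention, not taken from any
library, and only the two-octet form (110xxxxx 10xxxxxx) is handled:

```c
#include <assert.h>
#include <stddef.h>

/* Decode the two-octet sequence at p[0], p[1] WITHOUT rejecting
 * overlong forms.  Returns the code point, or -1 if the octets do
 * not match the pattern 110xxxxx 10xxxxxx.
 * (Hypothetical helper, for illustration only.) */
static long decode2_lenient(const unsigned char *p)
{
    assert(p != NULL);
    if ((p[0] & 0xE0) != 0xC0 || (p[1] & 0xC0) != 0x80)
        return -1;
    return ((long)(p[0] & 0x1F) << 6) | (long)(p[1] & 0x3F);
}

/* Strict variant: anything below U+0080 fits in a single octet, so
 * a two-octet encoding of it is overlong and is rejected.  This
 * covers 0xC0 0x80 and every other sequence led by 0xC0 or 0xC1. */
static long decode2_strict(const unsigned char *p)
{
    long cp = decode2_lenient(p);
    if (cp >= 0 && cp < 0x80)
        return -1;
    return cp;
}
```

Given the octets 0xc0 0x80, the lenient decoder yields code point 0,
while the strict one reports an error; a Modified UTF-8 reader would
presumably sit in between, accepting that one overlong form only.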
However, people have noticed this trick and have made it a fairly
formal informal standard called "Modified UTF-8". Wikipedia
notes the existence of "Modified UTF-8" implementations in various
places, including Java components and Tcl internals (remember,
however, that citing Wikipedia is NOT the same as citing a
trustworthy standard).
With Modified UTF-8, you get both: an embedded NUL character has an
unambiguous two-octet encoding, and a single NUL octet can still be
appended as the C string terminator, which can ease using legacy C
interfaces.
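
To make that concrete, here is a minimal sketch in C. It assumes the
input is otherwise valid UTF-8 held in a counted buffer, and it deals
only with the NUL case (the Modified UTF-8 described for Java also
treats supplementary characters differently, which is ignored here).
The function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy len octets from src to dst, writing each embedded NUL as the
 * two octets 0xC0 0x80, then append a single terminating NUL.
 * dst must have room for 2 * len + 1 octets (worst case: all NULs).
 * Returns the number of octets written, excluding the terminator.
 * (Hypothetical helper, for illustration only.) */
static size_t mutf8_escape_nul(const unsigned char *src, size_t len,
                               unsigned char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[n++] = 0xC0;    /* overlong lead octet  */
            dst[n++] = 0x80;    /* continuation octet   */
        } else {
            dst[n++] = src[i];
        }
    }
    dst[n] = 0x00;              /* the only zero octet in dst */
    return n;
}
```

After escaping, the result contains no zero octet before the
terminator, so strlen() and friends see the whole string: the three
octets 'a', NUL, 'b' become 'a', 0xc0, 0x80, 'b', and strlen()
reports 4.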
-- sur-b.