WCode - What Unicode could have been

Markus W. Scherer, 2001-Mar-18

This is another Gedankenexperiment with Unicode (March seems to be a good month for those). This time it explores what Unicode itself could have been without most of the compromises that were necessary to make it successful.

This is not a serious proposal; it is purely intended for discussion, study, and comparison.

Introduction

I call this "WCode" to give it a new initial 'W' for derived definitions, just because 'W' is "double U" in English...

WCode is derived from Unicode, with most compromises against its founding principles removed. The one compromise that it keeps is an encoding range larger than 64k, because that makes it easier to define WCode in a useful way (with most of the characters in Unicode). Also, although Unicode Ideographic Description Sequences provide a way to encode CJKV ideographs with a small set of sub-ideographic characters, applying this to the full set of ideographs would take a lot of analysis.

Structure

Assigned Characters

Encoding Forms

Conversion from and to Unicode

Unicode text is converted to WCode by normalizing it (NFKD), reordering Thai/Lao, moving U+e000..U+ffef to W-0d800..W-0f7ef, and changing the UTF to a WTF. In addition, U+feff is converted either to W-0f6ff if it is used as a ZWNBSP or to W-0f7ff if it is used as a BOM.
Plane 16 cannot be encoded.
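
The code point part of this conversion can be sketched in Python as follows. This is only a rough sketch under stated assumptions, not a reference implementation: the names unicode_to_wcode, text_to_wcode, and feff_is_bom are made up for illustration, a U+feff at the very start of the text is assumed to be a BOM and any other U+feff a ZWNBSP, and the Thai/Lao reordering and the final change from a UTF to a WTF are left out.

```python
import unicodedata

def unicode_to_wcode(cp, feff_is_bom=False):
    """Map one Unicode code point (already NFKD-normalized) to WCode."""
    if cp == 0xfeff:
        # BOM and ZWNBSP get separate WCode code points.
        return 0xf7ff if feff_is_bom else 0xf6ff
    if 0xd800 <= cp <= 0xdfff:
        raise ValueError("surrogate code points do not encode characters")
    if cp >= 0x100000:
        raise ValueError("plane 16 cannot be encoded")
    if 0xe000 <= cp <= 0xffef:
        # U+e000..U+ffef moves down to W-0d800..W-0f7ef.
        return cp - 0x800
    return cp

def text_to_wcode(text):
    """Normalize to NFKD, then map each code point.

    Assumption: only a U+feff at index 0 is treated as a BOM.
    Thai/Lao reordering is NOT performed in this sketch.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return [unicode_to_wcode(ord(ch), feff_is_bom=(i == 0))
            for i, ch in enumerate(normalized)]
```

For example, text_to_wcode("\ufeffA\ufb01") treats the leading U+feff as a BOM, and NFKD decomposes the ligature U+fb01 into "f" and "i" before the mapping is applied.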

After normalization and reordering, UTF-16 text can be transformed easily to WTF-16 by transforming code units. (In fact, this is similar to the "fix-up" necessary for comparing UTF-16 strings in code point order, except that U+fff0..U+ffff become W-0fff0..W-0ffff, the leading and trailing surrogate ranges are reversed, and there are fewer leading surrogates.)
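
For comparison, the code point order "fix-up" referred to here can be sketched like this. Note that this shows the fix-up commonly used for comparing UTF-16 strings, not the WTF-16 transformation itself, which differs in the ways listed above; the function names are made up for illustration.

```python
def fixup(unit):
    """Rotate a UTF-16 code unit so binary order equals code point order."""
    if unit >= 0xd800:
        if unit <= 0xdfff:
            return unit + 0x2000  # surrogates d800..dfff -> f800..ffff
        return unit - 0x800       # e000..ffff -> d800..f7ff
    return unit

def utf16_units(text):
    """Return the UTF-16 code units of a Python string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

def code_point_order_key(text):
    """Sort key: fixed-up UTF-16 code units compare in code point order."""
    return [fixup(u) for u in utf16_units(text)]
```

With this key, "\uff5e" sorts before "\U00010000", whereas the raw code units (0xff5e versus 0xd800 0xdc00) would sort the other way around.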

WCode text is converted to Unicode by reversing the above steps, without the need for normalization.
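
A sketch of the reverse code point mapping, under the same caveats: the name wcode_to_unicode is made up for illustration, only the ranges named in this section are handled, and WCode code points for which this section states no Unicode counterpart are rejected rather than guessed.

```python
def wcode_to_unicode(w):
    """Map one WCode code point back to a Unicode code point."""
    if w < 0 or w > 0xfffff:
        raise ValueError("outside the WCode range")
    if w == 0xf7ff:
        return 0xfeff  # BOM
    if 0xd800 <= w <= 0xf7ef:
        # W-0d800..W-0f7ef moves back up to U+e000..U+ffef;
        # this includes W-0f6ff -> U+feff (ZWNBSP).
        return w + 0x800
    if 0xf7f0 <= w <= 0xffef:
        # Assumption: no mapping is stated here for these values.
        raise ValueError("no Unicode counterpart given in this sketch")
    return w
```

Both W-0f6ff and W-0f7ff come back as U+feff, so the distinction between ZWNBSP and BOM is lost again on the Unicode side.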

What is missing from Unicode?

WTF-8

Single bytes: 0..0x9f
Lead bytes: 0xa0..0xbf
Trail bytes: 0xc0..0xff
Overlap: 0x10000..0x1039f is reachable with 3 bytes but should be encoded with 4 bytes, not 3
Illegal: The value range 0x100000..0x10ffff, although it is accessible with 4 bytes
(See also UTF-8C1.)

Code range       | Bits of code points               | WTF-8 bytes                         | Lead bytes | Number of bytes
0..0x9f          | 00000 00000000 pppppppp           | pppppppp                            | -          | 1
0xa0..0x39f      | c-0xa0=00000 000000pp ppqqqqqq    | 1010pppp 11qqqqqq                   | 0xa0..0xab | 2
0x3a0..0x1039f   | c-0x3a0=00000 ppppqqqq qqrrrrrr   | (10101100+pppp) 11qqqqqq 11rrrrrr   | 0xac..0xbb | 3
0x10000..0xfffff | c-0x10000=0ppqq qqqqrrrr rrssssss | 101111pp 11qqqqqq 11rrrrrr 11ssssss | 0xbc..0xbf | 4
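
The byte layout above can be sketched as a small encoder and decoder in Python. This is only a minimal sketch derived from the table and the range notes; the names wtf8_encode and wtf8_decode_one are made up for illustration, and the error handling (raising ValueError) is an assumption rather than part of the design.

```python
def wtf8_encode(c):
    """Encode one WCode code point (0..0xfffff) as WTF-8 bytes."""
    if c < 0 or c > 0xfffff:
        raise ValueError("outside the WCode range 0..0xfffff")
    if c <= 0x9f:
        return bytes([c])
    if c <= 0x39f:
        d = c - 0xa0
        return bytes([0xa0 + (d >> 6), 0xc0 | (d & 0x3f)])
    if c <= 0xffff:
        d = c - 0x3a0
        return bytes([0xac + (d >> 12),
                      0xc0 | ((d >> 6) & 0x3f),
                      0xc0 | (d & 0x3f)])
    # 0x10000 and above always use 4 bytes (see "Overlap").
    d = c - 0x10000
    return bytes([0xbc + (d >> 18),
                  0xc0 | ((d >> 12) & 0x3f),
                  0xc0 | ((d >> 6) & 0x3f),
                  0xc0 | (d & 0x3f)])

def wtf8_decode_one(data, pos=0):
    """Decode one code point at data[pos]; return (code point, next pos)."""
    lead = data[pos]
    if lead <= 0x9f:
        return lead, pos + 1
    if lead >= 0xc0:
        raise ValueError("trail byte where a lead byte was expected")
    if lead <= 0xab:
        count, value, offset = 1, lead - 0xa0, 0xa0
    elif lead <= 0xbb:
        count, value, offset = 2, lead - 0xac, 0x3a0
    else:
        count, value, offset = 3, lead - 0xbc, 0x10000
    trail = data[pos + 1:pos + 1 + count]
    if len(trail) != count or any(b < 0xc0 for b in trail):
        raise ValueError("missing or invalid trail byte")
    for b in trail:
        value = (value << 6) | (b & 0x3f)
    c = value + offset
    if count == 2 and c > 0xffff:
        raise ValueError("overlap: must be encoded with 4 bytes")
    if c > 0xfffff:
        raise ValueError("illegal: beyond 0xfffff")
    return c, pos + 1 + count
```

Because single bytes, lead bytes, and trail bytes occupy disjoint ranges, the decoder can classify every byte on its own, which is why a stray trail byte is detected immediately.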

Acknowledgements

I would like to thank Mark Davis for feedback.