Markus W. Scherer, 2000-Mar-12
This is a Gedankenexperiment for what could have been UTF-8 if it had been defined after the Unicode range was set to that of UTF-16, with code points up to only 0x10ffff and not up to UCS-4's 0x7fffffff. I realize that this comes about 7 years too late, but I like the elegance that is possible with a custom-fit encoding form. Be my guest.
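The name indicates that this is an encoding form that uses 8-bit code units and is C1-control-code-safe: the byte values 0x80..0x9f occur only as the single-byte encodings of the C1 control codes themselves, never as lead or trail bytes.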
Important: This is not an approved encoding for Unicode nor for ISO 10646. It is also not a proposal for a new UTF. It is only meant as a "what if". UTF-8C1 is not compatible with UTF-8.
| Code range | Bits of code points | UTF-8C1 bytes | Lead bytes | Number of bytes |
|---|---|---|---|---|
| 0..0x9f | 00000 00000000 pppppppp | pppppppp | - | 1 |
| 0xa0..0x39f | c-0xa0=00000 000000pp ppqqqqqq | 1010pppp 11qqqqqq | 0xa0..0xab | 2 |
| 0x3a0..0x1039f | c-0x3a0=00000 ppppqqqq qqrrrrrr | (10101100+pppp) 11qqqqqq 11rrrrrr | 0xac..0xbb | 3 |
| 0x10000..0x10ffff | c-0x10000=0ppqq qqqqrrrr rrssssss | 101111pp 11qqqqqq 11rrrrrr 11ssssss | 0xbc..0xbf | 4 |
Note that the 928 code points from 0x10000..0x1039f can be encoded with either 3 or 4 bytes. They should be encoded with 4 bytes. (This is arbitrary - I like it better this way.) Encoding them with 3 bytes instead results in irregular sequences.
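As a rough sketch of how the table reads in code, the two helpers below return the sequence length for a code point and for a first byte. The names `utf8c1_length` and `utf8c1_count_from_lead` are made up for this illustration, and `unsigned long` is used only so that the values fit on any C implementation.

```c
/* Hypothetical helper: number of UTF-8C1 bytes for code point c.
 * Code points 0x10000..0x1039f get 4 bytes, as recommended in the note.
 * Returns 0 for values above 0x10ffff. */
static int utf8c1_length(unsigned long c) {
    if (c <= 0x9f)     return 1;
    if (c <= 0x39f)    return 2;
    if (c <= 0xffff)   return 3;
    if (c <= 0x10ffff) return 4;
    return 0;
}

/* Hypothetical helper: classify a single byte.
 * Returns the sequence length (1..4) if b starts a sequence,
 * or 0 if b is a trail byte (0xc0..0xff). */
static int utf8c1_count_from_lead(unsigned char b) {
    if (b <= 0x9f) return 1;   /* ASCII, C0 and C1 stand for themselves */
    if (b <= 0xab) return 2;
    if (b <= 0xbb) return 3;
    if (b <= 0xbf) return 4;
    return 0;                  /* trail byte */
}
```

Because lead bytes and trail bytes come from disjoint ranges, a reader that lands in the middle of a sequence can skip bytes until `utf8c1_count_from_lead` returns a non-zero value.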
Signature byte sequence: Like the other Unicode encodings, the byte sequence representing U+feff shall be used as a signature. It is 0xbb 0xed 0xdf.
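For anyone who wants to check the arithmetic, here is a small sketch that derives the three signature bytes from the 3-byte row of the table. The function name `utf8c1_signature` is only an illustration.

```c
/* Worked arithmetic for U+feff in the 3-byte form (offset 0x3a0).
 * Hypothetical helper, written out step by step. */
static void utf8c1_signature(unsigned char out[3]) {
    unsigned int c = 0xfeff - 0x3a0;                     /* 0xfb5f */
    out[0] = (unsigned char)(0xac + (c >> 12));          /* 0xac + 0xf  = 0xbb */
    out[1] = (unsigned char)(0xc0 | ((c >> 6) & 0x3f));  /* 0xc0 | 0x2d = 0xed */
    out[2] = (unsigned char)(0xc0 | (c & 0x3f));         /* 0xc0 | 0x1f = 0xdf */
}
```

Both trail bytes fall in 0xc0..0xff, as every trail byte must.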
The following C code samples do not check for valid code points, valid trail bytes, or array overruns.
Writing UTF-8C1
    unsigned char *s;
    unsigned int c;
    int i;

    /* write code point c into array s starting at index i */
    if(c<=0x9f) {
        s[i++]=(unsigned char)c;
    } else {
        if(c<=0x39f) {
            c-=0xa0;
            s[i++]=0xa0|(c>>6);
        } else {
            if(c<=0xffff) {
                c-=0x3a0;
                s[i++]=0xac+(c>>12);
            } else {
                c-=0x10000;
                s[i++]=0xbc|(c>>18);
                s[i++]=0xc0|((c>>12)&0x3f);
            }
            s[i++]=0xc0|((c>>6)&0x3f);
        }
        s[i++]=0xc0|(c&0x3f);
    }
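If the missing checks matter, the writer could be wrapped roughly as follows. This is only a sketch: the function name `utf8c1_encode`, its parameters, and the convention of returning 0 on failure are assumptions made for this example, and surrogate code points are not rejected.

```c
#include <stddef.h>

/* Hypothetical wrapper: encode c into dest[0..capacity).
 * Returns the number of bytes written, or 0 if c is above 0x10ffff
 * or the buffer is too small. */
static size_t utf8c1_encode(unsigned long c, unsigned char *dest, size_t capacity) {
    size_t n;

    if (c <= 0x9f)          n = 1;
    else if (c <= 0x39f)    n = 2;
    else if (c <= 0xffff)   n = 3;
    else if (c <= 0x10ffff) n = 4;
    else return 0;                       /* not a Unicode code point */
    if (n > capacity) return 0;          /* would overrun dest */

    switch (n) {
    case 1:
        dest[0] = (unsigned char)c;
        break;
    case 2:
        c -= 0xa0;
        dest[0] = (unsigned char)(0xa0 | (c >> 6));
        dest[1] = (unsigned char)(0xc0 | (c & 0x3f));
        break;
    case 3:
        c -= 0x3a0;
        dest[0] = (unsigned char)(0xac + (c >> 12));
        dest[1] = (unsigned char)(0xc0 | ((c >> 6) & 0x3f));
        dest[2] = (unsigned char)(0xc0 | (c & 0x3f));
        break;
    default:
        c -= 0x10000;
        dest[0] = (unsigned char)(0xbc | (c >> 18));
        dest[1] = (unsigned char)(0xc0 | ((c >> 12) & 0x3f));
        dest[2] = (unsigned char)(0xc0 | ((c >> 6) & 0x3f));
        dest[3] = (unsigned char)(0xc0 | (c & 0x3f));
        break;
    }
    return n;
}
```

For example, the euro sign U+20ac comes out as the three bytes 0xad 0xf4 0xcc.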
Reading UTF-8C1
    unsigned char *s;
    unsigned int c;
    int i;

    /* read from array s starting at index i into code point c */
    /* trail bytes are indexed explicitly: several s[i++] in one
       expression would be unsequenced, which is undefined in C */
    c=s[i++];
    if(c>=0xa0) {
        if(c<=0xab) {
            /* 2 bytes */
            c=0xa0+(((c&0xf)<<6)|(s[i]&0x3f));
            i+=1;
        } else if(c<=0xbb) {
            /* 3 bytes */
            c=0x3a0+(((c-0xac)<<12)|((s[i]&0x3f)<<6)|(s[i+1]&0x3f));
            i+=2;
        } else if(c<=0xbf) {
            /* 4 bytes */
            c=0x10000+(((c&3)<<18)|((s[i]&0x3f)<<12)|((s[i+1]&0x3f)<<6)|(s[i+2]&0x3f));
            i+=3;
        } else {
            /* trail byte */
        }
    }
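A reader that does perform the checks might look roughly like this. Again this is only a sketch under assumptions: the name `utf8c1_decode`, the out-parameter, and the "return 0 on any error" convention are invented for the example. It treats the 3-byte form of 0x10000..0x1039f as an error, which is stricter than merely calling it irregular.

```c
#include <stddef.h>

/* Hypothetical checked reader: decodes one code point from src[0..length).
 * On success stores it in *pc and returns the number of bytes consumed.
 * Returns 0 for empty or truncated input, a trail byte in lead position,
 * a non-trail byte where a trail byte is expected, or an irregular
 * 3-byte form of 0x10000..0x1039f. */
static size_t utf8c1_decode(const unsigned char *src, size_t length, unsigned long *pc) {
    unsigned long c;
    size_t n, k;

    if (length == 0) return 0;
    c = src[0];
    if (c <= 0x9f) { *pc = c; return 1; }

    if (c <= 0xab)      { n = 2; c &= 0xf; }
    else if (c <= 0xbb) { n = 3; c -= 0xac; }
    else if (c <= 0xbf) { n = 4; c &= 3; }
    else return 0;                        /* trail byte in lead position */
    if (length < n) return 0;             /* truncated sequence */

    for (k = 1; k < n; ++k) {
        if (src[k] < 0xc0) return 0;      /* not a trail byte */
        c = (c << 6) | (src[k] & 0x3f);
    }

    /* add the offset of the range that this sequence length covers */
    if (n == 2)      c += 0xa0;
    else if (n == 3) c += 0x3a0;
    else             c += 0x10000;

    if (n == 3 && c > 0xffff) return 0;   /* irregular: should use 4 bytes */
    *pc = c;
    return n;
}
```

Fed the signature bytes 0xbb 0xed 0xdf, it returns 3 and stores 0xfeff.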