UTF-8C1: A Safe and Simple Unicode Encoding Form

Markus W. Scherer, 2000-Mar-12

This is a Gedankenexperiment for what could have been UTF-8 if it had been defined after the Unicode range was set to that of UTF-16, with code points up to only 0x10ffff and not up to UCS-4's 0x7fffffff. I realize that this comes about 7 years too late, but I like the elegance that is possible with a custom-fit encoding form. Be my guest.

The name indicates that this is an encoding form that uses 8-bit code units and is C1-control-code-safe.

Important: This is not an approved encoding for Unicode nor for ISO 10646. It is also not a proposal for a new UTF. It is only meant as a "what if". UTF-8C1 is not compatible with UTF-8.

Design goals:

Encoding:

    Code range         Bits of code points                UTF-8C1 bytes                        Lead bytes  Number of bytes
    0..0x9f            00000 00000000 pppppppp            pppppppp                             -           1
    0xa0..0x39f        c-0xa0=00000 000000pp ppqqqqqq     1010pppp 11qqqqqq                    0xa0..0xab  2
    0x3a0..0x1039f     c-0x3a0=00000 ppppqqqq qqrrrrrr    (10101100+pppp) 11qqqqqq 11rrrrrr    0xac..0xbb  3
    0x10000..0x10ffff  c-0x10000=0ppqq qqqqrrrr rrssssss  101111pp 11qqqqqq 11rrrrrr 11ssssss  0xbc..0xbf  4

Note that the 928 code points 0x10000..0x1039f can be encoded with either 3 or 4 bytes. They should be encoded with 4 bytes. (This is arbitrary - I like it better this way.) Encoding them with 3 bytes instead results in irregular sequences.
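
The ranges in the table, together with the 4-byte choice for 0x10000..0x1039f, can be summarized in a small length function. This is only a sketch: the name utf8c1_length is hypothetical, and returning 0 for values above 0x10ffff is an assumption made here, not something the encoding defines.

```c
/* hypothetical helper: number of UTF-8C1 bytes used for code point c,
   or 0 if c is above 0x10ffff (an assumption of this sketch) */
static int utf8c1_length(unsigned long c) {
    if(c<=0x9f) {
        return 1;   /* 0..0x9f is stored as a single byte */
    } else if(c<=0x39f) {
        return 2;   /* 0xa0..0x39f */
    } else if(c<=0xffff) {
        return 3;   /* 0x3a0..0xffff */
    } else if(c<=0x10fffful) {
        /* 0x10000..0x10ffff; this includes 0x10000..0x1039f, which
           would also fit into 3 bytes but is encoded with 4 */
        return 4;
    } else {
        return 0;   /* not a Unicode code point */
    }
}
```

For example, utf8c1_length(0xffff) yields 3 while utf8c1_length(0x10000) yields 4, even though 0x10000-0x3a0 would still fit into the 16 bits of a 3-byte sequence.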

Signature byte sequence: Like the other Unicode encodings, the byte sequence representing U+feff shall be used as a signature. It is 0xbb 0xed 0xdf.
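
A minimal sketch of how a reader might test for the signature. It derives the three bytes from the 3-byte row of the table instead of hard-coding them. The function name utf8c1_signature_length and its return convention are hypothetical.

```c
#include <stddef.h>

/* hypothetical helper: returns 3 if the buffer starts with the UTF-8C1
   form of U+feff, otherwise 0 */
static size_t utf8c1_signature_length(const unsigned char *s, size_t length) {
    /* apply the 3-byte rule to U+feff */
    const unsigned int c=0xfeff-0x3a0;                              /* 0xfb5f */
    const unsigned char b0=(unsigned char)(0xac+(c>>12));           /* 0xbb */
    const unsigned char b1=(unsigned char)(0xc0|((c>>6)&0x3f));     /* 0xed */
    const unsigned char b2=(unsigned char)(0xc0|(c&0x3f));          /* 0xdf */

    if(length>=3 && s[0]==b0 && s[1]==b1 && s[2]==b2) {
        return 3;
    }
    return 0;
}
```

A caller would skip the returned number of bytes before decoding the rest of the text.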

Properties and comparison with UTF-8

Sample code

The following sample C code pieces do not check for valid code points, valid trail bytes, or array length overrun.

Writing UTF-8C1

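    unsigned char *s;
    unsigned int c;
    int i;

    /* write code point c into array s starting at index i */
    if(c<=0x9f) {
        /* 1 byte: ASCII and the C1 control codes are written unchanged */
        s[i++]=(unsigned char)c;
    } else {
        if(c<=0x39f) {
            /* 2 bytes: lead byte 0xa0..0xab */
            c-=0xa0;
            s[i++]=(unsigned char)(0xa0|(c>>6));
        } else {
            if(c<=0xffff) {
                /* 3 bytes: lead byte 0xac..0xbb */
                c-=0x3a0;
                s[i++]=(unsigned char)(0xac+(c>>12));
            } else {
                /* 4 bytes: lead byte 0xbc..0xbf and the first trail byte */
                c-=0x10000;
                s[i++]=(unsigned char)(0xbc|(c>>18));
                s[i++]=(unsigned char)(0xc0|((c>>12)&0x3f));
            }
            /* next-to-last trail byte, shared by 3- and 4-byte sequences */
            s[i++]=(unsigned char)(0xc0|((c>>6)&0x3f));
        }
        /* last trail byte, shared by all multi-byte sequences */
        s[i++]=(unsigned char)(0xc0|(c&0x3f));
    }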
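
The sample above omits all checks. The following is one possible checked variant, offered only as a sketch: the function name utf8c1_write, its parameters, and the convention of returning 0 on error are hypothetical. It rejects values above 0x10ffff and refuses to write past the end of the array.

```c
#include <stddef.h>

/* hypothetical checked writer: appends code point c to s at index *pi
   without writing at or beyond index capacity.
   Returns the number of bytes written, or 0 on error
   (the error convention is an assumption of this sketch). */
static int utf8c1_write(unsigned char *s, size_t *pi, size_t capacity, unsigned long c) {
    size_t i=*pi;
    int n;

    if(c<=0x9f) {
        n=1;
    } else if(c<=0x39f) {
        n=2;
    } else if(c<=0xffff) {
        n=3;
    } else if(c<=0x10fffful) {
        n=4;
    } else {
        return 0;   /* not a Unicode code point */
    }
    if(i>capacity || capacity-i<(size_t)n) {
        return 0;   /* not enough room left in s */
    }

    switch(n) {
    case 1:
        s[i++]=(unsigned char)c;
        break;
    case 2:
        c-=0xa0;
        s[i++]=(unsigned char)(0xa0|(c>>6));
        s[i++]=(unsigned char)(0xc0|(c&0x3f));
        break;
    case 3:
        c-=0x3a0;
        s[i++]=(unsigned char)(0xac+(c>>12));
        s[i++]=(unsigned char)(0xc0|((c>>6)&0x3f));
        s[i++]=(unsigned char)(0xc0|(c&0x3f));
        break;
    default:
        c-=0x10000ul;
        s[i++]=(unsigned char)(0xbc|(c>>18));
        s[i++]=(unsigned char)(0xc0|((c>>12)&0x3f));
        s[i++]=(unsigned char)(0xc0|((c>>6)&0x3f));
        s[i++]=(unsigned char)(0xc0|(c&0x3f));
        break;
    }
    *pi=i;
    return n;
}
```

Unlike the compact sample, this variant spells out each sequence length separately, which costs a few lines but keeps the bounds check in one place.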
    

Reading UTF-8C1

    unsigned char *s;
    unsigned int c;
    int i;

    /* read from array s starting at index i into code point c */
    /* trail bytes are indexed explicitly: several s[i++] within one
       expression would be unsequenced and thus undefined in C */
    c=s[i++];
    if(c>=0xa0) {
        if(c<=0xab) {
            /* 2 bytes */
            c=0xa0+(((c&0xf)<<6)|(s[i]&0x3f));
            i+=1;
        } else if(c<=0xbb) {
            /* 3 bytes */
            c=0x3a0+(((c-0xac)<<12)|((s[i]&0x3f)<<6)|(s[i+1]&0x3f));
            i+=2;
        } else if(c<=0xbf) {
            /* 4 bytes */
            c=0x10000+(((c&3)<<18)|((s[i]&0x3f)<<12)|((s[i+1]&0x3f)<<6)|(s[i+2]&0x3f));
            i+=3;
        } else {
            /* trail byte */
        }
    }
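
As with writing, a checked reader is possible. The sketch below is one way to do it, under assumptions of my own: the name utf8c1_read, the length parameter, and the -1 error value are all hypothetical. It rejects a stray trail byte in lead position, a truncated sequence, and a byte below 0xc0 where a trail byte is expected. It does accept the irregular 3-byte forms of 0x10000..0x1039f.

```c
#include <stddef.h>

/* hypothetical checked reader: decodes one code point from s[*pi..length-1].
   Returns the code point and advances *pi, or returns -1 and leaves *pi
   unchanged on error (the error convention is an assumption of this sketch). */
static long utf8c1_read(const unsigned char *s, size_t *pi, size_t length) {
    size_t i=*pi;
    unsigned long c, offset;
    int trail, k;

    if(i>=length) {
        return -1;              /* nothing left to read */
    }
    c=s[i++];
    if(c<=0x9f) {
        trail=0; offset=0;
    } else if(c<=0xab) {
        c&=0xf;  trail=1; offset=0xa0;
    } else if(c<=0xbb) {
        c-=0xac; trail=2; offset=0x3a0;
    } else if(c<=0xbf) {
        c&=3;    trail=3; offset=0x10000ul;
    } else {
        return -1;              /* trail byte where a lead byte was expected */
    }
    if(length-i<(size_t)trail) {
        return -1;              /* sequence is cut off by the end of the array */
    }
    for(k=0; k<trail; ++k) {
        unsigned char t=s[i++];
        if(t<0xc0) {
            return -1;          /* not a trail byte */
        }
        c=(c<<6)|(t&0x3fu);
    }
    *pi=i;
    return (long)(c+offset);
}
```

Because the 4-byte form carries exactly 20 bits on top of 0x10000, a well-formed sequence can never decode to a value above 0x10ffff, so no separate range check is needed.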