
Thursday, January 15, 2009

XML encoding auto-detection

The Factor XML parser now auto-detects the encoding of XML documents. This works for all of the encodings that Factor implements. To see how it's done, look at the XML standard, which explains it much better than the blog post below does.

I was mystified myself when I first read that XML documents can specify, in the document itself, what encoding they are in. The encoding, if it's not UTF-8, must be specified in the prolog like this:
<?xml version="1.0" encoding="ISO-8859-1"?>

The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the fact that there's a null byte before or after the <. If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.

To implement it, I just read byte by byte and have a case statement for each level. After just two octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. Altogether, it's about 40 or 50 lines of code. It's available now in the Factor git repository.
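Here's a rough sketch of the same case analysis in Python, not the Factor implementation (the function name and the regular expression are mine, for illustration only):

import re

def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding from its first few bytes."""
    # A byte order mark pins the encoding down immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # No BOM: the document must begin with '<' or whitespace. A null byte
    # next to the '<' betrays BOM-less UTF-16 and tells us the endianness.
    if data.startswith(b"\x00<"):
        return "UTF-16BE"
    if data.startswith(b"<\x00"):
        return "UTF-16LE"
    # Otherwise it's some ASCII-compatible 8-bit encoding (or UTF-8), so the
    # prolog is plain ASCII: read it as UTF-8 and honor any declaration.
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return m.group(1).decode("ascii") if m else "UTF-8"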

[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo, see comments.]

Thursday, January 8, 2009

I'm back

Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. Now, I've decided to take three months off from school to work on Factor full-time. My plan is to first work on finishing up Unicode and XML, and then try to improve Factor's garbage collector.

For Unicode, over the past couple days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.

For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Previously, Chris Double has made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.

For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.

Saturday, June 7, 2008

A second introduction to Unicode

If you're a normal programmer, you probably really don't want to have to think about Unicode. In what you're doing, text processing probably isn't a very important aspect, and most of your users will be using English. Nevertheless, text has a tendency to creep its way into almost everything, as the basic computer/human interface. So it might be a little beneficial to know about the basics of text processing and the Unicode character set.

A lot of people have written blog posts which are introductions to Unicode, and I didn't want to write another one with no new information in it. A popular one is Joel (on Software)'s, which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:
  • You can't assume text is in ASCII anymore. This isn't just about being nice to non-English speakers. Even Americans enjoy their “curly quotes”, their cafés—and their em dashes. User input might come with these non-ASCII characters, and it must be handled properly by robust applications.
  • Unicode is the character set to use internally. A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any other single character set. Unicode was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher level language, then strings should already be stored in some representation that allows for Unicode uniformly.
  • There are several text encodings. Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally. UTF-16 and UTF-8 are the most common, and careless programming can produce meaningless results in non-ASCII or non-BMP cases if the encoding is ignored.
  • Unicode encodes characters, not glyphs. Unicode can be seen as a mapping between numbers and characters, where a code point (the number assigned to a character) is the basic unit of Unicode. It's been decided that this basic unit is for characters, like letters and spaces, rather than specific presentation forms, which are referred to as glyphs. Glyphs are something that only font designers and people who work on text rendering have to care about.

But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. They don't have to be able to implement them, just to figure out where in the library they are.
  • Normalization: Because of complications in the design, some Unicode strings have more than one possible form, and these forms are actually equivalent. There are a number of normalization forms defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality (see the sketch after this list). This is independent of locale.
  • Grapheme, word, sentence and line breaks: It's not true, anymore, that a single character forms a single unit for screen display. If you have a q with an umlaut over it, this needs to be represented as two characters, yet it is one grapheme. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression which tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.
  • Bidirectional text: When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Text Algorithm (BIDI), which should be implemented in your Unicode library. Locale doesn't matter here.
  • Case conversion: Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. Accent marks and other scripts should be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.
  • Collation: There's an algorithm for Unicode collation that works much better than sorting by ASCII value, and it works reasonably for most languages. Depending on the locale, it should be modified. Even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for insensitive comparisons, e.g. ignoring case.
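As a tiny illustration of the first and fourth points, here's what normalization and full case mapping look like in Python's standard library (a sketch for illustration, nothing Factor-specific):

import unicodedata

# "é" as one precomposed code point vs. "e" plus a combining acute accent:
# the strings differ code point by code point but are canonically equivalent,
# and NFC normalization makes them compare equal.
precomposed = "\u00e9"       # é
decomposed = "e\u0301"       # e + COMBINING ACUTE ACCENT
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# Case conversion is more than [a-z] <-> [A-Z]: German ß uppercases to SS.
assert "straße".upper() == "STRASSE"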

For further reading, you can look at this much more in-depth article from SIL, or the Unicode 5.1 standard itself, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't. But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.

Sunday, May 25, 2008

Unicode collation works now

This morning I got Unicode collation to pass all of the 130,000+ unit tests. It was a bit more difficult than I imagined, and it's still far from complete for a number of reasons. The whole thing is less than 200 lines (the core algorithm in about 100) in the unicode.collation vocab in the working version of Factor in git. Here's what I figured out:
  • Weird format of UnicodeData.txt: It's not documented anywhere that I can find, but the UnicodeData.txt resource file has a weird range format for specifying certain properties, including the character classes used in collation. A range looks just like two regular lines, but the character names are special markers that have to be parsed (a parsing sketch follows this list). For example, lines that look like
    100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
    10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
    mean that all of the characters in the range U+100000 to U+10FFFD have the category Co, the combining class 0, etc.
  • My normalization bugs: Working on this uncovered a bunch of bugs in older code, including that my conjoining Jamo behavior inserted nonexistent terminator characters.
  • Triple contractions: The UCA specifies that collation graphemes should be found by checking if an adjacent character or non-blocked combining character has a contraction with a previous character. But this incremental approach doesn't work for four of the contractions listed in the DUCET, which consist of three elements rather than two and whose first two elements don't themselves form a contraction. So a simple identity contraction for the first two elements of each of those must be added.
  • Combining character contractions: Apparently, two combining marks can form a contraction. A straight reading of the UCA wouldn't predict this, but not all of the UCA tests pass unless you check for non-adjacent combining marks being in a contraction together, without a noncombining mark to start it off.
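Here's roughly what folding those First/Last pairs into ranges looks like, sketched in Python rather than Factor (the helper name parse_unicode_data is made up):

def parse_unicode_data(lines):
    """Yield (first, last, general_category, combining_class) entries,
    folding the '<..., First>' / '<..., Last>' line pairs into one range."""
    pending = None  # start code point of an open '<..., First>' range
    for line in lines:
        fields = line.strip().split(";")
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending = code
        elif name.endswith(", Last>"):
            yield pending, code, fields[2], int(fields[3])
            pending = None
        else:
            yield code, code, fields[2], int(fields[3])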

And here's what I still have to do:
  • Korean stuff: Because of some disagreement with the ISO people, the DUCET doesn't actually decide the best way to sort Korean. Instead, three methods are provided, all of which require modifying the table. I don't really understand the issue right now.
  • Tailoring for locales: Actually, heh, the default algorithm is inaccurate for any specific locale you might be in. And, for human interfaces, it's actually pretty important that the sort order corresponds to expectations. So, if you want to do sorting that's correct, you have to modify the data. Unfortunately, the standard doesn't go into the specific algorithms for tailoring the table, though there is data available through the Common Locale Data Repository (CLDR).
  • Efficiency: My implementation is both time and space inefficient, because I paid absolutely no attention to either; solving the basic problem was hard enough (for me). Collation keys should be made shorter, and they should be made in fewer passes over the string.


Update: Here's an overview of the words that are defined in the vocabulary. It's about the minimum that any UCA implementation should have, in my opinion:
  • sort-strings: This word takes a sequence of strings and sorts them according to the UCA, using code point order as a tie-breaker.
  • collation-key: This takes a string and gives a representation of the collation key, which can be compared with <=>.
  • string<=>: This word takes two strings and compares them using the UCA with the DUCET, using code point order as a tie-breaker.
  • primary=: This checks whether the first level of collation is identical. This is the least specific kind of equality test. In Latin script, it can be understood as ignoring case, punctuation and accent marks.
  • secondary=: This checks whether the first two levels of collation are equal. For Latin script, this means accent marks are significant again.
  • tertiary=: Along the same lines as secondary=, but case is significant.
  • quaternary=: This is a little less typical (and definitely non-essential, unlike the other things), and it's my own nomenclature, but it makes punctuation significant again, while still leaving out things like null bytes and Hebrew vowel marks, which mean absolutely nothing in collation.

Friday, May 23, 2008

Little things I've been working on

I've been working on a few different things that, individually, are too insignificant for a blog post, so I'll put them together.

One is that I expanded my previous survey of algorithms for XML diffing; the result is here [PDF].

I've been working on a few libraries in Factor. One is extra/delegate, which now interacts properly with reloading. For example, if you define a protocol, then say that a class consults something for that protocol, and then redefine the protocol to include more generic words, the consulting class will be automatically updated. Unfortunately, this doubled the size of the code, give or take. Slava changed duplex-streams to use extra/delegate, and the code is much simpler now, as it previously amounted to manual delegation. I got rid of mimic because it's unsafe and violates encapsulation in unexpected ways.

Another little thing I made was extra/lcs, a library for calculating Levenshtein distance between two strings, the longest common subsequence of two strings, and an edit script between two strings. Because the LCS problem and Levenshtein distance are duals, I was able to share most of the code between them. I used the simple quadratic time and space algorithm that Wikipedia describes rather than the better O(nd) algorithm [PDF] commonly called the GNU diff algorithm. I'll upgrade it to this once I understand the algorithm. This replaces extra/levenshtein. I expected it to be significantly faster, because the old code used dynamically scoped variables and this uses statically scoped locals, but the speed improvement turned out to be just around 4% in small informal benchmarks on short strings.
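Here's the textbook quadratic algorithm sketched in Python for reference (this isn't the Factor code in extra/lcs, just the recurrence it implements):

def levenshtein(a: str, b: str) -> int:
    """Edit distance with the classic O(len(a) * len(b)) table."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]

The LCS table has the same shape and looks at the same neighboring cells (taking a maximum instead of a minimum), which is why the two can share most of their code.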

Now, I'm working on the Unicode Collation Algorithm. The basics were simple, but I'm still unsure how to recognize collation graphemes efficiently in general. Either way, I discovered a bug in normalization: my insertion sort, used for canonical ordering of combining marks, wasn't a stable sort as required for normalization. It was actually an anti-stable sort: it reversed subsequences which were of the same sort key. That was really stupid of me. I'm going to work on incorporating existing test suites for things like this. For the test suite for collation, all but 8000 of 130,000 tests pass, making it far from ready.

Wednesday, May 7, 2008

Interval maps in Factor

Recently, I wrote a little library in Factor to get the script of a Unicode code point. It's in the Factor git repository in the vocab unicode.script. Initially, I used a relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of char>num-table nth num>name-table nth. But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200KB. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are 100,000 code points whose script I wanted to encode.

But we can do better than taking up 100KB. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The file in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than writing out each code point individually. So what if we store these intervals directly rather than storing each element of the intervals?

A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of Introduction to Algorithms starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.

A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple binary search. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.

(Note: the intervals here are closed and must be disjoint. <=> must be defined on them. They don't use the intervals in math.intervals, both to save space and because those are overkill here. Interval maps don't follow the assoc protocol because intervals aren't discrete; e.g. floats are acceptable as keys.)

First, the tuples we'll be using: an interval-map is the whole associative structure, containing a single slot for the underlying array.

TUPLE: interval-map array ;

That array consists of interval-nodes, which have a beginning, end and corresponding value.

TUPLE: interval-node from to value ;

Let's assume we already have the sorted array of intervals. Given a key and an interval map, find-interval will give the index of the interval which might contain the given key.

: find-interval ( key interval-map -- i )
[ from>> <=> ] binsearch ;

interval-contains? tests if a node contains a given key.

: interval-contains? ( object interval-node -- ? )
[ from>> ] [ to>> ] bi between? ;

Finally, interval-at* searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.

: fixup-value ( value ? -- value/f ? )
[ drop f f ] unless* ;

: interval-at* ( key map -- value ? )
array>> [ find-interval ] 2keep swapd nth
[ nip value>> ] [ interval-contains? ] 2bi
fixup-value ;

A few convenience words, analogous to those for assocs:

: interval-at ( key map -- value ) interval-at* drop ;
: interval-key? ( key map -- ? ) interval-at* nip ;

So, to construct an interval map, there are a few things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first element is the beginning of an interval and the second is the end, or (2) numbers, representing an interval of the form [a,a]. This can be converted into a form where everything is of kind (1) with the following:

: all-intervals ( sequence -- intervals )
[ >r dup number? [ dup 2array ] when r> ] assoc-map
{ } assoc-like ;

Once that is done, the objects should be converted to intervals:

: >intervals ( specification -- intervals )
[ >r first2 r> interval-node boa ] { } assoc>map ;

After that, and after the intervals are sorted, we need to ensure that all intervals are disjoint. For this, we can use the monotonic? combinator, which checks that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)

: disjoint? ( node1 node2 -- ? )
[ to>> ] [ from>> ] bi* < ;

: ensure-disjoint ( intervals -- intervals )
dup [ disjoint? ] monotonic?
[ "Intervals are not disjoint" throw ] unless ;

And, to put it all together, using a tuple array for improved space efficiency:

: <interval-map> ( specification -- map )
all-intervals [ [ first second ] compare ] sort
>intervals ensure-disjoint >tuple-array
interval-map boa ;

All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay!

Friday, March 14, 2008

A protocol for creating encodings

I previously wrote about the API that I designed for creating streams with encodings in Factor. I'm not sure if that's going to stick around permanently in this form, due to concerns about easily changing stream encodings and grouping encodings with a pathname as one object on the stack.

Either way, I wanted to describe the protocol I'm developing for actually defining new encodings in Factor. This code isn't completely debugged but it should be done and in the main Factor development repository very soon. There are four words in the encoding protocol:

GENERIC: <encoder> ( stream encoding -- encoder-stream )
GENERIC: <decoder> ( stream decoding -- decoder-stream )
GENERIC: encode-char ( char stream encoding -- )
GENERIC: decode-char ( stream decoding -- char/f )

Let's go through these. First, the constructors <encoder> and <decoder>. These are very rarely called directly by the library user, more often by stream constructors. For example, when you do "filename" utf8 <file-reader>, what's going on underneath is "filename" (file-reader) utf8 <decoder>. (file-reader) is a low-level constructor that gives you a binary stream, and <decoder> wraps it in a decoded stream using the specified encoding descriptor, utf8.

I have some slightly funny methods on <encoder> and <decoder>. See, right now, all encodings are tuples, and their abstract descriptors are tuple classes. All tuple class symbols are in the class tuple-class, and all tuples are in the class tuple. So we can define methods on the two constructor words: for tuple classes, one which makes an empty instance of the encoding tuple class and calls the constructor again; and for encoding tuples, one which actually puts together the instance of the physical encoder or decoder tuple. Here's how it looks:

M: tuple-class <decoder> construct-empty <decoder> ;
M: tuple <decoder> f decoder construct-boa ;

M: tuple-class <encoder> construct-empty <encoder> ;
M: tuple <encoder> encoder construct-boa ;

One reason these need to be generic is for things like binary streams, where methods on these generic words are implemented as dummies: a binary encoding is just the lack of encoding.

TUPLE: binary ;
M: binary <encoder> drop ;
M: binary <decoder> drop ;

Another reason is that certain encodings require processing at the beginning. For example, UTF-16 should write a byte order mark (BOM) immediately when it's initialized for writing, and read a BOM immediately when it's initialized for reading.

M: utf16 <decoder> ( stream utf16 -- decoder )
2 rot stream-read bom>le/be <decoder> ;

M: utf16 <encoder> ( stream utf16 -- encoder )
drop bom-le over stream-write utf16le <encoder> ;

Now, let's look at the other words. The idea of encode-char and decode-char is that it's simpler for encodings to encode or decode one character than to implement all the relevant functions of the stream protocol. encode-char takes a character, an underlying stream and an encoding, and writes the character to that underlying stream.

The inverse, decode-char, takes an underlying stream and an encoding and uses the encoding to pull a character from the stream. For everything I've implemented so far, the encoding is dropped after method dispatch, but when things like Shift JIS, which require state in decoding, are implemented, the state will be stored in the tuple.

This is all much simpler than my previous design, which required looping to decode a single character and forced encodings to adopt a complicated state-machine-based model. This is something like the third iteration of the encoding protocol I've made, and the code is finally starting to look good.

In Factor, it takes a little bit of work to make certain things, like encodings, have clean code. The appropriate abstractions don't fall out as immediately obvious, but eventually they're found. The result is far more maintainable and clean. I'm not sure what this would imply on big projects with bad programmers. (But I plan to never work in an environment like that; better to be an academic if good work in a small company can't be found.)

Anyway, future pie-in-the-sky plans for encodings include treating cryptographic protocols and compression as encodings (under different protocols, of course). This is really cool: there are five orthogonal layers: stream, cryptography, compression, text encoding and usage. It'll be possible to compose them and factor out their compositions in any way you want! But this doesn't exist, so I probably shouldn't even be talking about it.

Update: Fixed stupid typos. Thanks Slava!

Sunday, February 17, 2008

Designing an API for encoded streams

When I started looking at Unicode to design a good library for Factor, I wanted to make an API such that the programmer never needed to think about Unicode at all. I now see that that's impossible, for a number of reasons. One thing that the programmer needs to explicitly think about is the encoding of files. For this reason, I'm in the middle of changing lots of words which deal with streams to take an extra mandatory parameter specifying the encoding. The encodings supported so far are binary, ascii, latin1, utf8, utf16 and some more. In the library, we'll eventually put Shift JIS, more 8-bit encodings like MacRoman, Windows 1252 and the other ISO-8859s, UTF-32, etc. Internally, all strings are already in Unicode; this is only for external communication.

Mandatory?!

Some people objected to this. Why should there be a new mandatory parameter when everything worked already? This makes code longer, rather than shorter! The rationale is that this expands the functionality of streams. With the old functionality, everything is treated as if it's encoded in Latin 1. But in 99% of cases, this is just wrong. When a text file isn't in plain old ASCII, it's almost always in UTF-8 (though occasionally it's in UTF-16 or Shift JIS). Things are rarely in an 8-bit encoding because of the ambiguity; 8-bit non-ASCII encodings can safely be labeled "legacy" except in specialized low-resource applications. Right now things work as if Latin 1 were the encoding for all streams, but if we want to do much actual text processing, things will come out wrong.

Even if UTF-8 is used most of the time, could we use a heuristic to determine what encoding things are in? If we know the file is in either ASCII, UTF-8, UTF-16 or UTF-32, it's not too hard to come up with some kind of heuristic that works in almost all cases. But once things get generalized to Shift JIS and 8-bit encodings, it's basically impossible to determine generally how things are encoded. And it's completely impossible if there are binary streams, or for output streams altogether.

So let's make UTF-8 the default encoding. Any stream which doesn't want to use UTF-8 should have its instantiation followed by some set-encoding word. But what about binary streams? These aren't uncommon and are needed for things like audio, video, compressed data and Microsoft Word documents. If UTF-8 is the default encoding, it'd be easy to open a file for reading or writing, forgetting that it's in UTF-8, and then writing stuff to it as if it's a binary stream. But if we make the encoding a mandatory explicit parameter, then nobody will forget: if you want to open a stream reading it as UTF-8, you can do utf8 <file-reader>, and if you want to open it as binary, you can do binary <file-reader>. Writing utf8 or binary isn't just boilerplate: it actually indicates some information about how things should work. And for those situations where you want some other encoding, that can be specified just as easily.

Now, do we really want to prefix each stream constructor with an encoding, or can it be determined, explicitly, in the context somehow? There are two ways to scope this, lexical and dynamic, and they both fail. With dynamic scoping, composability is broken: one piece of code makes some assumption about the encoding—say, that the encoding is UTF-8, which could be the default global encoding—and then the caller sets it to something else, and the assumption silently breaks. So the encoding must be set lexically. But when I looked at actual code samples, I saw it'd be more trouble than it's worth to have a lexically scoped encoding: nearly all words which open streams which need an encoding only need one or two. You're at most writing the same encoding twice, but it ends up being fewer words than a whole scope declaration (which needs, at a minimum, brackets, the encoding name and something declaring that this is a scope for encoding purposes). What about vocab-level scoping? It could work, but it'd have to be overridden in too many cases to be useful, since it's not unrealistic to have a vocab which uses UTF-8 for half of its streams and binary for the other half.

One other thing that's useful and not particularly common in these sorts of libraries is the fact that the encoding can be changed after the stream is initialized. This is useful for things like XML, where the encoding can be declared in the prolog, and certain network protocols like HTTP and SMTP which allow the encoding to be specified in the header, so the encoding needs to change on the fly. I can only assume that previous implementations of this took everything as binary and used string processing routines to get things in and out of the right encoding.

You might think of this as a standard library cruelly forcing everyone to specify every little detail, but I think of it a little differently: the file I/O API encourages programmers to think about the encodings of their files. We could go the other way, still, and use UTF-8 as the default, but it'd create some strange and unreadable bugs. Any default is bad. All other stream APIs I've looked at make this optional, but no matter which way you go this makes misleading assumptions for programmers.

Specifics

<file-reader>, <file-writer>, <file-appender>, <client> and <server> will now take an extra argument of an encoding descriptor, making them have the stack effect ( path/addrspec encoding -- stream ). file-contents and file-lines also take an encoding from the top of the stack. process-stream's encodings are in the descriptor, as a possible value for stdin, stdout or stderr, indicating that those values will be sent to/from Factor as a stream of the given encoding. If you're dealing with files, the process you call should handle all encoding issues. Some streams, like HTML streams and pane streams, don't need changes, since their encoding is unambiguous. You also don't need to specify the encodings of file and process names, since those are OS-specific and handled by the Factor library.

In addition to <string-reader>s and <string-writer>s that already exist and remain unchanged (they don't need an encoding since everything that goes on there is in Factor's internal Unicode encoding), there are now also <byte-reader>s and <byte-writer>s which do have an encoding as a parameter. Byte readers and writers work on an underlying byte vector, and provide the same encodable interface that files do, because an array of bytes, unlike a string, can take multiple interpretations as to the code points it contains.

I renamed with-file-out to with-file-writer, with-file-in to with-file-reader, string-in to with-string-reader and string-out to with-string-writer for consistency. Additionally, there are now also words with-byte-reader and with-byte-writer. Since byte and file readers and writers need an encoding, in these combinators I've put the encoding before the quotation. It could be the other way around, and really this was an arbitrary choice. Conceptually, you can think of it like the file name or byte array and the encoding form a sort of unit, so they're consistently adjacent in the words which use them.

I've made all the updates to everyone's software in my local branch, so you don't have to worry about implementing these changes. You might want to go back and look at your code to make sure the encoding I chose was sane. 90% of the time it's binary or UTF-8, occasionally ASCII. It's usually clear-cut. Also, I never had to make more than 3 or 4 updates in a single file.

It'd be nice if things were simpler, and nobody had to consider encodings at all except for Unicode library writers. Theoretically, this could be solved by a standard way to denote, inside the file, what encoding the rest of the file is in. But if we did that, then multiple competing conventions for declaring the encoding might emerge, and we'd have to explicitly choose among them! It'd be even better if the filesystem had metadata on this, but it doesn't. Maybe, on the Factor end, there's a place for having an abstraction over the locations of resources grouped with a description of their type (either encoding or filetype). But either way, encodings just aren't simple enough to allow programmers not to think about them.

Update: Added more info about specifics. It's been taking me a little longer than I initially thought to get this whole thing working with Factor, so this stuff still isn't in the main branch, though you can see the progress in the unicode branch of my repository. Bootstrapping will take a little work, though. The changes have been integrated into Factor! Thanks, Slava, for making it all work.

Thursday, January 31, 2008

The most common Unicode-processing bug

The most common Unicode-processing bug is as pervasive as it is trivial: UTF-8 is confused for some 8-bit repertoire and encoding, or the other way around. Most commonly it's something like Windows-1252, a poorly specified superset of ISO 8859-1 (Latin 1). This, I'm fairly sure, is the source of all of those "funny Unicode characters".

For example, on the Unicode mailing list digest as read in GMail, I received a letter a few days ago which opened, "I don’t understand..." I'm not sure where in the transmission line this happened, but I'm sure that nobody typed those characters. More likely, they used a single right quotation mark, U+2019 (or rather their email program did), which is encoded in UTF-8 as 0010 0000 0001 1001 ==> 11100010 10000000 10011001 = 0xE2 0x80 0x99, and those three bytes display as ’ when read as Windows-1252.
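You can reproduce this kind of mangling in two lines of Python (just an illustration of the byte-level confusion, not a claim about what software was actually involved):

# UTF-8-encode the right single quotation mark, then misread the three
# bytes 0xE2 0x80 0x99 as Windows-1252: the familiar mojibake appears.
mangled = "\u2019".encode("utf-8").decode("cp1252")
print(mangled)   # â€™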

(Note: the giveaway that this is Windows-1252 and not Latin 1 is the Euro sign, which is a relatively recent invention and not encoded into any official ISO 8859 that people actually use*. In all ISO 8859s, 0x80 is reserved as a kind of fake control character.)

Here's how it might have happened: the email program declared that everything was in Windows-1252, though it was actually UTF-8, and the mailing list server dutifully decoded it as Windows-1252 into the corresponding Unicode code points. Alternatively, maybe no encoding was specified, and since Windows-1252 is a superset of Latin 1, which in turn is a superset of ASCII, it was used as a "maximally tolerant" assumed encoding where ASCII would be the normal default. Alternatively, maybe the mailing list itself failed to specify what encoding it was using, and GMail made a similar mistake. This is more likely, as things consistently appear this way for me when reading the Unicode mailing list.

This bug is at once easy to fix and impossible. Because compatibility with legacy applications needs to be maintained, it's difficult to change the default encoding of things. So, everywhere it's possible, things need to say explicitly what their encoding is, and applications need to process this information properly.

Still, do we care more about maintaining compatibility with legacy applications or getting things working today? In practice, almost everything is done in UTF-8. So it should be fine to just assume that encodings are always UTF-8, legacy programs be damned, to get maximum utility out of new things.

Well, as it turns out, that's not always the best thing to do. Someone recently told me that he suspects a Unicode bug in Amarok: it wasn't correctly loading his songs that had German in the title, though it worked correctly on a previous installation. Instead, I think the bug was in incompatible default settings for GNU tar or ISO format. The songs used to have accented letters, but the files were transferred onto a different computer. Now, those letters were single question marks when viewed in Konqueror, and Amarok refused to open them, giving a cryptic error message.

UTF-8 is fault-tolerant, and a good decoder will replace malformed octets with a question mark or replacement character and move on. This is probably exactly what happened: the title of a song contained a character, say ö, which was encoded in Latin 1 as 0xF6, followed by something in ASCII. The song title was encoded in Latin 1 when the file system expected UTF-8. The UTF-8 decoder in Konqueror replaced the 0xF6 with a � (replacement character, U+FFFD), but Amarok threw an error for some reason.
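A minimal sketch of that failure mode (the file name here is made up; the point is the lone Latin 1 byte):

# A Latin 1 file name containing ö, decoded as UTF-8: the 0xF6 byte is
# not valid UTF-8, so a tolerant decoder substitutes U+FFFD and moves on.
name_bytes = "Glöckchen.mp3".encode("latin-1")
print(name_bytes.decode("utf-8", errors="replace"))   # Gl�ckchen.mp3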

So, for all of this, there's no good solution but to be more careful and mindful of different encodings. In most cases, you can use a heuristic to determine whether something is in Windows-1252 or UTF-8, but this can never be completely accurate, especially if other encodings are considered at the same time.


* ISO 8859-15 and -16 actually have the Euro sign, but I really doubt many people use them, as they were released around the turn of the millennium, when Unicode was already in broad use, and come with the difficulties of telling them apart from other ISO 8859s.

Update: An anonymous commenter pointed out that it's not too hard to use a heuristic to differentiate between Windows-1252 and UTF-8.

Monday, December 31, 2007

Unicode implementer's guide part 6: Input and output

Until now, in my blog posts on Unicode, I've been focusing on internal processing which has only an indirect effect on users. But now I'm thinking about something else: how can you write a text editing widget which potentially supports as many locales as possible? This should generally be the goal when writing internationalized applications, and almost all applications should be eventually internationalized, so why not a text editor?

As it turns out, this is an extremely complicated task, combining the difficulties of input methods and rendering. In this article I'll try to brush on as many relevant topics as possible, trying not to ramble too much on any particular one. Really, each section should probably be given its own blog post.

To test the Unicode-blind approach, I'm looking at the Factor 0.91 editor gadget, which pays no attention at all to internationalization and basically parrots back whatever characters it receives. Different keyboard layouts were tested on Mac OS X 10.5, though that shouldn't matter here.

Typing European scripts

European scripts are pretty easy. You just type letters on the keyboard and they appear, always in left-to-right order, on the screen, one character appearing for each keystroke in many languages. This works for English, German, Russian, Georgian and many other languages, but not for, say, Spanish or Vietnamese. In those languages, there are characters like é or ở which require multiple keystrokes to input. With a typical Spanish keyboard, to write é, first the ´ key is pressed and then the e key. For ở, you first press the ơ key and then the ̉ key.

There are more complicated input methods for, say, typing é with a US keyboard, and these are system-dependent. On Mac OS X, I use Alt-e to get that first combining ´ mark, and then an e to constitute the character under it. This can be used in any application. In Microsoft Office applications on Windows, you can use Ctrl-' to make a combining ´ mark. Elsewhere, you can't use that key chord, just some complicated keystrokes to input the right hex code. I believe Gnome and KDE each define their own mechanisms for this kind of input.

In a Unicode-blind text editor, none of these multi-keystroke inputs work. When in the Spanish keyboard mode in the Factor editor gadget, pressing ´ does nothing, and when in Vietnamese mode, pressing ơ does nothing. This is because the input system expects something more complicated to happen next. The OS-specific keys, of course, don't work properly either unless there is special handling for them. All of these things must be taken into account in an internationalized editor widget.

Typing East Asian scripts

When looking at the possible input methods in the Mac keyboard layout chooser, you're immediately struck by the fact that four East Asian languages--Chinese, Japanese, Korean and Vietnamese--have far more input methods listed than any other individual language. This is probably due (more than anything else) to the difficulty involved in inputting Han ideographs.

Han ideographs (Chinese characters) are used in Chinese and Japanese (as Kanji), occasionally in Korean and historically with Vietnamese. Unicode encodes over 70,000 Han ideographs, but at least 90% of these are extremely rare. Still, thousands come up in everyday situations and need to be typed into computers.

There are a number of ways of inputting Han ideographs. One way is through a Romanization. This requires several dictionaries, because each language pronounces the ideographs differently, and there are many Romanization systems for each language. One Romanization may correspond to more than one ideograph, so users are presented with a list of numbered choices. Another input method works by specifying the radicals, or component pieces, of the Han ideographs.

Japanese and Korean (and Vietnamese, discussed above) also have alphabets. Japanese has the kana—katakana and hiragana—syllabaries which can be entered by a special keyboard or through Romanization. Korean has Hangul (I blogged about this earlier), which are syllables made of the 24 letters of the jamo alphabet. These can be entered by Romanization or by typing jamo. In either case, things are expected to coalesce into Hangul syllables automatically during typing.
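That coalescing is just Unicode canonical composition at work; a quick illustration in Python (nothing input-method-specific here):

import unicodedata

# The conjoining jamo for h, a, n (U+1112, U+1161, U+11AB) compose under
# NFC into the single precomposed Hangul syllable 한 (U+D55C).
jamo = "\u1112\u1161\u11ab"
assert unicodedata.normalize("NFC", jamo) == "\ud55c"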

The input method is usually chosen outside of the current application, and applications are expected to work perfectly with that choice. A good text editing widget must communicate with the OS settings to pick the right one, and it must implement it properly.

Rendering difficult scripts

Hebrew, European scripts and East Asian scripts are relatively simple to render. Well, maybe not that simple, when it comes to placing (and stacking!) combining marks and putting together Hangul syllables or Han ideographs. But at least two adjacent letters don't tend to combine with each other spontaneously, forming several types of ligatures depending on what's on the left and what's on the right. At least letters go in their prevailing order, not rearranging themselves visually.

Certain Middle Eastern scripts like Arabic and Syriac, and several Indic scripts like Devanagari, don't follow these rules. Arabic and Syriac are written in a cursive form. Each letter has up to four (or sometimes five) shapes in the simplified rendering that most computer typesetting systems use, and vowels are written above or below the characters as combining marks. In some computer typesetting systems, these shapes can be inferred from context, but others will display things incorrectly unless special code points are used which include information about which form it is.

Indic scripts are also complicated, but differently. The term "Indic script" refers not only to scripts used in India but to a whole family of scripts used around South and Southeast Asia. There are 19 of these scripts encoded in Unicode, all with somewhat similar properties. Devanagari, the script used for Hindi, Nepali and Sanskrit, among other languages, is one of these Indic scripts, and others include Malayalam, Tibetan, Tamil, Khmer and Thai. Among consonants in most Indic scripts, there are a large number of ligatures, but these are not encoded into Unicode; they must be inferred by a suitable font. Vowels in Indic scripts can be written after a consonant group, above it, below it, or even before the group, and this must also be inferred by the font. (I'm oversimplifying in both of those sentences, of course. Things are much more complicated.) In most Indic scripts' Unicode encodings, vowels which are placed before a consonant group come afterwards in memory, following logical order. Thai and Lao scripts don't follow these rules, and put those vowels before, but this causes complications in collation.

Some or all of these issues might be solved by choosing a suitable font and display system rather than the editor widget itself.

Line breaking

I wrote about grapheme breaking previously. Unicode line breaking is sort of like grapheme breaking, except much much more complicated. By "line breaking" I mean what you might call word wrap. It's described in UAX #14.

This document describes several classes of characters with different line-breaking properties. Some force a line break before or after their display, others passively allow a line break before or after, some forbid a line break and others have even more complicated properties. All in all, there are 38 categories. Even though the algorithm is complicated, it (like generating a collation key) can be done in linear time with respect to the length of the string.

With respect to line breaking, different scripts and locales have different rules. In Chinese, Japanese and Korean text, where spaces between words are rare or nonexistent, you can do a line break between any two Han ideographs, kana or Hangul syllables. Often, arbitrary line breaks are even allowed within words written in Roman script. In Korean, we have to watch out that we don't break up a Hangul syllable, so conjoining jamo behavior rules are used.

In Khmer, Lao and Thai scripts, there are also no spaces between words, but line breaks must not interrupt words. This requires some linguistic analysis and dictionary lookup, but Unicode provides a code point to indicate line breaking opportunities. This may sound a little weird and impractical, but a similar thing also exists for European scripts which may hyphenate words between lines: the soft hyphen.

Even if things are restricted to American English, Unicode line breaking properties provide for better visual display than separating things at word breaking points, or just where there's whitespace. Still, it provides a significant implementation challenge for the writer of an internationalized text editor.

Bidirectional text

In Arabic, Hebrew and some other scripts, text runs from right to left rather than left to right. But what if we have a document that combines Hebrew and English: how do we organize text running in two directions? In an even more common situation, we could have text containing Hebrew and numbers. (Even in Hebrew and Arabic, decimal numbers are written left to right.) According to Unicode principles, both left-to-right (LTR) and right-to-left (RTL) scripts should be written in logical order, yet their display order is different.

To negotiate these issues, the Unicode Consortium specified the Bidirectional (BIDI) Algorithm (UAX #9). The BIDI Algorithm could easily fill up a whole blog post by itself, and I don't fully understand it, so I won't try to explain it all here. There are a number of different categories that Unicode code points go in for BIDI purposes: characters can be not only LTR or RTL, but also neutral, mirroring, strongly or weakly directioned, or an invisible special-purpose mark. These categories sometimes overlap, and there are several levels of directioning.

The BIDI algorithm is difficult to implement for display, but it gets even more complicated in a text editor. When placing the cursor somewhere, there can be ambiguity as to where it is. When selecting things, there is sometimes a design decision whether to support logical selection (which could be discontiguous) or visual selection (which is hard to implement). The best resource to learn more about this is the UAX document itself.

Conclusion

For Factor, it will be impossible to keep the editor widget written completely in Factor. It would require a huge amount of duplicated effort and system-specific configuration to get it right, so we'll probably end up having to use native (meaning Gtk, Cocoa and Windows UI) editor widgets.

Clearly, the Unicode-blind approach is insufficient here. Because of all this, internationalized editor widgets are implemented pretty rarely. English speakers can get complacent with simple input methods, but these will not work for others.


Warning: Readers should be aware that I haven't studied any one of these topics in-depth. This is just an introduction and shouldn't be taken very seriously.

Sunday, December 23, 2007

Books that should exist

As you can probably guess by the existence of this blog, I love to write. Part of what I love about it is the actual act of writing, but it's really more that I want things to be written that hadn't before been written. I want to write stuff that I wish I could have read. Right now, there are three books and one journal article that I think should exist. Hopefully, I'll be able to write some of these some time in my life.

Implementing Unicode: A Practical Guide

One thing that struck me when beginning to write the Unicode library is that there aren't many books about Unicode. The two I found in my local Barnes and Noble were the Unicode 5.0 Standard and Unicode Explained. Looking on Amazon.com, I can't find any other books that address Unicode 3.1 (the version that moved Unicode from 1 to 17 planes) or newer in detail, ignoring more specialized books.

Both of these are great books, but they aren't optimal for figuring out how to implement a Unicode library. Unicode Explained focuses on understanding Unicode for software development, but shies away from details of implementation. The Unicode Standard explains everything, but it often gets overly technical and can be difficult to read for people not already familiar with the concepts described. A Unicode library implementor needs something in the middle. Unicode Demystified might be the right one, but it describes Unicode 3.0, so it is in many ways outdated.

I wish a book existed which explained Unicode in suitable detail for most library implementors. If this book continues to not exist for many years, I might just have to write it myself. This, however, would be more difficult and less fun than my other book ideas.

[Update: After getting my hands on Unicode Demystified, I realize that I should not have thrown it aside so quickly. It's a great book, and nearly all of it is still relevant and current. It looks like the ebook version I have is updated for Unicode 3.1.]

Programming with Factor

Factor's included documentation is great, but it's not enough, by itself, to learn Factor. I, and probably most people who know Factor, learned through a combination of experimenting, reading blog posts and the mailing list, and talking on #concatenative. Many people will continue to learn Factor this way, but it still seems somehow insufficient. It should be possible to learn a programming language without participating in its community.

Of course, we can't write a book on Factor until we get to see what Factor will look like in version 1.0. But I'm confident that this book will be written, even if it goes unpublished in print, and I'm fairly sure that I'll have at least some part in it. Maybe it'll be a big group effort by the whole Factor community.

What I'm thinking is that we could have a book which teaches programming through Factor, to people who aren't yet programmers. I've talked a lot about this with Doug Coleman, and we agree that most programming books are bad; we should make a new book that does things very differently. But we can't agree or imagine just how...

The Story of Programming

I, like many of you reading this blog, have an unhealthy interest in programming languages. Mine may be a little more unhealthy than yours. Whenever I hear the name of a programming language that I don't know of, I immediately need to read about it, to get some basic knowledge of its history, syntax, semantics and innovations.

Most study of programming languages works by examining the properties of the languages themselves: how functional programming languages are different from imperative object-oriented languages and logic languages and the like. But what about a historical perspective? The history of programming languages is useful for the same reason other kinds of historical inquiry are useful. When we know about the past, we know more about the way things are in the present, and we can better tell what will happen in the future. The history of programming languages could tell us what makes things popular and what makes things ultimately useful.

Unfortunately, not much has been done on this. Knowledge of programming language history is passed on, unsourced, with as much verifiability as folk legend. The ACM has held three conferences called HOPL on this subject over the past 30 years, so all the source material is there. But apart from a book published in 1969, this is all I can find as far as a survey history of programming languages goes.

There is a limit to how much academic research papers can provide. The proceedings of the HOPL conferences aren't bedtime reading, and they don't provide much by way of a strong narrative. A new book could present the whole history of programming from the first writings about algorithms to modern scripting languages and functional programming languages so it's both accessible to non-programmers and interesting to programmers. As far as I know, no one's really tried. But it would be really fun to try.

Inverses in Concatenative Languages

Most of my ideas are either bad or unoriginal. But there's one idea that I came up with that seems to be both original and not horrible, and that's the idea of a particular kind of concatenative pattern matching (which I blogged about, and Slava also wrote about in relation to units).

The basic idea is that, in a concatenative language, the inverse of foo bar is the inverse of bar followed by the inverse of foo. Since there are some stacks that we know foo and bar will never return (imagine bar is 2array and the top of the stack is 1), this can fail. From this, we get a sort of pattern matching. Put more mathematically, if f and g are functions, then (f o g)^-1 = g^-1 o f^-1.

In my implementation of this, I made it so you can invert and call a quotation with the word undo. We don't actually need a full inverse; we only need a right inverse. That is, it's necessary that [ foo ] undo foo be a no-op, but maybe foo [ foo ] undo returns something different. Also, we're taking all functions to be partial.

Anyway, I think this is a cool idea, but I doubt it could fill a book. I want to write an article about it for an academic journal so that I can explain it to more people and expand the idea. It could be made more rigorous, and it could use a lot more thought. I hope this works.

When will I write these?

So, I hope I'll be able to write these things. Unfortunately, I'm not sure when I could. I need to finish my next 10 years of education, which won't leave tons of free time unless I give up blogging and programming and my job. Also, I'm not sure if I'm capable of writing a book or an article for an academic journal, though maybe I will be in 10 years when I'm done with my education. It wouldn't be so bad if someone stole my thunder and wrote one of these things because what I really want to do is read these books.

Update: Here are a couple more books (or maybe just long surveys) that should exist but don't: something about operational transformation, and something about edit distance calculations and merge operations for things more complicated than strings. Right now, you can only learn about these things from scattered research papers. (I never thought I'd find the references at the end so useful!)

Thursday, December 13, 2007

Alphanum sort and Unicode

So, I'll just be the Unicode troll here and say, all of your solutions to the problem of sorting alphanumeric file names are insufficient. To recap, if you haven't been following Reddit: it's annoying how, when looking at a directory, the file named z10.txt is sorted between z1.txt and z2.txt, and we programmers should come up with a solution. The gist of the solution is that you split each file name up into a list of alphabetic and numeric runs, parse the numeric runs as numbers, and then sort on the result.
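
To make that gist concrete, here's a rough sketch in Python (the rest of the code in this post is Factor, but the naive version is easiest to show this way). The name alphanum_key is mine, and the tagging of each run is just there so numeric and alphabetic runs never get compared against each other:

import re

def alphanum_key(s):
    # split into runs of digits and non-digits: "z10.txt" -> ["z", "10", ".txt"]
    runs = re.findall(r"\d+|\D+", s)
    # parse the numeric runs; tag every run so numbers and strings are never
    # compared directly (here, numeric runs arbitrarily sort before alphabetic ones)
    return [(0, int(r)) if r.isdigit() else (1, r) for r in runs]

print(sorted(["z1.txt", "z10.txt", "z2.txt"], key=alphanum_key))
# ['z1.txt', 'z2.txt', 'z10.txt']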

What's wrong with that?

Every one of these implementations, including a slightly internationalized version (just for some Scandinavian locale, apparently), fails to produce the sort order humans expect. Remember, in ASCII, capital Z comes before lower-case a. Jeff Atwood hinted at internationalization problems, and that's just part of the problem. We should use the Unicode Collation Algorithm (which I previously blogged about). The UCA provides a locale-dependent, mostly linguistically accurate collation for most locales (basically all that don't use Han ideographs).

The basic idea that we need here is that the UCA is based on a number of levels of comparison. This is useful not only in "international" locales, but also in United States English. First, we compare two strings with most features removed, then we add back in accent marks, and then capitalization, and then punctuation/spacing. So, even if we want "B" to come before "b", we can have "bench" come before "Bundle"--before comparing for capitalization, we compare for the letters themselves, and "bench" obviously precedes "bundle". For a better description of the UCA, see my past blog post.

So, the most basic way to put the UCA together with the Alphanum algorithm is to just use the same algorithm, except with the UCA to compare the strings rather than an ASCII comparison like most other implementations have done. But this would completely destroy the UCA's subtle level effect: even if "A" is sorted before "a", we would want "a1" to come before "A2". We also want "A1" to come before "a2", and we also don't want to ignore case completely; it's still significant.

The UCA solves this problem when we're working with "a1" and "A2", but not when we're comparing "a2" and "a10". For this, we need a slightly more complicated solution based on something like the Alphanum algorithm. But breaking the string up by itself won't allow for the layering kind of behavior that the UCA depends on.

What's a solution for this?

One way to fix this is to break the string up into its segments, right-pad the alphabetical strings and left-pad the numbers. We can just break things up into segments of consecutive digits and non-digits. The length that we'll pad to for the nth subsequence is the maximum of the lengths of the nth subsequences for all of the strings. For numbers, we can left-pad with plain old 0, but for strings, we want to right-pad with something that's lower than any other character. This sounds hard to do, but by tailoring the DUCET (the data table for the UCA), we can just define a character in a private use area as a new, special padding character. This padding character will be completely non-ignorable, with a fixed weight, and it will have lower primary, secondary and tertiary weights than any other character, by definition.

OK, that sounds really complicated. Let's step back a minute and see how this padding could work out to provide functionality equivalent to the existing Alphanum algorithm. For this, we'll assume access to an ASCII-order sorting algorithm, and just figure out how to pad the string in the right way to come up with a suitable sort key. Instead of some new character, we can just use a null byte. So, if we have the strings "apple1b" and "the2345thing1", the fully padded strings should look like "apple0001b\0\0\0\00" and "the\0\02345thing1", where '\0' is the null byte.

I can make a simple Factor implementation by stealing Doug's cut-all code from his description of a solution to this problem.

! Pad every element of seq to the length of the longest one, using quot
! to do the padding (the target length is curried onto quot).
: pad-seq ( seq quot -- newseq )
    >r dup [ length ] map supremum r>
    curry map ; inline

! Make all the piece lists the same length, filling in with empty strings.
: pad-quote ( seq -- newseq )
    [ "" pad-right ] pad-seq ;

! Left-pad numeric runs with the character 0.
: pad-number ( str -- newstr )
    [ CHAR: 0 pad-left ] pad-seq ;

! Right-pad alphabetic runs with the null character.
: pad-string ( str -- newstr )
    [ 0 pad-right ] pad-seq ;

! Cut each string into runs of digits and non-digits, pad the piece lists
! out to the same length, and transpose so corresponding runs are grouped.
: pieces ( strings -- pieces )
    [ [ digit? ] [ ] cut-all ] map pad-quote flip ;

! Pad one column of runs; the flag alternates between string padding and
! number padding for successive columns.
: pad ( pieces ? -- ? newpieces )
    [ pad-string f ] [ pad-number t ] if swap ;

! Pad every column (starting with string padding), transpose back, and
! join the pieces into fully padded strings.
: pad-strings ( strings -- newstrings )
    pieces t swap [ swap pad ] map nip flip [ concat ] map ;

! Sort the original strings by their padded versions.
: alphanum-sort ( strings -- sorted )
    dup pad-strings [ 2array ] 2map sort-values keys ;

This, by far, isn't the most concise implementation, but the advantage is that it can easily be adapted for a better collation algorithm.

But where's the real Unicode implementation of this?

I'm not sure where to implement this in terms of tailoring and hooking it up with the UCA. I'd either have to (a) go into C++ and do this with ICU, or (b) write my own Unicode library which has tailoring, and do it there. I'm too scared to do the first, and the second isn't done yet. I've looked at Java's tailoring, but I don't see how I could define a character that's lower than all other characters. I feel a little bit lame/trollish putting this blog post out there without a real implementation of the solution, but I hope I was able to expand people's knowledge of the concerns of internationalization (yes, I'll write that word all out; I don't need to write i18n) in text-based algorithms.

Wednesday, November 21, 2007

When Unicode doesn't just work itself out

So, in my job as a student worker at Carleton College, I'm working on programming the website. Just a week or so ago, I basically completed a classified housing ads module, which I'm fairly happy about, even if it was pretty trivial. They've made their own in-house CMS, now open-source, written in PHP and called Reason. Being a very respectful employee, I won't allude to any possible irony in this name. Not at all.

Anyway, a couple of days ago, one of the non-student workers, Margaret (the names of those involved have been swapped with others' to protect their identity) came in with a question: can anybody, who's not doing something important, come up with a regular expression to detect bad UTF-8? Since I was basically unoccupied, reading a book from the library in preparation for my next super-secret project, and since I know a little about Unicode, I volunteered.

I sat down with Margaret as she explained the problem: somehow, a bunch of bad Unicode had gotten into Carleton's database. This isn't a bug that we can reproduce, but a number of times in the past, there have been these question marks, like �, appearing in the content. Whenever it comes up, we delete it, but we need a regular expression to find all of the bad parts.

When this happens, malformed UTF-8 has made it into the database; UTF-8 is the encoding used internally in Reason. Reason takes the Unicode-blind approach I discussed earlier, but here that approach failed. Most likely the cause is strange, invalid input from browsers which can't easily be replicated, except by artificially constructing queries. All good web designers know that you can't trust the browser (a fact which I internalized only recently), and that apparently extends down to the validity of the UTF-8 itself.

Margaret's idea was to find characters whose most significant bit was on, surrounded by characters whose most significant bit was off. However, there are other places where Unicode could be malformed and checking for this specific case isn't enough. With a bit of cursory Googling, we found phputf8, a pure-PHP library which has functions for validating, cleaning, encoding and decoding UTF-8. We could use this library to clean all input, to ensure that it is in valid UTF-8 and will appear without � all the time. This would be a bit computationally intensive, though.
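
I don't remember phputf8's actual function names, so here's just the general idea of "validate and clean", sketched in Python with made-up names:

def is_valid_utf8(raw):
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def clean_utf8(raw):
    # drop malformed byte sequences instead of letting them show up later
    # as replacement characters
    return raw.decode("utf-8", errors="ignore")

print(is_valid_utf8(b"caf\xc3\xa9"))   # True  ("café")
print(is_valid_utf8(b"caf\xc3"))       # False (truncated multi-byte sequence)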

Another approach is to take advantage of the database itself. Reason works on MySQL, but what character encoding does it use? Since Margaret didn't know, I asked Eric, our omnipotent server admin. Carleton actually has one SQL server instance where there are many different databases. Some of these databases use UTF-8, Eric told me, but the main Reason database is encoded in ISO Latin 1, as far as MySQL is concerned. The content is, in truth, encoded in UTF-8; when Reason was started, MySQL was at version 3, which didn't support Unicode, and we continue to use the same database. Eric guided me through the migration in a wonderfully hackish way: serialize the whole database, then, in that serialization, replace all instances of "latin1" with "UTF8", then deserialize the database.

There are still a couple potential problems: for one, it's possible that there could be invalid UTF-8 in the database that MySQL doesn't check for. The best way to fix this would be to do a one-time cleanup with something like phputf8 before the migration. Otherwise, things will be sort of corrupted from the start. After this, it shouldn't be possible to put invalid UTF-8 in the database. The second thing is that there are still some remaining Unicode issues, most notably normalization. Though we could trust the browser to give us text in NFC, that's generally a bad idea. It'd be better to normalize all text, and MediaWiki has code for that.

Once all strings are considered UTF-8 by the database, things should move much more smoothly. Not only will invalid UTF-8 not be accepted, but collation will work properly, since MySQL supports Unicode collation consistent with the Unicode Collation Algorithm (UCA) and Default Unicode Collation Element Table (DUCET). But there's a minor caveat: MySQL 5.0 and lower only supports three octets of UTF-8. In 2001, Unicode 3.1 introduced the possibility of a fourth octet, for supplementary planes, used for obscure Han characters, math symbols, old scripts, etc. MySQL 5.0 doesn't allow any of these to be used in UTF-8 mode, though the Unicode-blind approach does allow it.

Nevertheless, it's good to be aware of Unicode. The fact that you can sometimes get away without noticing Unicode and its encodings is not a reason to ignore it.

Update: Alexander Barkov, a MySQL developer, pointed out to me that MySQL 6.0 supports three encodings, utf8, utf16 and utf32, including support for code points outside the BMP. Previous versions of MySQL had only three octets of utf8, or the fixed-width ucs2, which only allows the BMP. I'm very happy about this, but of course it takes some time to upgrade versions.

Tuesday, October 16, 2007

Unicode implementer's guide part 5: Collation

What could be simpler than alphabetical order[1]? It's a simple lexicographic comparison of two strings, comparing the first character to see which is greater, then the second as a tie breaker, then the third, and so on. Unicode makes a lot of things harder, but this shouldn't be any different. Unicode doesn't prohibit this kind of alphabetical order; you can have a 'Unicode-conformant' process which uses this order.

Great idea, but...

This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary, consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.

It's not just that the letters are in the wrong order. It also won't be sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think, as this will still give unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, because "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.

It's crucial that things are sorted correctly; if they are in a different order than the user expects, then a user, looking at a long alphabetized list, may conclude that an item on the list doesn't exist.

A better way for people

To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there's a number of steps: First, compare two strings with all of the features stripped. If they're equal, compare the accent marks of the strings, as a tie breaker. If those are equal again, compare the capitalization as a second tie breaker. And if those are still equal, compare the punctuation.
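
Sketched in Python, with a made-up strip_to_level helper standing in for all of the real per-level work, the tie-breaking looks roughly like this:

def collate(a, b, strip_to_level):
    # strip_to_level(s, n) reduces s to level n only:
    # 0 = base letters, 1 = accents, 2 = case, 3 = punctuation/spacing.
    # (Hypothetical helper; building it is where the real work is.)
    for level in range(4):
        ka, kb = strip_to_level(a, level), strip_to_level(b, level)
        if ka != kb:
            return -1 if ka < kb else 1
    return 0  # equal at every level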

I left out one thing--ligatures like æ. This can be expanded to something that looks like "ae" in the first stage, but looks different in the accent mark tiebreaker. We'll call them expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort, if the Danish locale is used. In Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes. This can be achieved with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation[2].

UTS #10

Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Technical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.

UTS #10 suggests a simple way of implementing these tiebreakers that works efficiently for sorting large lists of strings: calculate a collation key, a flat sequence of weights, for each string, and compare these collation keys with a simple binary comparison. This collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.

Collation keys and the DUCET

But how do we make the collation key? Well, that table I mentioned earlier is called the DUCET, the Default Unicode Collation Element Table. It describes a neutral locale which is linguistically inaccurate for everyone. For most[3] characters in NFD, there is a three-tuple[4] of the primary, secondary, and tertiary collation weights. In assembling the collation key from a string, go through the string, first appending the primary weights of each character[5], then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation[6].

A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equivalent up to the end of the shorter one, the longer string will be sorted second. This is because 0 is less than any collation weight. The secondary collation weight represents the accent marks, and the tertiary collation weight represents the capitalization. (See note 6 about punctuation.)
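
Here's a rough sketch of the key construction in Python, with the DUCET stood in for by a plain dictionary from character to a (primary, secondary, tertiary) triple; it glosses over variable weighting and collation elements (see the notes below):

def collation_key(s, ducet):
    # ducet: {char: (primary, secondary, tertiary)} -- a stand-in for the
    # real table; real code iterates over collation elements, not characters
    levels = ([], [], [])
    for ch in s:
        for level, weight in zip(levels, ducet[ch]):
            if weight != 0:      # zero weights are omitted from the key
                level.append(weight)
    key = []
    for level in levels:
        key.extend(level)
        key.append(0)            # separator: 0 is lower than any weight
    return tuple(key)

# Comparing strings is now just comparing their keys:
# sorted(strings, key=lambda s: collation_key(s, ducet))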

Implementation and tailoring

I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hints for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In fact, this is somewhat unfortunate, as it is a symptom of the fact that the ordering of Han ideographs requires a more involved process, using some sort of dictionary lookup for pronunciation or radical index, depending on the context.

But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale, and must always be modified somewhat. Different contexts such as language, country, and organizational preference can demand slightly different collation orders that users depend on. Modifying the collation table to suit this is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way. But there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium. This includes information about collation.

Conclusion

Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but still leaves some things out: there's no specified process for ignoring words like 'the' in the beginning of a book title, or padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.



[1] Actually, we should probably call it collation, since alphabetical order implies we're working with an alphabet. With Unicode, some scripts don't use an alphabet.

[2] There are a number of different ways to sort Korean Hangul syllables appropriately. The precomposed Hangul syllables are already in the correct collation order, but a problem arises in the representation of old Hangul texts, where multiple initial, medial or final jamo can be used. (In modern Korean, only one is used per syllable.) There is more than one way to sort this. Because of an apparent bureaucratic argument between the Unicode Consortium and the ISO (who publishes the parallel standard ISO 10646), none of these are encoded in the DUCET. The one that sounds the simplest to implement is to decompose all syllables and place a special terminator weight after the end of each Hangul syllable. More details are in UTS #10.

[3] I say 'most' because Han characters and Hangul syllables are left out, as their weights can be derived algorithmically. Maybe I shouldn't say 'most', as most characters are in that group, though.

[4] Actually, it's a four-tuple, but the last entry isn't very useful for anything. I think it's the same as the shifted strategy for variable weighting, described in note 6.

[5] Actually, to make expansions and contractions work, you shouldn't be iterating over characters; you should iterate over collation elements. For example, in Slovak, 'ch' forms a collation element with a distinct single primary weight.

[6] There are many different ways to represent the punctuation, but that is beyond the scope of this article. Generally, we don't refer to it as 'punctuation' but rather 'variable-weighted'. The most useful variant (they all have applications, however) is called 'shifted', adding an algorithmically-derived fourth level of weight to compare variable-weighted characters.

Saturday, September 15, 2007

When Unicode just works itself out

I love college; there are so many smart people here. One smart person I met named Paul designed his own CMS (it started as a forum) using PHP and MySQL called Really Cool Stuff while in high school. It has tons of features, looks pretty well-designed, and he has real, non-programmer users. (He just upgraded and wiped the site, so it's a little empty right now.) I'm jealous.

I've heard many bad things about PHP's Unicode handling, so I asked Paul what he thought about it. He told me that he didn't write any special code for it. In PHP, the strings are composed of normal 8-bit characters. In MySQL, an 8-bit string is also used. Convinced I had found a bug, I opened up a new forum post and entered a bunch of special characters, expecting to see a mess of misrendered partial code points from the lack of explicit handling.

Disappointingly, the program worked properly. All of the characters displayed as they were entered.

What happened?

It was too simple for me to realize at first: it's UTF-8 all the way through.

UTF-8 is the encoding which is overwhelmingly used for transmission over the internet, due to its backwards compatibility with ASCII and its efficiency in representing ASCII. Like nearly all Unicode encodings, UTF-8 represents text as a string of bytes, with each character taking 1, 2, 3 or 4 bytes. In this case, the only thing the website needed to do was to get the string from input, store it in the database, and then serve it back out on a different page. PHP generally works with 8-bit strings, but here, it doesn't really matter. Modern web browsers treat everything as UTF-8 by default. Within PHP and MySQL, it's all just bytes that have no inherent meaning. It really doesn't matter whether the bytes are characters or UTF-8 octets, because the web browser controls the presentation.
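
For example, in Python (nothing here is PHP- or MySQL-specific; it's just the encoding itself):

for ch in "a", "é", "€", "𝄞":
    print(ch, len(ch.encode("utf-8")))   # 1, 2, 3 and 4 bytes, respectively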

So these two basic operations, storage and retrieval, happen to work just fine here. The input is UTF-8, and the output is interpreted as UTF-8, so everybody's happy. You could call this the Unicode-blind approach. This is basically all Paul's program does with strings: no fancy manipulation, just input and output. But there's a third operation going on here, implicit in lookup. The titles, usernames, etc. are similarly stored in 8-bit strings, and looking them up in the database involves, ultimately, testing two strings to see if they are equal.

Wait a minute!

This should raise a red flag here. A simple byte-equality comparison is being used when the strings may be in different, but canonical-equivalent, forms! (I discussed normalization and equivalence in detail earlier. Two strings are considered canonical-equivalent if they are the same after being put in NFD (or NFC).) That is, there might be strings which differ in which code points are used, but represent the same glyphs. A simple example of this is the difference between a lower-case a with an acute accent (one character), and a lower-case a followed by a combining acute accent (two characters). Both of these represent the same piece of text, so they should be treated the same. The Unicode standard suggests that they be treated this way.
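
Here's the same example in Python, using the unicodedata module just to make the problem concrete:

import unicodedata

precomposed = "\u00e1"    # a with an acute accent, as a single code point
decomposed = "a\u0301"    # a followed by a combining acute accent

print(precomposed == decomposed)    # False: they differ code point by code point
print(unicodedata.normalize("NFD", precomposed)
      == unicodedata.normalize("NFD", decomposed))    # True once both are normalized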

However, in the real world, this is rarely an issue. Probably it will never come up among Paul's users. From browser input, text is usually in a form that's mostly kinda like Normalization Form C (NFC). In any case, it's nearly always consistent among the people who use non-ASCII characters, so it will rarely create a problem.

It's disappointing, but many popular programs and libraries, including many Unicode fonts, don't treat canonical-equivalent strings the same. The Unicode standard only says that no process shall presume that another process treats any two canonical-equivalent strings differently, and suggests that it'd be nice if canonical-equivalent strings were always treated as the same. The practical manifestation of this is that, in many Unicode fonts, a with an acute and a with a separate combining acute are displayed differently. In the case of Paul's program, a user with an accent mark in his name will probably only be able to log in if he enters his user name in the right form. The username might show up on his screen as the same, but if the user uses, say, combining accent marks instead of precomposed characters, the database might not find him. But, again, this will rarely occur.

So, what about other operations?

If you go beyond the simple operations of storage, lookup, and equality comparison, you'll run into problems in internationalization. Concatenation should work OK too. According to Paul, these four operations are all the website needs to do with strings, so everything should work out OK. But there are four classes of string operations that can cause problems in the know-nothing Unicode approach described above. These may become important as Paul continues to develop his CMS.

  1. Dealing with individual characters. There are two layers of problems here. First, the string is in UTF-8, so it takes a bit of work to extract characters. (I previously wrote about the format of UTF-8, halfway down that page.) Accessing a character in the normal byte offset way just won't work (see the sketch after this list). The default substring and length functions also don't work, and you can't loop through indexes of a string the way you may be accustomed to. Second, in many situations, it's more appropriate to think of graphemes than characters. (Case study with another Carleton programmer.) Individual processing by characters may not always work.

    An immediate consequence of the lack of individual character processing is that regular expressions do not work. Additionally, when using Unicode, regular expressions must not use old patterns like [a-zA-Z] to match alphabetic characters, as this fails when more diverse texts are considered. More subtly, many regular expressions implicitly depend on assumptions like 'words are separated by blanks or punctuation', which are inappropriate in general.

  2. Alphabetical order. I still have yet to write about this in detail, but alphabetical order is much more complicated than you may think. Simple programs which are only used with ASCII often get away with sorting in code point order, after converting the string to a consistent capitalization. But in general, this doesn't work. In this scheme, a with an accent mark will sort after z. Due to the complicated tie-breaking rules governing locale, accent marks, ligatures, contractions, capitalization and punctuation in proper sorting, the comparison operators included in most programming languages by default (usually code point order) will not be appropriate for international applications. For example, in American English, an a-e ligature sorts like an a followed by an e, and 'café' sorts in between 'cafe' and 'cafes'. The Unicode Collation Algorithm (UCA) sorts this all out.

  3. Word breaking and search. When considering all of Unicode, it is inappropriate to assume that words are separated by blanks. For example, in Japanese, adjacent Kanji should be treated as separate words. This has implications when searching by whole word match. For insensitive search, it often makes sense to use part of the collation key, which I'll describe later.

  4. Uppercase, lowercase, and titlecase. With a totally naive approach, capitalizing a UTF-8 string the way you capitalize an ASCII string, you would only see the letters in the ranges A-Z and a-z transformed properly, with other characters completely ignored. (This is half lucky accident, half by design of UTF-8.) Apparently, PHP has some support for UTF-8 in some of its string functions, including case transformations. But by default, only the ASCII transformation is done. (It's possible to do everything except for Turkish, Azeri and Lithuanian case operations using the same function, but PHP requires an explicit change of locale.)
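
Here's the sketch promised in item 1, in Python, showing what goes wrong when you treat byte offsets as character offsets:

s = "café"
raw = s.encode("utf-8")

print(len(s), len(raw))        # 4 characters, but 5 bytes
print(raw[3:4])                # b'\xc3' -- just half of the two-byte 'é'
print(raw[:4].decode("utf-8", errors="replace"))   # 'caf\ufffd': the broken half shows up as �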


(Of course, if you do something like write a rich AJAXy text editor, more problems will also show up.) But if you don't use any of these aspects of text manipulation, your program should be fine for the most part, without explicit Unicode support.

This isn't to say that it's OK to ignore Unicode, though. Most complex programs will soon run into one of these issues, but I still think this is an interesting case study. I feel like I've said this a million times, but the best way to avoid running into any of these issues is to have them solved in the standard library correctly. This way, all string manipulations are Unicode-aware, and more complicated things like breaking properties can be easy to access on all systems.


Note about databases: Most modern SQL databases support Unicode in some way or another, by which I mean they can store strings in particular Unicode encodings. MySQL 4.1 looks like it has reasonably good Unicode support, including several Unicode encodings (with some non-Unicode ones, too), and collation based on the UCA, with support for tailoring the order to specific locales. But unfortunately, it looks like only characters in the Basic Multilingual Plane (BMP) are supported. Note that when storing UTF-8 in a database Unicode-blind, the string may take up more bytes in the database than it has characters when displayed, because some characters take multiple octets.

Tuesday, September 4, 2007

Strings are now complicated

It's too late to turn back now. Now that the Unicode floodgates have been opened, and we're trying to support every script by default, string processing has gotten a bit more complicated.

In my pre-frosh orientation trip, I met another kid who enjoys programming. He told me about a little PHP script he wrote called Synesthesia. From his description, I imagine it goes something like this. First, generate a CSS file with classes named c0 through c255, each with a random text color. Then, take an input text file, and generate an HTML document from it. Put each character in a span specifying the class as "c" plus the ISO Latin 1 (8859-1) number associated with that character after the character has been put in lower case. The lower case is done so that "t" and "T" have the same color. The effect of this program is really cool, assigning each character a random color, making interesting designs and patterns from a plain text.
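
If I had to guess at the code, it would be something like this Python sketch; this is my reconstruction from his description, not Paul's actual PHP, and it assumes Latin-1 input:

import html, random

def css_classes():
    # one random text color per Latin-1 value
    return "\n".join(
        ".c%d { color: #%06x; }" % (i, random.randrange(0x1000000))
        for i in range(256))

def synesthesia(text):
    # class chosen from the lower-cased character, so 't' and 'T' match;
    # only works for Latin-1 text, which is exactly the limitation below
    return "".join(
        '<span class="c%d">%s</span>' % (ord(ch.lower()), html.escape(ch))
        for ch in text)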

Now, how can we make this work with every language? It gets just a little bit more difficult. First of all, before doing anything, the string should be normalized. The Unicode standard explicitly states that any Unicode-conformant process must treat all canonical-equivalent strings equally. The easiest way to do this is to get the string in a consistent form, by normalizing it in some way.

When coloring a letter, we should be thinking about graphemes, not code points. Code points are the 21-bit basic units of Unicode text, whereas a grapheme is a unit of display, for example, an e with an accent mark over it, or a Hangul syllable. So we want to color units of display, not, say, just an accent mark. Iteration through graphemes is a feature of all sufficiently useful Unicode libraries. We can put the span around the whole grapheme, not just one code point.

When all of Unicode is considered, conversion to lower case isn't enough to cancel out all of the unimportant differences between similar graphemes. A better way to do it is to take the initial primary collation weight (of all non-ignorable characters) and use that as the unique identity for coloration. Just one little problem: there are, potentially, 65536 primary collation weights. That's a pretty big CSS file, so, where possible, only the used weights should be generated as CSS classes for text colors.

If you don't follow this (which is likely, as I haven't discussed collation in detail yet, and Unicode is confusing), I don't blame you. Unicode-aware text processing is complicated. While the Unicode consortium has made sure to allow less intelligent programs to be called conformant, they still might not be as good or readily internationalizable (is that a word?) as they could be.

Of course, it's ridiculous to expect programmers to be aware of all of this. That's why it's extremely important to have a good, easy-to-use standard library for Unicode, which abstracts as many things as possible away from the programmer. The optimal case, which I'm working towards, is to have nearly everything be Unicode-aware by default.

Monday, August 27, 2007

Unicode implementer's guide part 4: grapheme breaking

I have to be honest: I lied. I oversimplified the semantics of grapheme cluster breaking, and I hope no one messed up any software over it. But luckily, as far as I know, no one is following this and implementing Unicode, so I didn't cause too much damage. Anyway, it seems that an outright, objective error doesn't inspire nearly as much rebuke as a mere controversial essay. (I like rebuke. Please rebuke me if I've made a false claim in this post.)

Unicode grapheme breaking, like everything else in Unicode, is unexpectedly complex. But, as always, there exists an efficient algorithm to process it.

What's a grapheme cluster?

Abstract Unicode strings, like the ones you're used to, are a series of characters. In Unicode parlance, these are often called code points, but I'll usually call them characters here. But these characters aren't what users think of when they hear "character." Some characters, such as accent marks, go on top of, or underneath, other characters in a horizontal text. Some characters, like \r and \n (CR and LF) often come one after another, signifying together just one thing: a line break.

Because of these and other complications, an additional concept of a "grapheme" is useful. A grapheme is a series of one or more code points that, together, forms what a user thinks of as a character. When you use your left and right arrow keys to scroll through text, you're iterating through graphemes, not characters. If you want to reverse a string, you reverse the graphemes, not the characters. These are just a couple of the many contexts where graphemes show up. Any self-respecting Unicode library needs to be able to detect grapheme cluster boundaries. (I'll use 'grapheme' interchangeably with 'grapheme cluster' for simplicity.)

The rules

The primary source for the relevant Unicode algorithm here is Unicode Standard Annex (UAX) #29: Text Boundaries. This describes not only grapheme boundaries but also word and sentence boundaries. Another related document is UAX 14: Line Breaking Properties. I will discuss all of these in a later post, but for now, I haven't even started implementing them. Grapheme cluster boundaries are described in terms of the junction or disjunction of adjacent characters. The core of the rules is described in section 3.1. I've simplified them down to ignore characters that won't be present in our NFD-normalized strings. In the below rules, × signifies that the two adjacent code points belong to one grapheme cluster, and ÷ indicates that there is a grapheme cluster break between those characters.

Break at the start and end of text.


GB1. sot ÷
GB2. ÷ eot

Do not break between a CR and LF. Otherwise, break before and after
controls.

GB3. CR × LF
GB4. ( Control | CR | LF ) ÷  
GB5. ÷ ( Control | CR | LF )

Do not break Hangul syllable sequences.

GB6. L × ( L | V )
GB7. V × ( V | T )
GB8. T × T

Do not break before extending characters.

GB9.   × Extend

Otherwise, break everywhere.

GB10. Any ÷ Any

The character groups L, V, T, CR, LF, Extend, Control and Any (which I'll call grapheme classes, a name I invented) are defined in the UAX. Their intuitive definitions are simple, though. The Extend grapheme class encompasses all characters which are marked as extenders, and the Control grapheme class can be thought of as containing all control and line-breaking characters, except CR, LF, ZWNJ and ZWJ. (I'll discuss the last two later.) CR and LF are just that--the characters \r and \n, respectively. These need special handling. L is initial jamo, V is medial jamo, and T is final jamo. Any is anything else. sot is the start of the text, and eot is the end. Note that, in the above rules, earlier rules override later rules.

A first shot

My first attempt at implementation didn't go so well. At first, I implemented grapheme cluster breaks without even glancing at the UAX. This was incredibly stupid. I acted like there were two rules, GB9 and GB10. And I misunderstood Extend to be the same as the group of non-starters, that is, characters with a non-zero combining class. Using this incorrect definition, which I repeated in several blog posts, I made the words next-grapheme and prev-grapheme. These words take an index and a string and locate the next or previous grapheme break. (In reality, prev-grapheme never worked, and I had insufficient unit tests so I didn't catch this.)

This simple protocol gives you everything you need to iterate forwards and backwards through graphemes in strings. Using it, it is possible to define a word which generates an array of all of the graphemes in the string:

! Accumulate the grapheme starting at index i, then recurse from the
! next grapheme boundary until the end of the string is reached.
: (>graphemes) ( i str -- )
    2dup bounds-check? [
        dupd [ next-grapheme ] keep
        [ subseq , ] 2keep (>graphemes)
    ] [ 2drop ] if ;

! Split a string into an array of its graphemes.
: >graphemes ( str -- graphemes )
    [ 0 swap (>graphemes) ] { } make* ;

And with this, it's easy to reverse a string, preserving graphemes rather than breaking them up. If we didn't preserve graphemes when reversing strings, the resulting reversal would be incoherent and represent something different, with, for example, accent marks on the wrong character.

! Reverse by graphemes so combining marks stay attached to their base.
: string-reverse ( str -- rts )
    >graphemes reverse concat ;

Note that the concat here does not need to be normalizing, whereas normally, concat and append need to do a little bit of reordering to be properly normalized in NFD.

This interface, along with >graphemes and string-reverse, works independently of the bad implementation of next-grapheme. So, when I fixed that, in the two following versions, I was able to retain the same code for >graphemes and string-reverse.

Fixing some bugs

So, my first version worked for my simple unit tests, but when I expanded testing, I soon found that it failed. This led to a much more complicated version of next-grapheme, which used mountains of conditionals to determine whether two adjacent characters belong to the same grapheme.

Most of the grapheme class detection was simple, involving only character classes and simple ranges. But Extend is slightly more complicated. The Extend grapheme class is defined as code points with the property Grapheme_Extend, but there's a more direct definition. Looking in the description page for the Unicode Character Database (UCD), we see that Grapheme_Extend is defined as the characters in the categories Me and Mn, plus a few extras in Other_Grapheme_Extend. Other_Grapheme_Extend is defined in PropList.txt. It's only 21 characters, but rather than enter them in manually, I decided to write a parser for PropList.txt so that the update to the next version of Unicode is easier. (Actually, I cut out only the relevant part of the file, because it will be included in the Factor distribution, and there's no sense in including thousands of lines of text that will not be used.)

There were a few problems: first, every character was tested for its grapheme breaking properties multiple times, where one time would have sufficed. Second, the code was structured in a way that required me to rewrite the whole thing for reverse traversal!

Looking at UAX 29, I saw several references to a 'table-based implementation', but there was no explanation of this. Confused, I wrote to the Unicode list ([email protected], an active, public mailing list) asking for an explanation of this. The Unicode list is useful but a bit technical, and my only response was a pointer to the ICU implementation of breaking properties, the BreakIterator. Naively, I looked up the BreakIterator class in the ICU source code, imagining that the code governing it would be there, or in one of the classes that that file directly references. But after looking at maybe 10 files of C++, I gave up. I'm not smart enough for C++, or at least for these big projects.

Through the ICU documentation, I saw that a closely related problem was line breaking properties. Looking at the Unicode spec for that (UAX 14), I got an actual description of table-based implementations. The feeling was something between an epiphany and a realization of incredible stupidity.

Table-based implementation

Here's the basic idea of how to implement grapheme cluster breaks, the table-based way:

Step 1: Before runtime, create a table describing which grapheme classes connect and which grapheme classes disconnect.

Step 2: Given the grapheme classes of two adjacent characters, you can query the table to see if there's a grapheme cluster break between them.

Step 3: To detect grapheme clusters, iterate through a string and determine the grapheme class of each character, until the classes have a break between them.

The table is an 8x8 bitarray, where the rows represent the grapheme class of the character before the potential break and the columns represent the class of the character after it. Simply indexing it tells us whether there is a grapheme break. Once you have this table, and the facility to tell which grapheme class each character is in, the task is easy to write. In essence we're making a finite state machine, where the state is completely determined by the previous character. For more details, you can check out my implementation.
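
Here's a sketch of steps 2 and 3 in Python; grapheme_class and the joined table are assumed to come from step 1, which is described next:

def grapheme_break_indices(s, joined, grapheme_class):
    # joined[prev][cur] is True when a character of class prev and a
    # character of class cur stay in the same grapheme cluster (8x8 table);
    # grapheme_class(ch) maps a character to its class index
    indices = [0]                       # GB1: break at start of text
    for i in range(1, len(s)):
        if not joined[grapheme_class(s[i - 1])][grapheme_class(s[i])]:
            indices.append(i)
    indices.append(len(s))              # GB2: break at end of text
    return indices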

Generating the table

But let's look at Step 1 in more detail, which is decidedly not as easy. How do we get from the rules to the table? Sure, you could generate the table manually, since it's not very large, but that process is annoying, error-prone and creates difficult-to-maintain code. A better way to go about it is to generate the table directly from the rules.

No, this doesn't require any fancy DSLs, just a simple algorithm with some utility functions. For this description, assume false does not equal zero. First, initialize a GxG array to false, where G is the number of grapheme classes. Then, define the operation connect to set the array at a point to 1 if it is currently false, else do nothing. Define the operation disconnect to set the array at a point to 0 if that point is currently false. These allow the table to give precedence to earlier operations and ignore overlapping later ones. Once we're done with all of the connects and disconnects, generate the final table by making all points which are 1 true, and all points which are 0 or false, false.
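
In a Python-flavored sketch (using None for that third 'false' value), the precedence trick is just:

G = 8                                   # number of grapheme classes
table = [[None] * G for _ in range(G)]  # None = no rule has spoken yet

def connect(a, b):
    if table[a][b] is None:
        table[a][b] = 1                 # earlier rules win; later ones are ignored

def disconnect(a, b):
    if table[a][b] is None:
        table[a][b] = 0

def finish():
    # only pairs explicitly connected end up joined; everything else breaks
    return [[cell == 1 for cell in row] for row in table]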

On top of this backing, I implemented a few functions. connect-before connects one class to a list of classes, to implement rules like "L × ( L | V )". connect-after connects a list of classes to a class, to implement rules like " × Extend" (whose 'list of classes' is all of the classes). break-around disconnects all junctions between the first list and the second list, to implement " ÷ ( Control | CR | LF )" (using, again, all classes as one of the arguments). Once I had these functions, generating the table from the rules was a simple transliteration:

CR LF connect                             ! GB3
{ Control CR LF } graphemes break-around  ! GB4, GB5
L { L V } connect-before                  ! GB6
V { V T } connect-before                  ! GB7
T T connect                               ! GB8
graphemes Extend connect-after            ! GB9


So...

The solution ended up being more complex than I thought, but at least it wasn't unintelligible. I'll go back now and fix all of my incorrect descriptions of grapheme breaking in previous blog entries. Along these same lines, word breaking, sentence breaking and line breaking can be implemented. But they're more complex, involving optional breaks in the case of line breaking, and non-adjacent characters in the case of word/sentence boundaries. But this is all for a later post!