
Support for Unicode Fuzzy suggestions #9

@dvirsky

Description


Right now the trie and the Levenshtein automaton that handle prefix searches operate on chars. This means that while we can store and search for prefixes using utf-8 non-latin strings (or any encoding, for that matter), fuzzy matching will not work, because the Levenshtein distance is calculated in bytes and not in codepoints/letters.

We need to operate on unicode runes, not on chars. However, doing this with a variable-length encoding like utf-8 is difficult in the current implementation of the trie, and would probably reduce performance considerably.
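To make the problem concrete, here is a small illustration (not the module's code) of how edit distance computed over utf-8 bytes diverges from the distance over codepoints as soon as multi-byte characters appear:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; works on any sequence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

s1, s2 = "caf", "café"
print(levenshtein(s1, s2))                                   # codepoints: 1
print(levenshtein(s1.encode("utf-8"), s2.encode("utf-8")))   # bytes: 2
```

A single-character insertion of "é" costs 1 at the codepoint level but 2 at the byte level, because "é" encodes to two utf-8 bytes, so a byte-oriented automaton built for max distance 1 would reject the match.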

In order to support this, the following steps will be needed:

  1. The trie and Levenshtein automaton will be converted to operate on unicode wide characters, or runes.
  2. However, to avoid holding 32 bits per letter in the trie, which would consume tons of memory, we'll use 16-bit unicode runes, ignoring anything not contained in the Unicode BMP. This will work for all modern languages.
  3. The Levenshtein automaton will be converted to use 16-bit runes as well, so evaluating any input string will yield the correct results in terms of Levenshtein distance. (This means converting the DFA from a 256-entry pointer array indexed by byte value to a sparse array of pointers to the next state node. This will slow down searches a bit, but can be sped up by sorting the array and using binary search.)
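The sparse transition table mentioned in step 3 could look roughly like this (a sketch with assumed names, not the module's implementation): each DFA state keeps a sorted list of (rune, next-state) pairs and resolves transitions by binary search instead of indexing a dense 256-slot byte array.

```python
import bisect

class SparseState:
    """One DFA state with a sorted, sparse transition table."""

    def __init__(self):
        self.runes = []   # sorted 16-bit rune values
        self.next = []    # next-state ids, parallel to self.runes

    def add_transition(self, rune, next_state):
        i = bisect.bisect_left(self.runes, rune)
        self.runes.insert(i, rune)
        self.next.insert(i, next_state)

    def step(self, rune):
        # Binary search for the transition; None means no matching edge.
        i = bisect.bisect_left(self.runes, rune)
        if i < len(self.runes) and self.runes[i] == rune:
            return self.next[i]
        return None

state = SparseState()
state.add_transition(ord("ש"), 1)   # a BMP rune (U+05E9)
state.add_transition(ord("a"), 2)
print(state.step(ord("ש")))   # 1
print(state.step(ord("z")))   # None
```

Lookup drops from O(1) to O(log k) per transition, where k is the number of outgoing edges per state, which is the "slow down a bit" trade-off the step describes.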

Converting text

  1. We assume all input to FT.SUG* commands is valid utf-8.
  2. We convert the input strings to 32-bit unicode, optionally normalizing, case-folding and removing accents along the way. If the conversion fails, it's because the input is not valid utf-8.
  3. We trim the 32-bit runes to 16-bit runes using the lower 16 bits. These can be used for insertion, deletion and search.
  4. We convert the output of searches back to utf-8.
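The four steps above can be sketched in Python (the real module would do this in C; `to_runes`/`to_utf8` are hypothetical names for illustration):

```python
import unicodedata

def to_runes(utf8_bytes):
    # Steps 1-2: decode (raises on invalid utf-8), normalize, case-fold.
    s = utf8_bytes.decode("utf-8")
    s = unicodedata.normalize("NFKD", s).casefold()
    # Optionally strip the combining accent marks that NFKD splits off.
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Step 3: keep only the lower 16 bits of each codepoint (BMP).
    return [ord(ch) & 0xFFFF for ch in s]

def to_utf8(runes):
    # Step 4: convert search output back to utf-8.
    return "".join(chr(r) for r in runes).encode("utf-8")

runes = to_runes("Café".encode("utf-8"))
print(to_utf8(runes))   # b'cafe'
```

Note that the 16-bit truncation in step 3 silently aliases codepoints above U+FFFF onto BMP values; that is the accepted trade-off of ignoring anything outside the BMP.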

Memory Penalty

The above will make the module consume more memory. However, the penalty will not be too bad:

  1. If all the strings in the trie are non-latin utf-8, there won't be much change, since these are already expressed as 2 or 3 bytes in utf-8.

  2. If all strings are latin ascii - while the text representation in the trie will take exactly 2x the memory - in reality the penalty will be a lot less: depending on the fragmentation of the tree, anywhere from 20% to 80% of the memory is pointers and metadata. Thus doubling the memory of the text data will not directly double total memory consumption.

Any mix of 1 and 2 will yield a result in between.

  3. The Levenshtein automaton will take up more memory, but this will be negligible.
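A back-of-the-envelope check of point 2 (illustrative numbers only): if a fraction p of the trie's memory is pointers and metadata, doubling only the text bytes grows total memory by a factor of p + 2(1 - p) = 2 - p.

```python
# Growth factor of total trie memory when only the text portion doubles,
# for different pointer/metadata fractions p.
for p in (0.2, 0.5, 0.8):
    growth = p + 2 * (1 - p)
    print(f"pointers/metadata {p:.0%} -> total grows x{growth:.1f}")
```

So at 80% overhead the worst-case doubling of text shows up as only a 1.2x increase overall, matching the claim that the penalty is far less than 2x.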
