A Kotlin implementation of Levenshtein Automata for fast fuzzy string searching in large dictionaries. This project implements various optimizations of the algorithm described in the paper "Fast String Correction with Levenshtein-Automata" by Klaus Schulz and Stoyan Mihov (2002).
Levenshtein Automata provide an efficient way to find all words in a dictionary that are within a specified edit distance (Levenshtein distance) of a query word. This is particularly useful for:
- Spell checking and autocorrection
- Fuzzy search in large text corpora
- DNA sequence matching
- Query suggestion systems
Traditional approaches compute the Levenshtein distance for each word in the dictionary, resulting in O(n×m) complexity per word, where n and m are the lengths of the compared strings. Levenshtein Automata precompute a finite state machine that can traverse a trie-based dictionary in linear time relative to the dictionary size, making fuzzy search significantly faster.
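To make the baseline concrete, here is the standard dynamic-programming distance computation (a generic sketch, not code from this repository):

```kotlin
// Classic O(n*m) dynamic-programming Levenshtein distance.
// This is the per-word cost that Levenshtein Automata avoid paying
// for every dictionary entry.
fun levenshtein(a: String, b: String): Int {
    // prev[j] holds the distance between a[0 until i] and b[0 until j]
    var prev = IntArray(b.length + 1) { it }
    for (i in 1..a.length) {
        val curr = IntArray(b.length + 1)
        curr[0] = i
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution or match
            )
        }
        prev = curr
    }
    return prev[b.length]
}

fun main() {
    println(levenshtein("kitten", "sitting")) // prints 3
}
```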
- Multiple implementation versions (v1 through v7) showcasing progressive optimizations
- Trie-based dictionary for efficient prefix sharing
- Precomputed DFA (Deterministic Finite Automaton) construction for maximum performance
- Support for edit distances up to 9
- Comprehensive benchmarking suite
- Dictionary with 370,000+ English words
```
src/
├── dictionary/
│   ├── trie.kt               # Trie data structure implementation
│   └── wordDictionary.kt     # Dictionary loader
├── levensteinAutomata/
│   ├── levensteinDistance.kt # Basic Levenshtein distance (v1)
│   ├── levenshteinv3.kt      # Optimized runtime automata construction
│   ├── levenshteinv7.kt      # Precomputed DFA (fastest)
│   └── ...                   # Other implementation versions
├── benchmark.kt              # Performance benchmarking
├── profile.kt                # Profiling utilities
└── Main.kt                   # Entry point
data/
├── actual_words.txt          # 370,105 English words
└── queries.txt               # 35,392 test queries
benchmarkResults/             # Benchmark comparison results
```

Basic recursive implementation that computes Levenshtein distance for each dictionary word. Extremely slow but correct.
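Such a naive recursive version might look like this (an illustrative sketch, not the repository's exact code):

```kotlin
// Naive recursive Levenshtein distance, in the spirit of the v1 baseline.
// Exponential time without memoization: correct but extremely slow.
fun levenshteinRecursive(a: String, b: String): Int = when {
    a.isEmpty() -> b.length
    b.isEmpty() -> a.length
    else -> {
        val cost = if (a[0] == b[0]) 0 else 1
        minOf(
            levenshteinRecursive(a.drop(1), b) + 1,           // deletion
            levenshteinRecursive(a, b.drop(1)) + 1,           // insertion
            levenshteinRecursive(a.drop(1), b.drop(1)) + cost // substitution or match
        )
    }
}
```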
Constructs a Levenshtein automaton on-the-fly for each query. The automaton tracks states as (offset, dUsed) pairs:
- `offset`: Position in the query string
- `dUsed`: Number of edit operations used
The automaton processes each character by considering:
- Deletion: Skip character in dictionary word (cost: 1)
- Insertion: Skip character in query (cost: 1)
- Substitution: Replace character (cost: 1)
- Match: Characters are equal (cost: 0)
See implementation at src/levensteinAutomata/levenshteinv3.kt
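The transitions above can be sketched as a direct NFA simulation over (offset, dUsed) states. This is an illustrative sketch with hypothetical names; the project's v3 code constructs and caches the automaton rather than simulating it this naively:

```kotlin
data class State(val offset: Int, val dUsed: Int)

// Returns true if `word` is within edit distance `maxD` of `query`,
// by simulating the Levenshtein automaton's state sets directly.
fun matches(query: String, word: String, maxD: Int): Boolean {
    var states = setOf(State(0, 0))
    for (c in word) {
        val next = mutableSetOf<State>()
        for ((offset, dUsed) in states) {
            // Insertion: consume c without advancing in the query (cost 1)
            if (dUsed < maxD) next += State(offset, dUsed + 1)
            if (offset < query.length) {
                if (query[offset] == c) {
                    next += State(offset + 1, dUsed)     // match (cost 0)
                } else if (dUsed < maxD) {
                    next += State(offset + 1, dUsed + 1) // substitution (cost 1)
                }
                // Deletion(s) of query characters followed by a match:
                // skipping k query chars costs k edits
                var skip = 1
                while (dUsed + skip <= maxD && offset + skip < query.length) {
                    if (query[offset + skip] == c) {
                        next += State(offset + skip + 1, dUsed + skip)
                        break
                    }
                    skip++
                }
            }
        }
        if (next.isEmpty()) return false
        states = next
    }
    // Accept if the rest of the query can be deleted within the budget
    return states.any { (offset, dUsed) -> dUsed + (query.length - offset) <= maxD }
}
```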
The fastest implementation that precomputes all possible states and transitions for a given edit distance. Uses characteristic vectors to encode character positions efficiently.
Key optimizations:
- One-time DFA construction: Build the automaton once for a given distance
- Characteristic encoding: Bit vector representation of character positions
- State normalization: Minimizes the number of unique states
- Maximum chi-width: `3 × D + 1` for efficient encoding
See implementation at src/levensteinAutomata/levenshteinv7.kt
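The characteristic-vector idea can be sketched as follows. This is a hypothetical helper using the `3 × D + 1` window width noted above, not the project's actual function:

```kotlin
// Characteristic vector: a bit vector marking where character `c` occurs
// in a window of the query starting at `offset`, of width 3*D + 1.
// Bit i is set iff query[offset + i] == c.
fun characteristicVector(query: String, c: Char, offset: Int, d: Int): Int {
    val width = 3 * d + 1
    var chi = 0
    for (i in 0 until width) {
        val pos = offset + i
        if (pos < query.length && query[pos] == c) {
            chi = chi or (1 shl i)
        }
    }
    return chi
}
```

For example, `characteristicVector("banana", 'a', 0, 1)` inspects the window `bana` and sets bits 1 and 3, yielding `0b1010` = 10. Because transitions are keyed on this bit pattern rather than on the character itself, the DFA is independent of the alphabet.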
Benchmark results comparing v3 (runtime construction) vs v7 (precomputed DFA) on 10,000 queries against a 370,000-word dictionary:
| Algorithm | Distance | Total (ms) | ns/query  | Throughput (q/s) |
|-----------|----------|-----------:|----------:|-----------------:|
| v3        | D=1      |   1,056.49 |   105,649 |            9,465 |
| v7        | D=1      |     932.21 |    93,221 |           10,727 |
| v3        | D=2      |  11,963.62 | 1,196,362 |              835 |
| v7        | D=2      |   9,656.02 |   965,602 |            1,035 |
| v3        | D=3      |  71,798.83 | 7,179,883 |              139 |
| v7        | D=3      |  44,437.36 | 4,443,736 |              225 |

v7 is 11.8% faster than v3 at D=1, 19.3% faster at D=2, and 38.1% faster at D=3.
Key Findings:
- v7 achieves 10,727 queries/second at distance 1
- Performance improvement increases with edit distance (38% faster at D=3)
- DFA construction time is amortized across queries (only 386ms for D=3)
```kotlin
import dictionary.createWordDictionary
import levensteinAutomata.v7.LevenshteinAutomata
import levensteinAutomata.v7.fuzzySearchTrie

fun main() {
    // Load dictionary (370,000+ words)
    val dictionary = createWordDictionary()

    // Create precomputed automaton for edit distance 2
    val automata = LevenshteinAutomata(distance = 2)

    // Find up to 100 words within distance 2 of "hello"
    val results = fuzzySearchTrie(dictionary, "hello", automata, maxWords = 100)
    println(results)
    // Output: [hello, hell, hallo, jello, cello, ...]
}
```

```sh
# Compile the benchmark
kotlinc src -include-runtime -d build/benchmark.jar

# Run the benchmark
java -jar build/benchmark.jar
```

The benchmark compares the v3 and v7 implementations across multiple edit distances and prints detailed performance metrics.
The dictionary is stored as a trie (prefix tree) where:
- Each node represents a character
- Nodes are marked as terminal if they complete a valid word
- Children are stored in a HashMap for O(1) lookups
See implementation at src/dictionary/trie.kt
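A minimal trie along these lines might look like (an illustrative sketch; see the file above for the real implementation):

```kotlin
// Minimal trie with HashMap children, as described above.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerminal = false // true if this node completes a valid word
}

class Trie {
    private val root = TrieNode()

    fun insert(word: String) {
        var node = root
        for (c in word) {
            node = node.children.getOrPut(c) { TrieNode() }
        }
        node.isTerminal = true
    }

    fun contains(word: String): Boolean {
        var node = root
        for (c in word) {
            node = node.children[c] ?: return false
        }
        return node.isTerminal
    }
}
```

Because shared prefixes are stored once, the automaton can reject an entire subtree of the dictionary as soon as a prefix leads to an empty state set.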
States are represented as State(offset, dUsed) where:
- `offset`: Number of query characters consumed
- `dUsed`: Number of edit operations used so far
Multiple states are grouped into StateKey and normalized to minimize redundancy. The characteristic vector encodes which positions in a sliding window match the current character.
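Normalization typically relies on subsumption: in Schulz and Mihov's formulation, a state (i, e) subsumes (j, f) when e < f and |j − i| ≤ f − e, so the subsumed state accepts nothing new and can be dropped. A sketch with hypothetical helper names (not the project's exact normalization code):

```kotlin
data class State(val offset: Int, val dUsed: Int)

// True if every word accepted from `b` is also accepted from `a`
// (the standard Schulz-Mihov subsumption condition).
fun subsumes(a: State, b: State): Boolean =
    a.dUsed < b.dUsed && kotlin.math.abs(b.offset - a.offset) <= b.dUsed - a.dUsed

// Normalization: drop subsumed states so that fewer distinct
// state sets (and hence fewer DFA states) arise.
fun normalize(states: Set<State>): Set<State> =
    states.filterNot { s -> states.any { o -> o != s && subsumes(o, s) } }.toSet()
```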
The automaton enforces dUsed ≤ D at all times. When processing dictionary words:
- Start at the initial state `(0, 0)`
- For each character, compute the possible next states
- Prune states where `dUsed > D`
- Accept if the remaining query can be deleted within the edit budget
- Maximum supported edit distance: 9 (due to chi-width encoding limits)
- Memory usage grows exponentially with D: each DFA state stores up to 2^(3D+1) transitions, one per possible characteristic vector
- DFA construction time increases significantly for D > 3
- Fast String Correction with Levenshtein-Automata (Schulz & Mihov, 2002)
- Levenshtein Distance - Wikipedia
This is an educational implementation exploring the Levenshtein Automata algorithm and its optimizations.