nivekithan/levenshtein-automata
Levenshtein Automata

A Kotlin implementation of Levenshtein Automata for fast fuzzy string searching in large dictionaries. This project implements progressively optimized versions of the algorithm described in the paper "Fast String Correction with Levenshtein-Automata" by Klaus Schulz and Stoyan Mihov (2002).

Overview

Levenshtein Automata provide an efficient way to find all words in a dictionary that are within a specified edit distance (Levenshtein distance) of a query word. This is particularly useful for:

  • Spell checking and autocorrection
  • Fuzzy search in large text corpora
  • DNA sequence matching
  • Query suggestion systems

Traditional approaches compute the Levenshtein distance for each word in the dictionary, resulting in O(n×m) complexity per word, where n and m are the lengths of the compared strings. Levenshtein Automata precompute a finite state machine that can traverse a trie-based dictionary in linear time relative to the dictionary size, making fuzzy search significantly faster.
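For reference, the O(n×m) baseline that the automaton approach avoids repeating per dictionary word can be sketched as a single-row dynamic-programming routine (an illustrative sketch, not code from this repository):

```kotlin
// Classic DP Levenshtein distance: O(n*m) time, O(m) space.
// Each cell holds the minimum edits to turn a prefix of `a` into a prefix of `b`.
fun levenshtein(a: String, b: String): Int {
    var prev = IntArray(b.length + 1) { it }   // distance from "" to each prefix of b
    for (i in 1..a.length) {
        val curr = IntArray(b.length + 1)
        curr[0] = i                            // distance from prefix of a to ""
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost, // match or substitution
            )
        }
        prev = curr
    }
    return prev[b.length]
}
```

Running this against every dictionary word is exactly the per-word cost the precomputed automaton amortizes away.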

Features

  • Multiple implementation versions (v1 through v7) showcasing progressive optimizations
  • Trie-based dictionary for efficient prefix sharing
  • Precomputed DFA (Deterministic Finite Automaton) construction for maximum performance
  • Support for edit distances up to 9
  • Comprehensive benchmarking suite
  • Dictionary with 370,000+ English words

Project Structure

src/
├── dictionary/
│   ├── trie.kt              # Trie data structure implementation
│   └── wordDictionary.kt    # Dictionary loader
├── levensteinAutomata/
│   ├── levensteinDistance.kt    # Basic Levenshtein distance (v1)
│   ├── levenshteinv3.kt         # Optimized runtime automata construction
│   ├── levenshteinv7.kt         # Precomputed DFA (fastest)
│   └── ...                      # Other implementation versions
├── benchmark.kt             # Performance benchmarking
├── profile.kt              # Profiling utilities
└── Main.kt                 # Entry point

data/
├── actual_words.txt        # 370,105 English words
└── queries.txt             # 35,392 test queries

benchmarkResults/           # Benchmark comparison results

Algorithm Evolution

v1: Naive Levenshtein Distance

Basic recursive implementation that computes Levenshtein distance for each dictionary word. Extremely slow but correct.

v3: Runtime Automata Construction

Constructs a Levenshtein automaton on-the-fly for each query. The automaton tracks states as (offset, dUsed) pairs:

  • offset: Position in the query string
  • dUsed: Number of edit operations used

The automaton processes each character by considering:

  • Deletion: Skip character in dictionary word (cost: 1)
  • Insertion: Skip character in query (cost: 1)
  • Substitution: Replace character (cost: 1)
  • Match: Characters are equal (cost: 0)

See implementation at src/levensteinAutomata/levenshteinv3.kt
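The transitions above can be sketched as a step function over sets of (offset, dUsed) states. The names here are illustrative and do not match the repository's actual API; insertion (skipping a query character) is modeled as an epsilon move, since it advances the query without consuming a dictionary character:

```kotlin
// One NFA state: how much of the query is consumed, and how many edits are spent.
data class State(val offset: Int, val dUsed: Int)

// Epsilon closure for insertions: skip query characters at cost 1 each.
fun closure(states: Set<State>, queryLen: Int, maxD: Int): Set<State> {
    val result = states.toMutableSet()
    val queue = ArrayDeque(states)
    while (queue.isNotEmpty()) {
        val s = queue.removeFirst()
        if (s.offset < queryLen && s.dUsed < maxD) {
            val t = State(s.offset + 1, s.dUsed + 1)
            if (result.add(t)) queue.add(t)
        }
    }
    return result
}

// Consume one dictionary character `c` from every live state.
fun step(states: Set<State>, query: String, c: Char, maxD: Int): Set<State> {
    val next = mutableSetOf<State>()
    for (s in states) {
        // Deletion: skip `c` in the dictionary word, query position unchanged.
        if (s.dUsed < maxD) next += State(s.offset, s.dUsed + 1)
        if (s.offset < query.length) {
            if (query[s.offset] == c) {
                next += State(s.offset + 1, s.dUsed)       // Match (cost 0)
            } else if (s.dUsed < maxD) {
                next += State(s.offset + 1, s.dUsed + 1)   // Substitution
            }
        }
    }
    return closure(next, query.length, maxD)
}

// Accept if the unconsumed query suffix fits in the remaining edit budget.
fun isAccepting(states: Set<State>, queryLen: Int, maxD: Int): Boolean =
    states.any { queryLen - it.offset <= maxD - it.dUsed }
```

Feeding a candidate word character by character and checking `isAccepting` at its end decides membership within the edit distance; an empty state set means the branch can be pruned.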

v7: Precomputed DFA (Recommended)

The fastest implementation that precomputes all possible states and transitions for a given edit distance. Uses characteristic vectors to encode character positions efficiently.

Key optimizations:

  • One-time DFA construction: Build the automaton once for a given distance
  • Characteristic encoding: Bit vector representation of character positions
  • State normalization: Minimizes the number of unique states
  • Maximum chi-width: 3 × D + 1 for efficient encoding

See implementation at src/levensteinAutomata/levenshteinv7.kt
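A characteristic vector can be sketched as a bitmask over a window of the query, marking which positions equal the current character. This is an illustrative helper, not the repository's actual encoding:

```kotlin
// Bit i of the result is set iff query[start + i] == c, for a window of
// `width` positions (3 * D + 1 in this project). Positions past the end
// of the query contribute 0 bits.
fun characteristicVector(query: String, start: Int, width: Int, c: Char): Int {
    var bits = 0
    for (i in 0 until width) {
        val pos = start + i
        if (pos < query.length && query[pos] == c) {
            bits = bits or (1 shl i)
        }
    }
    return bits
}
```

Because the DFA's transitions are keyed by this bitmask rather than by raw characters, the transition table stays independent of the alphabet size.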

Performance

Benchmark results comparing v3 (runtime construction) vs v7 (precomputed DFA) on 10,000 queries against a 370,000-word dictionary:

Algorithm    Distance    Total (ms)    ns/query    Throughput (q/s)
------------------------------------------------------------------------
v3           D=1         1,056.49      105,649     9,465
v7           D=1         932.21        93,221      10,727
    → v7 is 11.8% faster than v3

v3           D=2         11,963.62     1,196,362   835
v7           D=2         9,656.02      965,602     1,035
    → v7 is 19.3% faster than v3

v3           D=3         71,798.83     7,179,883   139
v7           D=3         44,437.36     4,443,736   225
    → v7 is 38.1% faster than v3

Key Findings:

  • v7 achieves 10,727 queries/second at distance 1
  • Performance improvement increases with edit distance (38% faster at D=3)
  • DFA construction time is amortized across queries (only 386ms for D=3)

Usage

Basic Fuzzy Search

import dictionary.createWordDictionary
import levensteinAutomata.v7.LevenshteinAutomata
import levensteinAutomata.v7.fuzzySearchTrie

fun main() {
    // Load dictionary (370,000+ words)
    val dictionary = createWordDictionary()
    
    // Create precomputed automaton for edit distance 2
    val automata = LevenshteinAutomata(distance = 2)
    
    // Find up to 100 words within distance 2 of "hello"
    val results = fuzzySearchTrie(dictionary, "hello", automata, maxWords = 100)
    
    println(results)
    // Output: [hello, hell, hallo, jello, cello, ...]
}

Running Benchmarks

# Compile the benchmark
kotlinc src -include-runtime -d build/benchmark.jar

# Run the benchmark
java -jar build/benchmark.jar

The benchmark will compare v3 vs v7 implementations across multiple edit distances and output detailed performance metrics.

Implementation Details

Trie Structure

The dictionary is stored as a trie (prefix tree) where:

  • Each node represents a character
  • Nodes are marked as terminal if they complete a valid word
  • Children are stored in a HashMap for O(1) lookups

See implementation at src/dictionary/trie.kt
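A minimal version of this structure, assuming HashMap children and a terminal flag as described above (a sketch, not the actual trie.kt API):

```kotlin
// One node per character; terminal nodes mark complete dictionary words.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerminal = false
}

class Trie {
    private val root = TrieNode()

    fun insert(word: String) {
        var node = root
        for (c in word) node = node.children.getOrPut(c) { TrieNode() }
        node.isTerminal = true
    }

    fun contains(word: String): Boolean {
        var node = root
        for (c in word) node = node.children[c] ?: return false
        return node.isTerminal
    }
}
```

During fuzzy search the automaton walks this trie depth-first, advancing its state set on each child character and abandoning a subtree as soon as the state set goes empty, which is where the prefix sharing pays off.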

State Representation (v7)

States are represented as State(offset, dUsed) where:

  • offset: Number of query characters consumed
  • dUsed: Number of edit operations used so far

Multiple states are grouped into StateKey and normalized to minimize redundancy. The characteristic vector encodes which positions in a sliding window match the current character.

Edit Distance Constraints

The automaton enforces dUsed ≤ D at all times. When processing dictionary words:

  1. Start at initial state (0, 0)
  2. For each character, compute possible next states
  3. Prune states where dUsed > D
  4. Accept if remaining query can be deleted within budget
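Step 4's acceptance condition reduces to a one-line check (hypothetical helper name): the characters of the query not yet consumed must all be deletable within the remaining edit budget.

```kotlin
// Accept when (query characters left) <= (edit operations left).
fun accepts(offset: Int, dUsed: Int, queryLength: Int, maxD: Int): Boolean =
    (queryLength - offset) <= (maxD - dUsed)
```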

Limitations

  • Maximum supported edit distance: 9 (due to chi-width encoding limits)
  • Memory usage grows exponentially with D: each DFA state has up to 2^(3D+1) outgoing transitions, one per characteristic vector
  • DFA construction time increases significantly for D > 3

References

  • Klaus Schulz and Stoyan Mihov. "Fast String Correction with Levenshtein-Automata" (2002).

License

This is an educational implementation exploring the Levenshtein Automata algorithm and its optimizations.
