nivekithan/levenshtein-automata
Levenshtein Automata

A Kotlin implementation of Levenshtein Automata for fast fuzzy string searching in large dictionaries. This project implements progressively optimized versions of the algorithm described in the paper "Fast String Correction with Levenshtein-Automata" by Klaus Schulz and Stoyan Mihov (2002).

Overview

Levenshtein Automata provide an efficient way to find all words in a dictionary that are within a specified edit distance (Levenshtein distance) of a query word. This is particularly useful for:

  • Spell checking and autocorrection
  • Fuzzy search in large text corpora
  • DNA sequence matching
  • Query suggestion systems

Traditional approaches compute the Levenshtein distance for each word in the dictionary, resulting in O(n×m) complexity per word, where n and m are the lengths of the compared strings. Levenshtein Automata precompute a finite state machine that can traverse a trie-based dictionary in linear time relative to the dictionary size, making fuzzy search significantly faster.
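For reference, the O(n×m) baseline that the automaton approach avoids repeating per dictionary word can be sketched as a single-row dynamic-programming routine (an illustrative sketch, not code from this repository):

```kotlin
// Classic DP Levenshtein distance: O(n*m) time, O(m) space.
// Each cell holds the minimum edits to turn a prefix of `a` into a prefix of `b`.
fun levenshtein(a: String, b: String): Int {
    var prev = IntArray(b.length + 1) { it }   // distance from "" to each prefix of b
    for (i in 1..a.length) {
        val curr = IntArray(b.length + 1)
        curr[0] = i                            // distance from prefix of a to ""
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost, // match or substitution
            )
        }
        prev = curr
    }
    return prev[b.length]
}
```

Running this against every dictionary word is exactly the per-word cost the precomputed automaton amortizes away.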

Features

  • Multiple implementation versions (v1 through v7) showcasing progressive optimizations
  • Trie-based dictionary for efficient prefix sharing
  • Precomputed DFA (Deterministic Finite Automaton) construction for maximum performance
  • Support for edit distances up to 9
  • Comprehensive benchmarking suite
  • Dictionary with 370,000+ English words

Project Structure

src/
├── dictionary/
│   ├── trie.kt              # Trie data structure implementation
│   └── wordDictionary.kt    # Dictionary loader
├── levensteinAutomata/
│   ├── levensteinDistance.kt    # Basic Levenshtein distance (v1)
│   ├── levenshteinv3.kt         # Optimized runtime automata construction
│   ├── levenshteinv7.kt         # Precomputed DFA (fastest)
│   └── ...                      # Other implementation versions
├── benchmark.kt             # Performance benchmarking
├── profile.kt              # Profiling utilities
└── Main.kt                 # Entry point

data/
├── actual_words.txt        # 370,105 English words
└── queries.txt             # 35,392 test queries

benchmarkResults/           # Benchmark comparison results

Algorithm Evolution

v1: Naive Levenshtein Distance

Basic recursive implementation that computes Levenshtein distance for each dictionary word. Extremely slow but correct.

v3: Runtime Automata Construction

Constructs a Levenshtein automaton on-the-fly for each query. The automaton tracks states as (offset, dUsed) pairs:

  • offset: Position in the query string
  • dUsed: Number of edit operations used

The automaton processes each character by considering:

  • Deletion: Skip character in dictionary word (cost: 1)
  • Insertion: Skip character in query (cost: 1)
  • Substitution: Replace character (cost: 1)
  • Match: Characters are equal (cost: 0)

See implementation at src/levensteinAutomata/levenshteinv3.kt
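The transitions above can be sketched as a step function over sets of (offset, dUsed) states. The names here are illustrative and do not match the repository's actual API; insertion (skipping a query character) is modeled as an epsilon move, since it advances the query without consuming a dictionary character:

```kotlin
// One NFA state: how much of the query is consumed, and how many edits are spent.
data class State(val offset: Int, val dUsed: Int)

// Epsilon closure for insertions: skip query characters at cost 1 each.
fun closure(states: Set<State>, queryLen: Int, maxD: Int): Set<State> {
    val result = states.toMutableSet()
    val queue = ArrayDeque(states)
    while (queue.isNotEmpty()) {
        val s = queue.removeFirst()
        if (s.offset < queryLen && s.dUsed < maxD) {
            val t = State(s.offset + 1, s.dUsed + 1)
            if (result.add(t)) queue.add(t)
        }
    }
    return result
}

// Consume one dictionary character `c` from every live state.
fun step(states: Set<State>, query: String, c: Char, maxD: Int): Set<State> {
    val next = mutableSetOf<State>()
    for (s in states) {
        // Deletion: skip `c` in the dictionary word, query position unchanged.
        if (s.dUsed < maxD) next += State(s.offset, s.dUsed + 1)
        if (s.offset < query.length) {
            if (query[s.offset] == c) {
                next += State(s.offset + 1, s.dUsed)       // Match (cost 0)
            } else if (s.dUsed < maxD) {
                next += State(s.offset + 1, s.dUsed + 1)   // Substitution
            }
        }
    }
    return closure(next, query.length, maxD)
}

// Accept if the unconsumed query suffix fits in the remaining edit budget.
fun isAccepting(states: Set<State>, queryLen: Int, maxD: Int): Boolean =
    states.any { queryLen - it.offset <= maxD - it.dUsed }
```

Feeding a candidate word character by character and checking `isAccepting` at its end decides membership within the edit distance; an empty state set means the branch can be pruned.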

v7: Precomputed DFA (Recommended)

The fastest implementation that precomputes all possible states and transitions for a given edit distance. Uses characteristic vectors to encode character positions efficiently.

Key optimizations:

  • One-time DFA construction: Build the automaton once for a given distance
  • Characteristic encoding: Bit vector representation of character positions
  • State normalization: Minimizes the number of unique states
  • Maximum chi-width: 3 × D + 1 for efficient encoding

See implementation at src/levensteinAutomata/levenshteinv7.kt
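A characteristic vector can be sketched as a bitmask over a window of the query, marking which positions equal the current character. This is an illustrative helper, not the repository's actual encoding:

```kotlin
// Bit i of the result is set iff query[start + i] == c, for a window of
// `width` positions (3 * D + 1 in this project). Positions past the end
// of the query contribute 0 bits.
fun characteristicVector(query: String, start: Int, width: Int, c: Char): Int {
    var bits = 0
    for (i in 0 until width) {
        val pos = start + i
        if (pos < query.length && query[pos] == c) {
            bits = bits or (1 shl i)
        }
    }
    return bits
}
```

Because the DFA's transitions are keyed by this bitmask rather than by raw characters, the transition table stays independent of the alphabet size.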

Performance

Benchmark results comparing v3 (runtime construction) vs v7 (precomputed DFA) on 10,000 queries against a 370,000-word dictionary:

Algorithm    Distance    Total (ms)    ns/query    Throughput (q/s)
------------------------------------------------------------------------
v3           D=1         1,056.49      105,649     9,465
v7           D=1         932.21        93,221      10,727
    → v7 is 11.8% faster than v3

v3           D=2         11,963.62     1,196,362   835
v7           D=2         9,656.02      965,602     1,035
    → v7 is 19.3% faster than v3

v3           D=3         71,798.83     7,179,883   139
v7           D=3         44,437.36     4,443,736   225
    → v7 is 38.1% faster than v3

Key Findings:

  • v7 achieves 10,727 queries/second at distance 1
  • Performance improvement increases with edit distance (38% faster at D=3)
  • DFA construction time is amortized across queries (only 386ms for D=3)

Usage

Basic Fuzzy Search

import dictionary.createWordDictionary
import levensteinAutomata.v7.LevenshteinAutomata
import levensteinAutomata.v7.fuzzySearchTrie

fun main() {
    // Load dictionary (370,000+ words)
    val dictionary = createWordDictionary()
    
    // Create precomputed automaton for edit distance 2
    val automata = LevenshteinAutomata(distance = 2)
    
    // Find up to 100 words within distance 2 of "hello"
    val results = fuzzySearchTrie(dictionary, "hello", automata, maxWords = 100)
    
    println(results)
    // Output: [hello, hell, hallo, jello, cello, ...]
}

Running Benchmarks

# Compile the benchmark
kotlinc src -include-runtime -d build/benchmark.jar

# Run the benchmark
java -jar build/benchmark.jar

The benchmark will compare v3 vs v7 implementations across multiple edit distances and output detailed performance metrics.

Implementation Details

Trie Structure

The dictionary is stored as a trie (prefix tree) where:

  • Each node represents a character
  • Nodes are marked as terminal if they complete a valid word
  • Children are stored in a HashMap for O(1) lookups

See implementation at src/dictionary/trie.kt
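A minimal version of this structure, assuming HashMap children and a terminal flag as described above (a sketch, not the actual trie.kt API):

```kotlin
// One node per character; terminal nodes mark complete dictionary words.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerminal = false
}

class Trie {
    private val root = TrieNode()

    fun insert(word: String) {
        var node = root
        for (c in word) node = node.children.getOrPut(c) { TrieNode() }
        node.isTerminal = true
    }

    fun contains(word: String): Boolean {
        var node = root
        for (c in word) node = node.children[c] ?: return false
        return node.isTerminal
    }
}
```

During fuzzy search the automaton walks this trie depth-first, advancing its state set on each child character and abandoning a subtree as soon as the state set goes empty, which is where the prefix sharing pays off.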

State Representation (v7)

States are represented as State(offset, dUsed) where:

  • offset: Number of query characters consumed
  • dUsed: Number of edit operations used so far

Multiple states are grouped into StateKey and normalized to minimize redundancy. The characteristic vector encodes which positions in a sliding window match the current character.

Edit Distance Constraints

The automaton enforces dUsed ≤ D at all times. When processing dictionary words:

  1. Start at initial state (0, 0)
  2. For each character, compute possible next states
  3. Prune states where dUsed > D
  4. Accept if remaining query can be deleted within budget
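Step 4's acceptance condition reduces to a one-line check (hypothetical helper name): the characters of the query not yet consumed must all be deletable within the remaining edit budget.

```kotlin
// Accept when (query characters left) <= (edit operations left).
fun accepts(offset: Int, dUsed: Int, queryLength: Int, maxD: Int): Boolean =
    (queryLength - offset) <= (maxD - dUsed)
```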

Limitations

  • Maximum supported edit distance: 9 (due to chi-width encoding limits)
  • Memory usage grows exponentially with D: each DFA state has up to 2^(3D+1) outgoing transitions, one per characteristic vector
  • DFA construction time increases significantly for D > 3

References

  • Klaus Schulz and Stoyan Mihov. "Fast String Correction with Levenshtein-Automata" (2002).

License

This is an educational implementation exploring the Levenshtein Automata algorithm and its optimizations.
