Sift finds shared content between documents. Give it a collection of documents (a massive body of legislation, contracts, research papers, news articles, anything) and it builds a fingerprint index. Then hand it a new document, and it tells you in milliseconds which parts match something already in the collection, and where.
It works on any text: PDFs, Word documents, HTML pages, plain text files. A corpus of a million documents produces an index that fits on a laptop.
Traditional text comparison checks a new document against every document in a collection. That gets slow fast. Sift takes a different approach: it converts each document into a compact set of fingerprint hashes and stores them in an index. Checking a new document means generating its fingerprints and doing hash lookups, not scanning text. A 10-page document queries against a 100,000-document index in under 100 milliseconds on a normal laptop.
Building the index takes one pass through each document (minutes to hours, depending on the corpus size). But once built, every query after that is nearly instant.
| Corpus size | Index time | Index size | Query time |
|---|---|---|---|
| 1,000 docs | ~1 min | ~50 MB | <50ms |
| 10,000 docs | ~10 min | ~500 MB | <50ms |
| 100,000 docs | ~2 hrs | ~5 GB | <100ms |
| 1,000,000 docs | ~20 hrs | ~50 GB | <200ms |
Sift uses the winnowing algorithm, published by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken at Stanford (SIGMOD 2003). The algorithm converts text into overlapping character sequences (k-grams), hashes each one, and then uses a sliding window to select a representative subset of those hashes as the document's fingerprint. Two documents that share a passage will share fingerprints at the corresponding positions, which is what makes the index lookups work.
Sift makes this available as a general-purpose, corpus-agnostic tool.
pip install -e .
For full format support (PDF, DOCX, HTML, web UI):
pip install -e '.[all]'
Or install only what you need:
pip install -e '.[pdf]' # PDF support
pip install -e '.[docx]' # Word documents
pip install -e '.[html]' # HTML pages
pip install -e '.[web]' # Web interface + URL fetching
# Create a named corpus
sift create legislation --description "State bills 2020-2024"
# Add a single file
sift add legislation bill-sb1234.pdf
# Add a directory of files
sift add legislation ./bills/ --recursive --pattern "*.pdf"
# Add from a URL
sift add legislation https://example.com/document.html
sift query legislation new-bill.pdf
Output:
Query: new-bill.pdf
Corpus: legislation
Query fingerprints: 847
Search time: 23.4ms
Found 3 match(es):
1. model-act-2019.pdf
Similarity: 73.2% (620 shared fingerprints)
Matching passages: 4
[1] "a person who is not engaged in an unlawful activity and who is..."
[2] "has no duty to retreat and has the right to stand his or her..."
2. florida-sb-436-2005.pdf
Similarity: 41.8% (354 shared fingerprints)
...
sift query legislation new-bill.pdf --format json
sift query legislation new-bill.pdf --format html -o report.html
sift query legislation new-bill.pdf --format csv
The HTML report shows matching passages side by side with the source documents highlighted.
sift serve
Opens a local web interface at http://127.0.0.1:8192 where you can upload documents, manage corpora, run queries, and view passage-level comparisons in your browser.
sift list # List all corpora
sift list legislation # Documents in a corpus
sift info legislation # Corpus stats and parameters
sift remove legislation # Delete a corpus
sift export legislation backup.json # Export for sharing
sift import backup.json # Import a shared corpus
When creating a corpus, two parameters control sensitivity:
k (default: 25) -- The k-gram size. Matches shorter than this won't be detected. Lower values catch shorter shared phrases but produce more noise. Higher values only catch longer passages.
window (default: 20) -- The winnowing window. The algorithm guarantees
detection of any shared substring at least k + window - 1 characters long
(default: 44 characters). Larger windows produce smaller indexes.
# More sensitive (catch shorter matches)
sift create contracts --k 15 --window 10
# Less sensitive (only substantial passages)
sift create books --k 40 --window 30
Sift keeps its data in ~/.sift/ by default (a single SQLite file). Change
this with the --data-dir flag or the SIFT_DATA_DIR environment variable.
The winnowing algorithm:
Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. "Winnowing: Local Algorithms for Document Fingerprinting." Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003.
MIT