Skip to content

dvelton/sift

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sift

Sift finds shared content between documents. Give it a collection of documents (a massive body of legislation, contracts, research papers, news articles, anything) and it builds a fingerprint index. Then hand it a new document, and it tells you in milliseconds which parts match something already in the collection, and where.

It works on any text: PDFs, Word documents, HTML pages, plain text files. A corpus of a million documents produces an index that fits on a laptop.

Why this is fast

Traditional text comparison checks a new document against every document in a collection. That gets slow fast. Sift takes a different approach: it converts each document into a compact set of fingerprint hashes and stores them in an index. Checking a new document means generating its fingerprints and doing hash lookups, not scanning text. A 10-page document queries against a 100,000-document index in under 100 milliseconds on a normal laptop.

Building the index takes one pass through each document (minutes to hours, depending on the corpus size). But once built, every query after that is nearly instant.

Corpus size Index time Index size Query time
1,000 docs ~1 min ~50 MB <50ms
10,000 docs ~10 min ~500 MB <50ms
100,000 docs ~2 hrs ~5 GB <100ms
1,000,000 docs ~20 hrs ~50 GB <200ms

The algorithm

Sift uses the winnowing algorithm, published by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken at Stanford (SIGMOD 2003). The algorithm converts text into overlapping character sequences (k-grams), hashes each one, and then uses a sliding window to select a representative subset of those hashes as the document's fingerprint. Two documents that share a passage will share fingerprints at the corresponding positions, which is what makes the index lookups work.

Sift makes this available as a general-purpose, corpus-agnostic tool.

Installation

pip install -e .

For full format support (PDF, DOCX, HTML, web UI):

pip install -e '.[all]'

Or install only what you need:

pip install -e '.[pdf]'       # PDF support
pip install -e '.[docx]'      # Word documents
pip install -e '.[html]'      # HTML pages
pip install -e '.[web]'       # Web interface + URL fetching

Usage

Create a corpus and add documents

# Create a named corpus
sift create legislation --description "State bills 2020-2024"

# Add a single file
sift add legislation bill-sb1234.pdf

# Add a directory of files
sift add legislation ./bills/ --recursive --pattern "*.pdf"

# Add from a URL
sift add legislation https://example.com/document.html

Query a document against the corpus

sift query legislation new-bill.pdf

Output:

Query: new-bill.pdf
Corpus: legislation
Query fingerprints: 847
Search time: 23.4ms

Found 3 match(es):

  1. model-act-2019.pdf
     Similarity: 73.2% (620 shared fingerprints)
     Matching passages: 4
       [1] "a person who is not engaged in an unlawful activity and who is..."
       [2] "has no duty to retreat and has the right to stand his or her..."

  2. florida-sb-436-2005.pdf
     Similarity: 41.8% (354 shared fingerprints)
     ...

Output formats

sift query legislation new-bill.pdf --format json
sift query legislation new-bill.pdf --format html -o report.html
sift query legislation new-bill.pdf --format csv

The HTML report shows matching passages side by side with the source documents highlighted.

Web interface

sift serve

Opens a local web interface at http://127.0.0.1:8192 where you can upload documents, manage corpora, run queries, and view passage-level comparisons in your browser.

Other commands

sift list                          # List all corpora
sift list legislation              # Documents in a corpus
sift info legislation              # Corpus stats and parameters
sift remove legislation            # Delete a corpus
sift export legislation backup.json   # Export for sharing
sift import backup.json            # Import a shared corpus

Tuning

When creating a corpus, two parameters control sensitivity:

k (default: 25) -- The k-gram size. Matches shorter than this won't be detected. Lower values catch shorter shared phrases but produce more noise. Higher values only catch longer passages.

window (default: 20) -- The winnowing window. The algorithm guarantees detection of any shared substring at least k + window - 1 characters long (default: 44 characters). Larger windows produce smaller indexes.

# More sensitive (catch shorter matches)
sift create contracts --k 15 --window 10

# Less sensitive (only substantial passages)
sift create books --k 40 --window 30

Where data is stored

Sift keeps its data in ~/.sift/ by default (a single SQLite file). Change this with the --data-dir flag or the SIFT_DATA_DIR environment variable.

Acknowledgments

The winnowing algorithm:

Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. "Winnowing: Local Algorithms for Document Fingerprinting." Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003.

License

MIT

About

Superfast fingerprinting and overlap detection for any text corpus

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors