The foundational MinHash patent. Partitions documents into shingles, fingerprints them, applies pseudo-random permutations to generate sketches, and clusters objects sharing sufficient features. The probabilistic near-duplicate detection technique that underpins every modern dedup system at web scale.
Patent Overview
- Inventor
- Andrei Z. Broder, Steven C. Glassman, Charles G. Nelson, Mark S. Manasse, Geoffrey G. Zweig
- Assignee
- AltaVista Co
- Filed
- 2000-08-21
- Granted
- 2002-02-19
The Challenge
The Challenge
The web is full of near-duplicates: scraped reposts, syndicated articles, mirror sites, cosmetic-variant pages. Comparing every page pair exactly is O(N²) and infeasible at web scale. The system needs probabilistic near-duplicate detection that scales to billions of documents.
- Pairwise Comparison Doesn't Scale — Comparing every pair of pages exactly is quadratic. Web-scale demands sublinear algorithms.
- Cosmetic Variations Should Cluster Together — Pages differing by timestamps, ad rotation, or template wrapping are effectively duplicates. The technique must survive cosmetic noise.
- Substantively Different Pages Should Separate — Pages with different body content should not cluster together. The technique must distinguish substantive from cosmetic differences.
- Probabilistic Accuracy Beats Exact When Scaled — Exact comparison is too slow. Probabilistic methods that achieve near-exact accuracy with sublinear cost are the structural win.
- Fingerprint Compactness Matters — Per-document fingerprints must be compact to fit in memory at scale. Storage cost of fingerprint comparison is structural.
Innovation
How The System Works
The system shingles each document into overlapping token sequences, hashes the shingles, applies pseudo-random permutations to derive min-hashes, builds compact per-document sketches from the min-hashes, and clusters documents whose sketches share sufficient features.
- Shingle The Document — Per document, generate overlapping fixed-length token sequences (shingles). Each shingle is a local-content fingerprint primitive.
- Hash Each Shingle — Apply a hash function to each shingle. Per-document set of shingle hashes is the foundational fingerprint set.
- Apply Pseudo-Random Permutations — Per shingle-hash set, apply multiple pseudo-random permutations. Per permutation, retain the minimum hash — the MinHash value.
- Build Per-Document Sketch — Per document, the MinHash values across permutations form a compact sketch — the per-document fingerprint.
- Compare Sketches Probabilistically — Per pair of documents, the fraction of matching MinHash values estimates Jaccard similarity. Estimation accuracy scales with permutation count.
- Cluster By Similarity Threshold — Documents with sketch similarity above threshold cluster as near-duplicates.
- Apply Locality-Sensitive Hashing — For sublinear comparison, group sketches into bands so candidate near-duplicates emerge without all-pairs comparison.
Probabilistic Sketches Beat Exact Comparison
The patent's load-bearing idea is that probabilistic sketches achieve near-exact similarity estimation with sublinear cost. MinHash is the structural primitive that makes web-scale near-duplicate detection feasible.
Sketch Similarity Estimates Set Similarity
Per pair of documents, the fraction of matching MinHash values is an unbiased estimator of Jaccard similarity. The mathematical guarantee turns sublinear comparison into reliable similarity estimation.
- Shingled Tokenization — Per document, overlapping shingles capture local content structure. Survives cosmetic variation.
- MinHash Sketches — Per document, compact sketch of MinHash values enables fast comparison.
- Locality-Sensitive Banding — Sketch banding enables sublinear candidate identification without all-pairs comparison.
Technical Foundation
Technical Foundation
The patent specifies the shingler, hasher, permutation applier, sketch builder, comparator, clusterer, and LSH bander.
- Shingler — Per document, generates overlapping fixed-length token sequences.
- Hasher — Applies hash function to each shingle.
- Permutation Applier — Per shingle-hash set, applies multiple pseudo-random permutations; retains MinHash per permutation.
- Sketch Builder — Per document, MinHash values across permutations form compact sketch.
- Comparator — Estimates per-pair similarity from sketch overlap.
- LSH Bander — Sketches grouped into bands for sublinear candidate identification.
The Process
The Process
Fingerprinting runs at indexing; comparison runs as candidates emerge from banding.
- Shingle Per Document — Overlapping shingles generated at indexing time.
- Hash And Permute — Shingles hashed; permutations applied.
- Build Sketch — Per-document sketch built from MinHash values.
- Apply LSH Banding — Sketches grouped into bands.
- Identify Candidates — Documents in same band become near-duplicate candidates.
- Compare Candidate Sketches — Pairwise sketch comparison estimates similarity.
- Cluster Above Threshold — Near-duplicates clustered.
Quality Control
Quality Control
Probabilistic methods require tuning. The patent specifies safeguards.
- Permutation Count Tuning — Sketch size trades accuracy against storage. Tuned per workload.
- Shingle Size Tuning — Shingle length affects sensitivity to local edits. Too short over-clusters; too long under-clusters.
- Threshold Calibration — Similarity threshold for clustering calibrated against labeled near-duplicate pairs.
- LSH Band Tuning — Band parameters balance recall against false-positive candidate rate.
- Continuous Recalibration — Parameters recalibrate against fresh corpora as content patterns evolve.
Real-World Application
MinHash is foundational across every modern web-scale deduplication system, plagiarism detection, recommender systems, and AI-training-data curation. The probabilistic-sketch pattern is one of the most-cited techniques in modern IR.
- Shingle-based Comparison Unit — Overlapping shingles capture local structure.
- Probabilistic Comparison Method — Sketch similarity is unbiased estimator of Jaccard similarity.
- Sublinear Performance — LSH banding enables sublinear candidate identification.
Why Substantively Differentiated Content Survives Clustering
MinHash clusters cosmetically-similar pages. Substantively differentiated content — distinct structure, examples, voice, depth — survives clustering and gets independent ranking treatment.
Why AI-Training-Data Dedup Inherits This
MinHash is the standard technique for deduplicating AI training corpora. Content quality compounds when the unique work isn't drowned in near-duplicate scraped variants the training pipelines deduplicate first.
<\/section>What This Means for SEO
What This Means for SEO
MinHash shingles documents, fingerprints them, and clusters near-duplicates probabilistically at web scale. SEO implication: cosmetically-similar pages collapse into one cluster and compete as a single thing, so substantive differentiation is what survives.
- Cosmetic Variation Does Not Help — Shingling survives template wrapping, timestamps, and ad rotation by design. Spinning a page with reworded boilerplate or shuffled paragraphs lands you in the same near-duplicate cluster as the original, not a distinct ranking candidate.
- Substantive Differentiation Survives Clustering — Pages with distinct structure, examples, depth, and original framing produce different shingle sets and stay out of the duplicate cluster. Differentiation is structural, not cosmetic, and it is what earns independent ranking treatment.
- Syndication And Scraping Get Folded Together — Reposted and syndicated copies share most shingles with the source and cluster with it. Publishing identical content across many domains does not multiply your footprint, it consolidates it.
- Thin Local Or Doorway Variants Collapse — City-swap pages that change only a place name share nearly all shingles. Such variants cluster as near-duplicates instead of ranking as separate location pages, so each location page needs genuinely unique body content.
- Shingle Length Sets The Edit Sensitivity — The system tunes shingle size so that minor edits do not break clustering. A few word swaps will not lift duplicate content out of its cluster, which is why surface-level rewriting fails as a de-duplication tactic.
- Internal Duplication Wastes Crawl And Equity — Near-identical pages within your own site cluster too, splitting signals and crawl budget. Consolidating overlapping pages or canonicalizing them concentrates ranking signals on one strong version.
- AI Training Pipelines De-Dup The Same Way — MinHash is the standard de-duplication step for AI training corpora. Original, non-derivative content is the version that survives that filter and remains available as a citation source, while scraped near-duplicates get dropped first.