Generates SERP snippets from a tokenspace repository: a pre-tokenized, position-indexed document store that enables fast, query-aware snippet selection. The patent describes the snippet generation pipeline behind Google's result excerpts.
Patent Overview
- Inventor: Jeffrey Dean, others
- Assignee: Google LLC
- Filed: 2010
- Granted: 2012-11-27
The Challenge
SERP snippets must surface the most query-relevant passage from a document within milliseconds. Naive approaches re-fetch and re-tokenize per query, blowing the latency budget. A pre-tokenized repository changes the math.
- Snippets Are Latency-Critical — Snippet generation runs per query, per result. Milliseconds matter. Naive re-tokenization is too slow.
- Snippets Must Be Query-Aware — The best snippet depends on the query. Per-query passage selection requires fast access to per-document tokenized content.
- Document Storage Must Compress — Storing full document text per document at web scale costs storage. Compression is required without sacrificing access speed.
- Snippet Boundaries Must Make Sense — Snippets cut mid-sentence look bad. Boundary detection is required to produce readable excerpts.
- Multiple Snippets May Compete — Some queries match multiple passages. Per-passage scoring is required to select the strongest snippet candidate.
Innovation
How The System Works
The system tokenizes each document at indexing time, stores tokens in a position-indexed tokenspace repository, retrieves per-query candidate passages from the repository, scores passages by query relevance, applies boundary detection, and returns the top snippet.
- Tokenize At Indexing — Per document, tokenize content. Each token carries position information.
- Store In Tokenspace Repository — Tokens stored compressed in position-indexed repository. Fast random-access lookup supported.
- Per-Query Passage Candidates — Per query, locate token positions matching query terms. Position clusters become passage candidates.
- Score Passages — Per candidate, compute query-relevance score. Term coverage, term proximity, and position context contribute.
- Detect Snippet Boundaries — Sentence boundaries near top-scoring passages identified. Snippet cropped to clean boundary.
- Format Snippet — Query terms bolded; ellipsis added where cropped. Output is SERP-ready snippet text.
- Cache Where Appropriate — Snippets for popular queries are cached. Cache invalidation is tied to document updates or shifts in query patterns.
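The first steps above (tokenize with positions, build a position index, cluster query-term hits into passage candidates) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the `\w+` tokenizer, and the `max_gap` clustering threshold are all assumptions made for the example.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def tokenize(text):
    """Return (normalized_token, position) pairs; position is the token's ordinal."""
    return [(m.group().lower(), i) for i, m in enumerate(TOKEN_RE.finditer(text))]


def build_position_index(tokens):
    """Map each token to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for tok, pos in tokens:
        index[tok].append(pos)
    return index


def passage_candidates(index, query_terms, max_gap=5):
    """Cluster query-term hits; hits at most max_gap tokens apart share a cluster.

    Returns (first_position, last_position) for each cluster.
    max_gap is a hypothetical tuning parameter.
    """
    hits = sorted(pos for term in set(query_terms) for pos in index.get(term, []))
    clusters = []
    for pos in hits:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return [(c[0], c[-1]) for c in clusters]


doc = ("Fast snippet generation needs tokens. Unrelated filler text goes here "
       "for a while now. Snippet tokens again.")
index = build_position_index(tokenize(doc))  # done once, at indexing time
candidates = passage_candidates(index, ["snippet", "tokens"])  # done per query
```

Only `passage_candidates` runs at query time; the tokenizing and indexing work is paid once per document.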
Pre-Tokenized Speed
The patent's load-bearing idea is that the tokenspace repository pre-pays the tokenization cost at indexing time, leaving snippet generation as a fast position-lookup operation that fits within SERP latency budgets.
Pre-Compute What You Can
Tokenization is expensive but query-independent. Pre-computing tokens at indexing time and storing them efficiently makes snippet generation viable at web scale, query-by-query.
- Position-Indexed Tokens — Per document, tokens stored with position metadata. Enables fast per-query passage candidate location.
- Compressed Storage — Tokenspace compressed without sacrificing random-access speed. Storage cost manageable at web scale.
- Boundary-Aware Cropping — Sentence-boundary detection produces readable snippets. No mid-word or mid-clause cuts.
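One common way to combine compression with random access is to compress fixed-size blocks independently, so a lookup decompresses only the block it needs. The sketch below uses that technique with `zlib`; the source does not say which compression scheme the patent uses, so the class name `BlockTokenStore`, the block size, and the newline-joined encoding are assumptions for illustration.

```python
import zlib


class BlockTokenStore:
    """Hypothetical token store: fixed-size blocks, each compressed on its own.

    Assumes tokens are non-empty strings that contain no newline characters.
    """

    def __init__(self, tokens, block_size=64):
        self.block_size = block_size
        self.length = len(tokens)
        self.blocks = [
            zlib.compress("\n".join(tokens[i:i + block_size]).encode("utf-8"))
            for i in range(0, len(tokens), block_size)
        ]

    def _block(self, block_no):
        """Decompress a single block back into its list of tokens."""
        return zlib.decompress(self.blocks[block_no]).decode("utf-8").split("\n")

    def token_at(self, pos):
        """Random access: decompress only the block that holds position pos."""
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        block_no, offset = divmod(pos, self.block_size)
        return self._block(block_no)[offset]

    def window(self, start, end):
        """Tokens in [start, end), touching only the blocks that overlap it."""
        start, end = max(0, start), min(self.length, end)
        if start >= end:
            return []
        first, last = start // self.block_size, (end - 1) // self.block_size
        tokens = []
        for block_no in range(first, last + 1):
            tokens.extend(self._block(block_no))
        base = first * self.block_size
        return tokens[start - base:end - base]


tokens = [f"t{i}" for i in range(200)]
store = BlockTokenStore(tokens, block_size=64)
```

Smaller blocks make each lookup cheaper but compress less well; larger blocks do the opposite. That trade-off is the one the bullets above describe.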
Technical Foundation
The patent specifies the tokenizer, tokenspace repository, position index, passage scorer, boundary detector, and snippet formatter.
- Tokenizer — Per document at indexing time, tokenizes content into position-tagged tokens. Stopword handling, normalization applied.
- Tokenspace Repository — Compressed, position-indexed storage of per-document tokens. Random-access lookup supported.
- Position Index — Per-document position-to-token map. Enables fast per-query passage candidate location.
- Passage Scorer — Per candidate passage, computes query-relevance score from term coverage, proximity, and position.
- Boundary Detector — Sentence-boundary detection near top-scoring passages. Crops snippet to clean boundaries.
- Snippet Formatter — Applies query-term bolding, ellipsis for cropped regions. Output is SERP-ready snippet text.
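To make the passage scorer concrete, here is a small sketch that combines the three signals named above. The formula and the weights are assumptions for illustration only, since the source gives no formula; "position" is interpreted here as how early the passage starts in the document.

```python
def score_passage(hits, query_terms, doc_length, weights=(0.6, 0.3, 0.1)):
    """Score one candidate passage.

    hits: list of (position, term) pairs for query-term matches in the passage.
    weights: hypothetical (coverage, proximity, position) weights.
    """
    if not hits or not query_terms or doc_length <= 0:
        return 0.0
    wanted = set(query_terms)
    positions = sorted({pos for pos, _ in hits})

    # Coverage: fraction of distinct query terms that appear in the passage.
    coverage = len({term for _, term in hits} & wanted) / len(wanted)

    # Proximity: hit density over the span the hits occupy (1.0 = adjacent hits).
    span = positions[-1] - positions[0] + 1
    proximity = len(positions) / span

    # Position: passages that start earlier in the document score higher.
    earliness = 1.0 - positions[0] / doc_length

    w_cov, w_prox, w_pos = weights
    return w_cov * coverage + w_prox * proximity + w_pos * earliness


query = ["snippet", "tokens"]
tight = score_passage([(1, "snippet"), (2, "tokens")], query, doc_length=100)
partial = score_passage([(50, "snippet")], query, doc_length=100)
```

With these assumed weights, the passage that covers both terms side by side near the start of the document outscores the passage that matches only one term halfway through.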
The Process
Tokenization runs at indexing time; snippet generation runs per query. Pre-paid tokenization keeps query-time latency low.
- Tokenize At Indexing — Per document, tokenize into position-tagged tokens. Store in compressed repository.
- Receive Query — Query arrives. Tokenize query terms.
- Locate Passage Candidates — Per result document, find token positions matching query terms. Position clusters form passage candidates.
- Score Passages — Per candidate, compute query-relevance score.
- Detect Boundaries — Sentence-boundary detection crops snippet to clean cut.
- Format Snippet — Query terms bolded, ellipsis added where cropped.
- Return To SERP — Snippet returned to SERP renderer. Optional caching for popular queries.
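The boundary-detection and formatting steps can be illustrated with a simple regex-based sketch. It assumes sentences end at `.`, `!`, or `?` followed by whitespace or end of text, which is far cruder than a production boundary detector, and it works on character offsets rather than token positions to stay short. The function names and the `<b>` markup are hypothetical.

```python
import re

# Assumed heuristic: sentence-ending punctuation followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def crop_to_sentences(text, start, end):
    """Expand the character range [start, end) out to the nearest sentence boundaries."""
    ends = [m.end() for m in SENTENCE_END.finditer(text)]
    left = max((e for e in ends if e <= start), default=0)
    right = min((e for e in ends if e >= end), default=len(text))
    return left, right


def format_snippet(text, start, end, query_terms):
    """Crop to clean boundaries, bold query terms, and mark cropped sides with an ellipsis.

    No HTML escaping is done here; a real renderer would need it.
    """
    left, right = crop_to_sentences(text, start, end)
    body = text[left:right].strip()
    if query_terms:
        alternation = "|".join(re.escape(t) for t in query_terms)
        body = re.sub(rf"\b({alternation})\b", r"<b>\1</b>", body, flags=re.IGNORECASE)
    prefix = "... " if left > 0 else ""
    suffix = " ..." if right < len(text) else ""
    return prefix + body + suffix


text = ("Intro sentence here. The tokenspace repository stores tokens by position. "
        "Closing remark.")
start = text.index("tokenspace")
snippet = format_snippet(text, start, start + len("tokenspace repository"), ["tokenspace"])
```

The top-scoring passage covers only two words, but the returned snippet is the whole middle sentence, with an ellipsis on each side because text was cropped before and after it.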
Quality Control
Snippets are user-visible quality signals. The patent specifies safeguards.
- Boundary Detection Accuracy — Sentence boundaries must be detected reliably. Mid-sentence cuts produce poor snippets.
- Passage-Score Calibration — Passage scoring is calibrated against user clicks and dwell time. Snippets that drive engagement validate the scorer.
- Multi-Passage Diversity — When a document offers multiple strong passages, the strongest is selected. Per-passage scoring discriminates among them.
- Length Bounds — Snippet length bounded by SERP layout constraints. Excess truncated cleanly.
- Adversarial-Content Filtering — Snippets that surface manipulative content (cloaked text, hidden divs) filtered. Snippet must reflect what the user will see on the page.
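The length-bound safeguard is easy to illustrate. The sketch below truncates at the last word boundary that fits and appends an ellipsis, so the result never exceeds the limit. The 160-character default and the function name `bound_length` are assumptions, since the source states no specific limit.

```python
def bound_length(snippet, max_chars=160, ellipsis="..."):
    """Truncate snippet to at most max_chars, cutting at a word boundary where possible."""
    if max_chars <= len(ellipsis):
        raise ValueError("max_chars must leave room for the ellipsis")
    if len(snippet) <= max_chars:
        return snippet
    budget = max_chars - len(ellipsis)
    # Last space at or before the budget, so the kept text ends on a whole word.
    cut = snippet.rfind(" ", 0, budget + 1)
    if cut <= 0:
        cut = budget  # no usable space found: fall back to a hard cut
    return snippet[:cut].rstrip(" ,;:") + ellipsis


sample = "alpha beta gamma delta"
short = bound_length(sample, max_chars=12)
longer = bound_length(sample, max_chars=13)
```

Trimming trailing commas and similar punctuation before the ellipsis avoids output such as `word,...`.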
Real-World Application
Snippet generation is the user-facing distillation of the document. The tokenspace repository pattern underpins fast snippet selection across modern search engines.
- Pre-tokenized Indexing Strategy — Tokenization runs at indexing time. Query-time snippet generation is fast position-lookup.
- Position-indexed Lookup Speed — Per-document position-to-token map enables fast per-query passage candidate location.
- Boundary-aware Snippet Quality — Sentence-boundary detection produces readable snippets without mid-clause cuts.
Why Front-Loading Key Phrases Helps
Snippet selection rewards passages with high query-term coverage and proximity. Content that surfaces key phrases near the start of paragraphs and sentences is more likely to yield strong snippets.
Why Clean Sentence Boundaries Matter
Boundary detection crops to sentence ends. Well-structured prose with clear sentence boundaries produces clean snippets. Run-on or fragment-heavy writing produces worse snippets.
What This Means for SEO
This patent generates SERP snippets fast by selecting query-relevant passages from a pre-tokenized, position-indexed repository, with sentence-boundary cropping. SEO implication: write clear, well-structured prose with key phrases surfaced early so the snippet engine can extract a strong, readable excerpt.
- Front-Load Key Phrases — Passage selection rewards high query-term coverage and proximity. Surfacing the key answer near the start of paragraphs and sentences increases the chance of a strong snippet that matches the query.
- Clean Sentence Boundaries Win — Boundary detection crops snippets to sentence ends. Well-formed sentences produce clean excerpts; run-on or fragment-heavy writing yields worse snippets that may underperform on clicks.
- The Best Passage Competes Per Query — Multiple passages are scored and the strongest selected per query. A page that answers several related sub-questions clearly gives the engine strong candidates for different queries.
- Cloaked Or Hidden Text Is Filtered — Snippets that would surface manipulative content like hidden divs are filtered, and the snippet must reflect what the user actually sees. Do not hide snippet-bait text the visitor will not encounter.
- Snippets Are Length-Bounded — Snippet length is constrained by layout, and excess is truncated. Make your most important phrasing concise enough to fit within a typical snippet window.
- Snippet Quality Feeds Engagement — Passage scoring calibrates against clicks and dwell, so snippets that drive engagement validate the selection. A page whose excerpts earn clicks reinforces its own surfacing over time.
- Structure Helps Extraction — Because documents are tokenized and position-indexed at indexing time, ahead of any query, clear structure makes relevant passages easy to locate. Logical paragraphing and direct answers improve what the engine can pull.