Detects which stopwords in a query carry meaning ('the who', 'to be', 'how to') and retains them through retrieval rather than blindly stripping them. A foundational query-understanding patent for short-tail intent.
Patent Overview
- Inventor: Paul Haahr, Jeff Dean, others
- Assignee: Google LLC
- Filed: 2007
- Granted: 2019-10-22
The Challenge
Classical IR strips stopwords (the, a, of, to, who, what) to reduce index size and noise. But stopwords often carry intent: 'the who' (the band) vs 'who' (interrogative); 'to be or not to be' is meaningful; 'how to fix' carries action intent. Stripping them silently degrades short-tail intent.
- Blind Stopword Stripping Loses Intent — Stripping 'the' from 'the who' loses the band reference. Many short queries depend on stopwords for intent.
- Meaningful Stopwords Are Context-Dependent — The same word is meaningful in one query, noise in another. Per-query meaningfulness assessment is required.
- Index Size Constrains Retention — Retaining all stopwords inflates the index. Selective retention based on meaningfulness is the design constraint.
- Detection Must Be Fast — Per-query meaningfulness detection runs in real time, so the latency budget is tight.
- Phrases Carry Meaning Beyond Single Words — Stop-phrases like 'how to', 'as a', 'in order to' carry intent at the phrase level. Detection must work at phrase scope.
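The failure mode described above can be reproduced in a few lines. This is a minimal sketch of blind stripping, not anything taken from the patent: the stopword set is just the example list quoted in this section, and `blind_strip` is a hypothetical helper name.

```python
# Stopword set copied from the examples in the text above; real lists are longer.
STOPWORDS = {"the", "a", "of", "to", "who", "what"}


def blind_strip(query):
    """Drop every stopword from the query, regardless of context."""
    return [tok for tok in query.lower().split() if tok not in STOPWORDS]


if __name__ == "__main__":
    for q in ["the who", "who", "to be or not to be", "how to fix a bike"]:
        print(repr(q), "->", blind_strip(q))
```

With this list, 'the who' and 'who' both collapse to an empty term list, so the band query and the interrogative become indistinguishable, and 'how to fix a bike' loses the 'to' that marks it as a how-to query.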
Innovation
How The System Works
The system identifies meaningful stopwords and stop-phrases via statistical patterns over query logs and document corpora, retains them at indexing where appropriate, detects meaningful occurrences in queries at query time, and uses them as full ranking terms rather than stripping.
- Build Meaningful-Stopword Corpus — Statistical analysis over query logs identifies stopwords whose presence materially changes the result distribution.
- Detect Stop-Phrases — Phrase-scope analysis identifies multi-word stop sequences carrying meaning ('how to', 'as a', 'the who').
- Retain At Index — For documents containing meaningful stopwords, those tokens are kept in the index rather than stripped.
- Per-Query Detection — At query time, each stopword in the query is classified as meaningful or stripping-eligible based on context.
- Treat Meaningful Stopwords As Terms — Meaningful stopwords contribute to retrieval and scoring as full terms, not stripped.
- Strip Non-Meaningful — Stopwords classified as non-meaningful are stripped to reduce noise. Per-query classification balances retention and stripping.
- Continuous Update — Meaningful-stopword corpus updates as query patterns evolve. Detection adapts.
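The first step, finding stopwords whose presence materially changes the result distribution, can be sketched as follows. The text does not say which statistic is used, so this example assumes one: Jaccard distance between the top results seen with and without the stopword. The function names, the threshold of 0.5, and the toy observations are all hypothetical.

```python
def result_shift(results_with, results_without):
    """Jaccard distance between two result sets: 0.0 = identical, 1.0 = disjoint."""
    a, b = set(results_with), set(results_without)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def build_meaningful_corpus(observations, threshold=0.5):
    """Keep phrases whose stopword shifts the results by at least `threshold`.

    `observations` holds (phrase, results_with_stopword, results_without) tuples,
    standing in for aggregated query-log data.
    """
    corpus = set()
    for phrase, results_with, results_without in observations:
        if result_shift(results_with, results_without) >= threshold:
            corpus.add(phrase)
    return corpus


# Toy data for illustration only; document ids are made up.
OBSERVATIONS = [
    ("the who", ["band_home", "band_wiki", "band_tour"],
                ["who_org", "who_wiki", "band_wiki"]),
    ("the weather paris", ["wx_paris", "wx_fr", "wx_app"],
                          ["wx_paris", "wx_fr", "wx_app"]),
]

if __name__ == "__main__":
    print(build_meaningful_corpus(OBSERVATIONS))
```

In the toy data, removing 'the' from 'the who' changes four of the five distinct results (shift 0.8), so the phrase enters the corpus; removing it from 'the weather paris' changes nothing (shift 0.0), so it does not.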
Stopwords Often Carry Intent
The patent's load-bearing idea is that stopwords are not always noise. Per-query, per-phrase meaningfulness classification preserves intent that blind stripping destroys.
Context Determines Meaningfulness
The same word is meaningful in one query and noise in another. The detection layer reads context — surrounding words, query length, query pattern — to classify per occurrence.
- Statistical Identification — Query logs and document corpora identify meaningful stopwords by their effect on result distribution.
- Phrase-Scope Detection — Multi-word stop-phrases ('how to', 'as a') are detected at phrase scope rather than word by word.
- Per-Query Classification — At query time, each stopword in a query is classified as meaningful or stripping-eligible based on context.
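A per-occurrence classifier along these lines might look like the sketch below. It assumes two context signals, neither confirmed as the patent's actual rule: membership in a known meaningful stop-phrase, and whether the query consists only of stopwords. The stopword set, the phrase set, and `classify_query` are illustrative names.

```python
STOPWORDS = {"the", "a", "of", "to", "who", "what", "how", "be"}
# Assumed output of the offline identification step.
MEANINGFUL_PHRASES = {("the", "who"), ("how", "to"), ("to", "be")}


def classify_query(query):
    """Label each token as 'term', 'meaningful' (retained stopword), or 'strip'."""
    tokens = query.lower().split()
    is_stop = [tok in STOPWORDS for tok in tokens]
    meaningful = [False] * len(tokens)

    # Phrase scope: a stopword inside a known stop-phrase is meaningful.
    longest = max(len(p) for p in MEANINGFUL_PHRASES)
    for start in range(len(tokens)):
        for size in range(longest, 1, -1):
            if tuple(tokens[start:start + size]) in MEANINGFUL_PHRASES:
                for k in range(start, start + size):
                    meaningful[k] = True
                break

    # Query-shape signal (assumption): if stripping would leave nothing, keep all.
    if tokens and all(is_stop):
        meaningful = [True] * len(tokens)

    labels = []
    for tok, stop, keep in zip(tokens, is_stop, meaningful):
        if not stop:
            labels.append((tok, "term"))
        else:
            labels.append((tok, "meaningful" if keep else "strip"))
    return labels


if __name__ == "__main__":
    print(classify_query("the who tour dates"))
    print(classify_query("who invented the telephone"))
```

The two sample queries show the context dependence: 'the' and 'who' are labeled meaningful in 'the who tour dates' but are marked for stripping in 'who invented the telephone'.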
Technical Foundation
The patent specifies the meaningful-stopword identifier, phrase-scope detector, index retainer, per-query classifier, retrieval integrator, and corpus updater.
- Meaningful-Stopword Identifier — Statistical analysis over query logs and document corpora identifies stopwords with material effect on results.
- Phrase-Scope Detector — Identifies multi-word stop-phrases carrying meaning at phrase scope.
- Index Retainer — At indexing, retains meaningful stopwords in the index. Selective retention balances index size and meaning.
- Per-Query Classifier — Classifies each stopword in each query as meaningful or stripping-eligible based on context.
- Retrieval Integrator — Meaningful stopwords contribute to retrieval and scoring as full terms. Non-meaningful stopwords are stripped.
- Corpus Updater — Meaningful-stopword corpus updates as query patterns evolve.
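The index retainer's size trade-off can be illustrated with a toy inverted index. This sketch assumes that a retained stop-phrase is stored as a single phrase key (for example 'the who'); the text does not specify the storage format, and `build_index`, `postings_count`, and the sample documents are hypothetical.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "to", "who", "what", "are"}
RETAIN_PHRASES = {("the", "who")}  # assumed output of the offline identifier

DOCS = {
    "d1": "the who are a rock band",
    "d2": "the history of the telephone",
}


def build_index(docs, stopwords, retain_phrases):
    """Inverted index: term -> set of doc ids, with selective stopword retention."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        i = 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in retain_phrases:
                index[" ".join(pair)].add(doc_id)  # keep the meaningful phrase
                i += 2
            elif tokens[i] in stopwords:
                i += 1                             # strip non-meaningful stopword
            else:
                index[tokens[i]].add(doc_id)
                i += 1
    return dict(index)


def postings_count(index):
    """Total number of (term, document) postings, a rough proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())


if __name__ == "__main__":
    strip_all = build_index(DOCS, STOPWORDS, set())
    selective = build_index(DOCS, STOPWORDS, RETAIN_PHRASES)
    retain_all = build_index(DOCS, set(), set())
    for name, idx in [("strip all", strip_all), ("selective", selective),
                      ("retain all", retain_all)]:
        print(name, postings_count(idx))
```

On these two documents, selective retention adds one posting over full stripping, while retaining every stopword more than doubles the posting count. That gap is the motivation for making retention selective.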
The Process
Statistical identification runs offline; index retention runs at indexing; per-query classification runs at query time.
- Identify Meaningful Stopwords — Offline, statistical analysis builds the meaningful-stopword corpus.
- Retain At Index — At indexing, tokens for meaningful stopwords are kept for the documents that contain them.
- Receive Query — A query arrives.
- Per-Stopword Classification — For each stopword in the query, the classifier reads the context and labels it meaningful or stripping-eligible.
- Strip Or Retain — Meaningful stopwords are retained; non-meaningful ones are stripped.
- Retrieve And Rank — Retrieval and ranking use retained stopwords as full terms.
- Update Corpus — Periodic corpus update as query patterns evolve.
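The query-time half of this process (classify, strip or retain, then retrieve and rank) can be sketched end to end. The scoring here is a plain occurrence count and retained stop-phrases are matched as adjacent tokens; both are simplifying assumptions, and every name and document below is hypothetical.

```python
STOPWORDS = {"the", "a", "of", "to", "who", "what", "from"}
MEANINGFUL_PHRASES = {("the", "who")}  # assumed offline output

DOCS = {
    "band": "the who are an english rock band",
    "health": "who guidelines from the world health organization",
}


def to_terms(query, phrases):
    """Turn a query into ranking terms: phrases retained, other stopwords stripped."""
    tokens = query.lower().split()
    terms, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            terms.append(pair)           # retained as one full ranking term
            i += 2
        elif tokens[i] in STOPWORDS:
            i += 1                       # stripped
        else:
            terms.append((tokens[i],))
            i += 1
    return terms


def count_occurrences(term, doc_tokens):
    """Count positions where the term's tokens appear adjacently in the document."""
    n = len(term)
    return sum(1 for i in range(len(doc_tokens) - n + 1)
               if tuple(doc_tokens[i:i + n]) == term)


def rank(query, docs, phrases):
    """Return ids of matching documents, best score first."""
    terms = to_terms(query, phrases)
    scores = {doc_id: sum(count_occurrences(t, text.lower().split()) for t in terms)
              for doc_id, text in docs.items()}
    hits = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda doc_id: -scores[doc_id])


if __name__ == "__main__":
    print("retained:", rank("the who", DOCS, MEANINGFUL_PHRASES))
    print("blind strip:", rank("the who", DOCS, set()))
```

With the phrase retained, the query 'the who' matches only the band document. With an empty phrase set, which is equivalent to blind stripping, the query has no terms left and nothing is retrieved.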
Quality Control
Wrong classification degrades retrieval quality. The patent specifies safeguards.
- Statistical Threshold Calibration — The meaningful-stopword threshold is calibrated against labeled query-result pairs. Mis-calibration produces errors in either direction: noise words retained or intent-bearing words stripped.
- Per-Query Context Reading — Per-query classification reads the surrounding context. Classifying a stopword in isolation is rejected.
- Index-Size Bounds — Index retention bounded to control index size. Selectivity is the trade-off control.
- Continuous Recalibration — Meaningful-stopword corpus and per-query classifiers recalibrate against fresh query log data.
- Multi-Language Coverage — Meaningful-stopword corpora and classifiers are maintained per language, because stopword patterns differ across languages.
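Threshold calibration against labeled data can be sketched as a simple sweep. The example assumes each labeled item is a pair of (meaningfulness score, human label) and that accuracy is the selection criterion; the text names neither the score nor the criterion, and the data below is made up for illustration.

```python
# (score, is_meaningful) pairs; hypothetical labeled examples.
LABELED = [
    (0.9, True), (0.8, True), (0.6, True),
    (0.4, False), (0.3, False), (0.1, False),
]


def error_counts(labeled, threshold):
    """Return (retention_errors, stripping_errors) at the given threshold.

    Retention error: noise scored at or above the threshold, so it is kept.
    Stripping error: a meaningful stopword scored below it, so it is dropped.
    """
    retention = sum(1 for score, label in labeled if score >= threshold and not label)
    stripping = sum(1 for score, label in labeled if score < threshold and label)
    return retention, stripping


def calibrate(labeled, candidates):
    """Pick the candidate threshold with the fewest total errors."""
    return min(candidates, key=lambda th: sum(error_counts(labeled, th)))


if __name__ == "__main__":
    for th in [0.2, 0.5, 0.7]:
        print(th, error_counts(LABELED, th))
    print("chosen:", calibrate(LABELED, [0.2, 0.5, 0.7]))
```

A threshold that is too low (0.2) retains two noise items, one that is too high (0.7) strips one meaningful item, and the middle candidate makes no errors on this toy set. Rerunning the same sweep on fresh labeled data is one way to read the continuous-recalibration safeguard.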
Real-World Application
Meaningful-stopword detection is foundational to short-tail query understanding. Selective retention based on per-query meaningfulness is a pattern that modern query understanding relies on.
- Classification Granularity: Per Query — Each stopword is classified as meaningful or stripping-eligible based on per-query context.
- Detection Scope: Phrase — Multi-word stop-phrases are detected at phrase scope, not word by word.
- Identification Method: Statistical — Query logs and document corpora drive meaningful-stopword identification via material-effect analysis.
Why Short Queries Depend On Stopwords
Short-tail queries often hinge on stopwords for intent. 'The who' versus 'who'; 'how to' versus 'how'. Meaningful-stopword retention preserves intent that blind stripping destroys.
Why Writing Naturally Helps
Content written in natural language preserves stop-phrase patterns that match how users actually query. SEO-optimized prose that strips connective stopwords may match less well against natural-query intent.
What This Means for SEO
This patent detects when stopwords carry meaning ('the who', 'how to', 'to be') and retains them through retrieval instead of stripping them. SEO implication: writing in natural language preserves the connective and stop-phrase patterns that short-tail queries depend on for intent.
- Short Queries Hinge On Stopwords — Intent in short queries often lives in the stopwords: 'the who' versus 'who', 'how to' versus 'how'. Content that preserves these phrasings matches the retained-stopword intent that blind stripping would destroy.
- Write Naturally, Do Not Strip Connectives — SEO-optimized prose that drops connective stopwords can match less well against natural queries. Natural language preserves stop-phrase patterns that mirror how users actually type.
- Stop-Phrases Carry Phrase-Level Intent — Multi-word stop sequences like 'how to', 'as a', and 'in order to' are detected at phrase scope and carry action or relational intent. Using these phrases naturally aligns your content with intent-bearing query phrases.
- Meaningfulness Is Context-Dependent — The same stopword is meaningful in one query and noise in another, judged per query. Coherent natural phrasing gives the classifier the context to read your stopwords as meaningful where it counts.
- Do Not Over-Compress For Keyword Density — Stripping articles and prepositions to densify keywords removes the very tokens that can be intent-bearing. Readable, complete sentences serve short-tail intent better than compressed keyword strings.
- Coverage Is Per-Language — Meaningful-stopword corpora and classifiers are built per language, since patterns differ across languages. Write naturally in each target language rather than translating keyword-stripped text.
- Match The Whole Query, Including The Glue — Because retained stopwords act as full ranking terms, the connective words in a query are part of what you must match. Phrase your content to include the natural glue, not just the content words.