Rejects a synonym for a multi-word phrase when that synonym only captures part of the phrase and silently drops the rest of the information, protecting qualified queries from being collapsed into their head terms.
Patent Overview
- Inventor: Steven D. Baker
- Assignee: Google LLC
- Filed: 2007-12-12
- Granted: 2014-02-25
- Application Number: US 11/955,251
The Challenge
Synonyms That Quietly Lose Information
When the synonym pipeline replaces a multi-word phrase with a shorter form, the substitution can drop meaning. "Apple computer" might be replaced with "Apple": the two terms co-occur in many contexts and so look like a valid synonym pair, but the substitution drops the "computer" component and changes the intent. The cases below show why such substitutions have to be caught before they reach retrieval.
- Partial Substitution Drops Intent — Replacing "red shoes" with "shoes" might be a valid synonym in some narrow contexts, but for the query at hand, the color qualifier was load-bearing. The substitution loses the user's explicit constraint.
- Sub-Component Synonyms Bleed Through — If "Apple" is a known synonym for "Apple computer" (via its sub-component) and also for many other things, the substitution silently broadens the query past the user's actual intent. The bleed-through is hard to detect after the fact.
- Need An Information-Preservation Guard — The pipeline needs to detect when a proposed phrase-level synonym is actually only synonymous with a part of the phrase, and reject those substitutions before they reach retrieval.
- Head-Term Dominance Skews Mining — Upstream synonym mining tends to find more evidence for head terms (single-word concepts) than for the qualified phrases that contain them. The bias means partial-match candidates surface easily.
- Qualifier Semantics Cannot Be Recovered — Once a qualifier is dropped at retrieval time, there is no way to restore it. The substitution must be vetoed before it is applied.
Innovation
Identify Synonyms For Sub-Components, And Veto Overlap
The system identifies both the candidate synonym for the whole phrase and the synonyms for each sub-component. If the phrase-level candidate is also a synonym of a sub-component, the candidate is treated as information-dropping and is rejected as a phrase-level synonym. The check applies uniformly to every sub-component of the phrase: a match against any one of them triggers the veto.
- Receive Candidate Phrase Synonym — A candidate synonym for the full query phrase arrives from upstream mining with its supporting evidence.
- Decompose The Phrase — Identify each sub-component of the phrase (individual words or meaningful shorter sub-phrases). The decomposition can be at the token level or at the phrase-internal-boundary level.
- Mine Sub-Component Synonyms — For each sub-component, look up known synonyms from the existing synonym graph. The sub-component synonyms are the comparison set for the overlap check.
- Check For Overlap — Compare the phrase-level candidate against the set of sub-component synonyms. If the candidate appears as a synonym of any sub-component, the overlap is detected.
- Veto If Overlap Found — If the phrase-level candidate appears as a synonym of any sub-component, reject it. The candidate is not a real phrase-level synonym; it just renames part of the phrase.
- Promote Only True Phrase Synonyms — Candidates that do not overlap with any sub-component synonym are promoted to phrase-level synonyms. These are real phrase-to-phrase substitutions that preserve all information.
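The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `SYNONYM_GRAPH` contents, the function names, and the token-level decomposition are all hypothetical. The sketch also assumes that a sub-component counts as a synonym of itself, so that collapsing "Apple computer" to "Apple" is caught.

```python
from typing import Dict, List, Set

# Hypothetical synonym graph: term -> known synonyms (illustrative data only).
SYNONYM_GRAPH: Dict[str, Set[str]] = {
    "apple": {"apple inc"},
    "computer": {"pc", "machine"},
    "motor": {"engine"},
    "vehicle": {"conveyance"},
}


def sub_components(phrase: str) -> List[str]:
    """Token-level decomposition, one of the granularities the text mentions."""
    return phrase.lower().split()


def is_information_dropping(phrase: str, candidate: str,
                            graph: Dict[str, Set[str]]) -> bool:
    """Return True if the candidate only renames one part of the phrase."""
    cand = candidate.lower()
    for part in sub_components(phrase):
        # Assumption: a sub-component is treated as its own synonym,
        # so the bare head term is vetoed as well.
        if cand == part or cand in graph.get(part, set()):
            return True
    return False


def promote(phrase: str, candidate: str, graph: Dict[str, Set[str]]) -> bool:
    """A candidate is promoted only if it overlaps with no sub-component."""
    return not is_information_dropping(phrase, candidate, graph)
```

With this data, `promote("Apple computer", "Apple", SYNONYM_GRAPH)` is rejected because the candidate matches a sub-component, while `promote("motor vehicle", "automobile", SYNONYM_GRAPH)` passes because "automobile" is not in the synonym set of either "motor" or "vehicle".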
Information Preservation As The Phrase-Level Rule
The patent encodes a simple but powerful constraint: phrase synonymy must preserve the information content of the phrase. A substitution that collapses a qualified phrase to its head term is information-destructive and is rejected outright.
Whole-Phrase Synonymy Required
A phrase synonym must cover the whole phrase, not just one of its parts. If the candidate only matches a sub-component, the candidate is rejected at the phrase level.
- Sub-Component Synonym Lookup — The veto requires knowing the synonyms of each sub-component. This is data the synonym graph already has from upstream mining.
- Overlap Veto — If the phrase-level candidate is also in any sub-component's synonym set, the candidate is rejected. The check is symmetric across all sub-components.
Phrase synonymy is conjunctive over sub-components, not disjunctive: a candidate must account for every sub-component, not merely one of them.
Technical Foundation
The Information-Preservation Test
The check enforces that a phrase-level synonym must cover the whole phrase, not just rename one of its parts.
- Phrase-Level Candidate — A proposed synonym for the entire multi-word query phrase. Arrives from upstream mining with its own evidence.
- Sub-Component Synonym Sets — Known synonyms for each individual sub-component of the phrase. Pulled from the existing synonym graph.
- Overlap Veto — If the phrase-level candidate is also in any sub-component's synonym set, the candidate is rejected on the grounds that it only covers part of the phrase.
- Decomposition Granularity — The level at which the phrase is split into sub-components. Token-level splits are the most aggressive; phrase-internal boundaries are more conservative.
Key Insight: Phrase synonymy is inherently conjunctive. A phrase-level synonym must validate against the whole phrase, not just match one of its constituents. The overlap veto enforces this property cheaply by reusing the sub-component synonym data the system already has.
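Decomposition granularity is the one knob in this list that changes which synonym sets the veto consults. The sketch below shows one way the two granularities could differ. It is an assumption-laden illustration: the `decompose` function and the `known_units` parameter (a set of multi-word units that should stay together) are hypothetical and are not described in the source.

```python
from typing import FrozenSet, List


def decompose(phrase: str,
              known_units: FrozenSet[str] = frozenset()) -> List[str]:
    """Split a phrase into sub-components.

    With no known_units this is a plain token-level split. When known_units
    is given, the longest matching unit starting at each position is kept
    together as a single sub-component.
    """
    tokens = phrase.lower().split()
    parts: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest span first; a single token always matches.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            is_whole_phrase = (i == 0 and j == len(tokens))
            if j - i == 1 or (span in known_units and not is_whole_phrase):
                parts.append(span)
                i = j
                break
    return parts


token_level = decompose("new york pizza")
boundary_level = decompose("new york pizza", frozenset({"new york"}))
```

Here `token_level` contains three sub-components and `boundary_level` contains two. The token-level split consults the synonym sets of "new" and "york" separately, so it can veto more candidates, which matches the description of token-level splits as the more aggressive option.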
The Process
Where The Check Runs
The veto runs as a quality gate between phrase-pair mining and the runtime synonym table. The gate is cheap because it depends only on existing synonym graph data.
- Phrase Pair Generation — Upstream mining produces a candidate (phrase, synonym) pair from session-based or document-based evidence.
- Sub-Component Decomposition — The phrase is broken into sub-components for the overlap check.
- Synonym Lookup Per Sub-Component — For each sub-component, look up its known synonyms in the existing synonym graph.
- Overlap Detection — Check whether the candidate phrase-level synonym appears in any sub-component's synonym set.
- Final Disposition — Vetoed pairs are dropped. Surviving pairs continue to any remaining quality gates and ultimately to the runtime synonym table.
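As a rough picture of the gate sitting between mining and the runtime table, the following sketch filters a batch of mined pairs into surviving and vetoed lists. The function name `overlap_gate`, the sample data, and the token-level split are assumptions made for illustration. As a further assumption, a candidate equal to a sub-component is treated as overlapping.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (phrase, candidate synonym)


def overlap_gate(pairs: Iterable[Pair],
                 graph: Dict[str, Set[str]]) -> Tuple[List[Pair], List[Pair]]:
    """Split mined pairs into (surviving, vetoed) using the overlap check."""
    surviving: List[Pair] = []
    vetoed: List[Pair] = []
    for phrase, candidate in pairs:
        cand = candidate.lower()
        parts = phrase.lower().split()  # token-level decomposition (assumed)
        # Veto if the candidate is a sub-component or one of its synonyms.
        overlaps = any(cand == p or cand in graph.get(p, set()) for p in parts)
        (vetoed if overlaps else surviving).append((phrase, candidate))
    return surviving, vetoed


# Illustrative data only; a real synonym graph would come from upstream mining.
graph = {"shoes": {"footwear", "sneakers"}, "red": {"crimson"}}
mined = [
    ("red shoes", "footwear"),          # renames only "shoes"
    ("red shoes", "crimson sneakers"),  # covers both parts
    ("motor vehicle", "automobile"),    # no sub-component overlap
]
surviving, vetoed = overlap_gate(mined, graph)
```

Only the lookups into `graph` are needed, which reflects the point that the gate reuses data the synonym graph already holds. Surviving pairs would then move on to whatever quality gates remain.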
What This Means for SEO
The information-preservation rule is one reason why long-tail, qualified queries are protected from being collapsed into their head terms by the synonym graph. This has direct implications for keyword targeting and content scoping.
- Qualifiers Are Preserved — Adjectives, colors, sizes, brands, and other qualifiers in a query phrase are not silently dropped. Your content can rely on the qualifier remaining part of the intent, which means qualified long-tail queries are real targeting opportunities.
- Long-Tail Pages Should Earn Their Qualifiers — If you write a page targeting a qualified long-tail query ("vegan running shoes for flat feet"), each qualifier must be reflected in the content. The system protects the qualifier semantics; your content should too.
- Phrase-Level Synonyms Require Phrase-Level Evidence — Real phrase-level synonyms exist ("motor vehicle" and "automobile"), but they need evidence at the phrase level, not just at the component level. This may be one reason qualified phrases like "running shoes" rarely pick up synonym alternatives in practice.
- Head-Term Targeting Does Not Steal Long-Tail — A page that ranks well for "shoes" will not automatically capture all "red running shoes" traffic because the synonym graph protects the qualified phrasing. The two queries route to potentially different documents.
- Don't Strip Qualifiers In Titles — If your page is the canonical answer for a qualified query, the title should reflect the qualifier. Stripping it to make the title cleaner sacrifices the very signal the information-preservation rule rewards.
- Brand-Plus-Product Stays Together — Compound product names (brand plus product line) are not split by the synonym graph. "Nike Air Max" does not get treated as a synonym of "Nike" alone. Target the full compound when you want the compound traffic.