Applies form-level synonym substitutions (singular-plural, conjugation, common morphological variants) as part of query processing, handling form variance transparently at retrieval time.
Patent Overview
- Inventor: Steven D. Baker
- Assignee: Google LLC
- Filed: 2013-06-25
- Granted: 2015-11-10
- Application Number: US 13/926,432
The Challenge
Form Variants Should Not Force Different Pages
A query for "running shoes" and a query for "run shoes" express the same intent. The system should treat form-level variation (singular-plural, verb tense, common morphological derivations) as transparent during retrieval rather than forcing the engine to find separate matches for each form. Without this transparency, the index would have to carry separate entries for every form and content creators would have to author duplicate variants for each.
- Form Variants Are Constant Noise — Real users mix forms freely. Treating each form as a separate query forces the engine to do duplicate retrieval work and fragments the result population.
- Stemming Is Too Aggressive — Aggressive stemming collapses too many things ("university" to "univers") and can match unrelated documents. A more selective lexical synonym pass is needed.
- Context Sometimes Matters For Form — Singular and plural can carry different intent in some queries. The system should be able to consider form variants without blindly applying them.
- Morphology Is Language-Specific — English has limited morphology; German, Finnish, and Turkish have much richer morphological systems. The synonym table must be language-aware to handle each correctly.
- Lexical Synonyms Are Distinct From Semantic Synonyms — Form variants like singular-plural are different from concept synonyms like car-automobile. Mixing them in one table loses the ability to apply different confidence thresholds.
Innovation
Lexical Synonym Substitution At Query Time
When a query arrives, the system identifies lexical synonyms for each term (form-level variants tracked in a curated table), generates an altered query that includes the lexical synonyms, and processes the altered query to produce search results. The substitution is per-term, not per-phrase, which keeps it cheap and prevents phrase-level intent from being disrupted.
- Receive Query — A query containing one or more terms arrives. The query may already have been through earlier preprocessing (tokenization, normalization).
- Identify Lexical Synonyms Per Term — For each term in the query, look up lexical (form-level) synonyms from the synonym table. The lookup is constant-time per term.
- Generate Altered Query — Construct an altered query that incorporates the lexical synonyms, allowing the engine to retrieve documents containing any of the variant forms. The altered query expresses the union of forms.
- Process The Altered Query — Run the altered query against the index. Retrieval can now match documents that used variant forms of the original terms without requiring the document to contain the literal query form.
- Preserve Original-Form Preference — When ranking results, documents containing the original-form term may be preferred over those containing only variant forms. This keeps form-faithful matches at the top while still allowing variant-form matches to participate.
- Return Results — Return the search results from the altered query to the user. The substitution is transparent to the user and to most downstream pipeline stages.
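The first three steps above (receive, look up per term, generate the altered query) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the table contents, `LEXICAL_TABLE`, `expand_term`, `build_altered_query`, and `render` are all hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from typing import Dict, List

# Hypothetical curated table: each term maps to its known form variants.
LEXICAL_TABLE: Dict[str, List[str]] = {
    "shoe": ["shoes"],
    "shoes": ["shoe"],
    "run": ["running", "ran", "runs"],
    "running": ["run", "ran", "runs"],
}


def expand_term(term: str) -> List[str]:
    """Return the original term first, followed by its lexical variants."""
    # A dict lookup is constant-time on average, matching the per-term lookup step.
    return [term] + LEXICAL_TABLE.get(term, [])


def build_altered_query(query: str) -> List[List[str]]:
    """Build one group per term; forms inside a group are alternatives."""
    # Assumption: lowercasing plus whitespace splitting stands in for tokenization.
    return [expand_term(term) for term in query.lower().split()]


def render(altered: List[List[str]]) -> str:
    """Render the altered query as a boolean expression for inspection."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in altered)


altered = build_altered_query("Running shoes")
print(render(altered))
# (running OR run OR ran OR runs) AND (shoes OR shoe)
```

Keeping the original form at the front of each group is a design choice of this sketch: it lets a later ranking stage tell original-form matches apart from variant-only matches.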
Form Variance Resolved At Query Time
Rather than indexing every form variant separately (which would bloat the index) or stemming aggressively (which would conflate too much), the system handles form variance at query time using a curated lexical synonym table.
Per-Term, Not Per-Phrase
The substitution applies term by term. Phrase-level intent is preserved because no phrase boundary is crossed by the substitution.
- Curated Lexical Table — The lexical synonym table is curated rather than mined. Each entry is a known form-variant pair (singular/plural, conjugation, common derivation).
- Altered Query Generation — The altered query expresses the union of original and variant forms. Retrieval finds documents matching any form.
Form variants are handled invisibly so authors and searchers do not have to coordinate.
Technical Foundation
Lexical Vs Semantic Synonyms
This patent is explicitly about lexical (form-level) synonyms, distinct from the semantic synonyms covered by document and session mining. The two synonym types have different confidence thresholds and different application logic.
- Lexical Synonym — A form variant of a term: singular/plural, verb conjugations, common derivational suffixes. Curated rather than mined.
- Altered Query — The original query with lexical synonym substitution applied. Used in place of the literal query for retrieval.
- Per-Term Lookup — The substitution is applied per term, not per phrase. Phrase-level synonymy is handled by other components.
- Language-Specific Table — The lexical synonym table is language-specific. English, German, and Japanese each have their own table with morphology-appropriate entries.
Key Insight: Splitting lexical and semantic synonym handling into different tables (and pipelines) lets each have appropriate confidence thresholds. Lexical synonyms are nearly always safe to apply; semantic synonyms require contextual gating. Mixing them would force the system to use one threshold for both, sacrificing precision somewhere.
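To make the threshold argument concrete, here is a hedged sketch of two separate tables, each with its own acceptance threshold. The confidence scores, threshold values, and all names (`Candidate`, `LEXICAL`, `SEMANTIC`, `accepted`, `expand`) are invented for illustration. A plain score cutoff also stands in for the contextual gating that semantic synonyms would need in practice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    variant: str
    confidence: float  # hypothetical score in [0, 1]


# Hypothetical tables, kept apart so each can use its own threshold.
LEXICAL: Dict[str, List[Candidate]] = {
    "car": [Candidate("cars", 0.98)],
}
SEMANTIC: Dict[str, List[Candidate]] = {
    "car": [Candidate("automobile", 0.80), Candidate("vehicle", 0.55)],
}

LEXICAL_THRESHOLD = 0.50   # permissive: form variants are nearly always safe
SEMANTIC_THRESHOLD = 0.75  # strict: concept synonyms need more evidence


def accepted(term: str, table: Dict[str, List[Candidate]], threshold: float) -> List[str]:
    """Return the variants whose confidence meets the given threshold."""
    return [c.variant for c in table.get(term, []) if c.confidence >= threshold]


def expand(term: str) -> Dict[str, List[str]]:
    """Apply each table with its own threshold."""
    return {
        "lexical": accepted(term, LEXICAL, LEXICAL_THRESHOLD),
        "semantic": accepted(term, SEMANTIC, SEMANTIC_THRESHOLD),
    }


print(expand("car"))
# {'lexical': ['cars'], 'semantic': ['automobile']}
```

If both tables were forced to share the permissive 0.50 cutoff, "vehicle" would be admitted as well, which is the precision loss the key insight describes.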
The Process
Query Processing With Lexical Substitution
The lexical substitution stage is part of standard query processing. It runs after tokenization and before retrieval, applying its substitutions transparently.
- Tokenize Query — Split the query into terms (and possibly phrases). Each term becomes a candidate for lexical substitution.
- Lexical Lookup Per Term — For each term, look up its known form variants in the lexical synonym table.
- Build Altered Query — Compose the altered query by including the original term and its variants in disjunction. The altered query has more breadth than the original.
- Retrieve Against Index — Run the altered query against the inverted index. Documents matching any form variant become candidates.
- Rank With Original-Form Preference — Apply ranking signals, possibly preferring documents that match the original form over variant-only matches when other signals are equal.
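The retrieval and ranking steps can be illustrated with a toy inverted index. This is a sketch under stated assumptions: the three documents, the `build_index`, `retrieve`, and `rank` helpers, and the scoring rule are all hypothetical. The ranking here counts original-form matches only, whereas a real engine would combine many signals and use form preference mainly when those are otherwise equal.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toy corpus.
DOCS: Dict[int, str] = {
    1: "best running shoes for trail",
    2: "how to run in new shoes",
    3: "shoe repair guide",
}


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def retrieve(altered: List[List[str]], index: Dict[str, Set[int]]) -> Set[int]:
    """OR within each group of forms, AND across groups."""
    result = None
    for group in altered:
        matched: Set[int] = set()
        for form in group:
            matched |= index.get(form, set())
        result = matched if result is None else result & matched
    return result or set()


def rank(candidates: Set[int], altered: List[List[str]], index: Dict[str, Set[int]]) -> List[int]:
    """Prefer documents matching more original forms (group[0]); tie-break by id."""
    def original_hits(doc_id: int) -> int:
        return sum(1 for group in altered if doc_id in index.get(group[0], set()))
    return sorted(candidates, key=lambda d: (-original_hits(d), d))


index = build_index(DOCS)
# Altered query for "running shoes"; the original form leads each group.
altered = [["running", "run"], ["shoes", "shoe"]]
print(rank(retrieve(altered, index), altered, index))
# [1, 2]
```

Document 2 is retrieved only because "run" is accepted as a variant of "running", and it ranks below document 1, which contains both original forms. Document 3 matches "shoe" but no form of "running", so it is not a candidate.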
What This Means for SEO
Lexical synonyms are why most form-level variation does not require separate pages. Internalizing this rule simplifies a lot of content strategy decisions and prevents accidental over-fragmentation of your content.
- Don't Build Pages For Form Variants — Singular vs plural, verb tense variants, and common morphological forms are handled by lexical synonyms. One canonical page is enough. Splitting them creates near-duplicate content for no retrieval benefit.
- Pick The Most Natural Form For Your H1 — Choose the surface form your audience actually uses for the H1 and primary copy. The system maps variants to your page either way; pick the one with the cleaner read for human visitors.
- URL Slugs Should Use Consistent Form — Use one canonical form (typically the form users search for most) in your slug. Avoid creating URL variants for plural vs singular; redirect non-canonical forms to keep the canonical authority concentrated.
- Long-Tail Variants Should Still Use Lexical Cousins — When you cover long-tail variants in body copy, you do not need to repeat every form. One mention plus natural language usage of cousin forms is sufficient because the lexical substitution handles the rest.
- Verb-Form Pages Are Almost Always Wrong — Pages targeting different conjugations of the same verb ("run", "running", "ran") are nearly always wasted effort. The lexical pipeline collapses them at query time. Build one canonical page per intent.
- Original-Form Preference Still Rewards Exact Match — Although variants are matched, ranking may still prefer documents that contain the original query form. This is the argument for using the form your searchers use most often rather than your own internal terminology.
- Language-Specific Pages Need Their Own Lexical Coverage — When you localize content, the target language has its own lexical synonym table. Pick the most-searched form in the target language; do not assume the same canonical form your English version used will be optimal everywhere.
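For the URL slug advice above, one way to keep authority on a single canonical URL is a small redirect map from non-canonical slug forms to the canonical one. The following is a minimal sketch with hypothetical slugs and a hypothetical `redirect_target` helper. It is not tied to any particular web framework, and the caller is assumed to issue a permanent redirect when a target is returned.

```python
from typing import Dict, Optional

# Hypothetical map from non-canonical slug forms to the canonical slug.
CANONICAL_SLUGS: Dict[str, str] = {
    "running-shoe": "running-shoes",
    "run-shoes": "running-shoes",
}


def redirect_target(path: str) -> Optional[str]:
    """Return the canonical path to redirect to, or None if no redirect is needed."""
    slug = path.strip("/").lower()
    canonical = CANONICAL_SLUGS.get(slug)
    return f"/{canonical}/" if canonical else None


print(redirect_target("/running-shoe/"))   # /running-shoes/
print(redirect_target("/running-shoes/"))  # None
```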