Generates query candidates by analyzing structural similarity between documents (template patterns, section headings, layout), enabling retrieval of structurally-equivalent content the user did not literally search for.
Patent Overview
- Inventor: Srinivasan Venkatachary
- Assignee: Google LLC
- Filed: 2009-11-09
- Granted: 2013-01-01
- Application Number: US 12/615,028
The Challenge
Documents that serve the same purpose often share structure (product pages have specs, recipe pages have ingredients-and-steps, news articles have lead-and-body). Users searching for one such document would benefit from finding structurally-equivalent alternatives, but text-only retrieval misses the structural signal.
- Text Similarity Misses Structural Equivalence — Two product pages with the same shape but different products have low text overlap. Structural similarity catches the relationship pure text retrieval cannot.
- Document Templates Encode Purpose — When documents share template patterns, they typically serve the same purpose. Identifying the template is a way to identify functional equivalents.
- Layout And Heading Patterns Are Signal — Section heading sequences, layout grids, and table patterns all carry information about document type. The system can read these structural signals.
- Generated Queries Expand Recall — Structural-similarity-derived queries find documents the original literal query missed. The system can return structurally-equivalent alternatives the user might prefer.
- Generation Must Preserve Intent — Generated queries cannot drift from the user's intent. Structural similarity must augment, not replace, the literal query.
Innovation
How The System Works
The system analyzes the structural pattern of documents (template, headings, layout), identifies structurally-similar documents, derives query candidates that would retrieve them, scores candidates against the user's likely intent, and uses high-scoring candidates to expand retrieval beyond literal-query results.
- Extract Document Structure — Per document, extract template signals: heading sequence, section pattern, layout grid, table structure. Output is a structural fingerprint.
- Cluster By Structural Similarity — Documents with similar fingerprints cluster together. Each cluster represents a structural template type (product page, recipe, news article).
- Map User Query To Structural Class — Given the user's query and retrieved seed documents, identify the structural class the user is searching within.
- Generate Query Candidates — From within the structural cluster, generate candidate queries that would retrieve structurally-similar but not text-similar documents. Candidates explore the structural class.
- Score Candidates For Intent Preservation — Each candidate scores on intent preservation. Candidates that drift from the original intent are filtered out.
- Expand Retrieval — Top candidates retrieve additional structurally-similar documents. Combined with the original retrieval, the result set covers structural equivalents.
- Rank Combined Results — Standard ranking applies to the combined set. Users see both literal-match and structurally-similar documents in the SERP.
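The first step above, extracting a structural fingerprint, can be illustrated with a minimal sketch. It assumes a fingerprint is simply the ordered sequence of a few structural HTML tags; the tag set, the `StructureExtractor` class, and the `structural_fingerprint` function are hypothetical names chosen for illustration, not the patent's extractor.

```python
from html.parser import HTMLParser

# Tags treated as structural signal in this sketch (an assumption, not from the patent).
STRUCTURAL_TAGS = {"h1", "h2", "h3", "table", "ul", "ol", "form"}


class StructureExtractor(HTMLParser):
    """Collects the ordered sequence of structural tags and ignores all text."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in STRUCTURAL_TAGS:
            self.sequence.append(tag)


def structural_fingerprint(html: str) -> tuple:
    """Return the page's structural tag sequence as a hashable fingerprint."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return tuple(parser.sequence)


# Two product pages with different text but the same shape.
page_a = ("<h1>Acme Drill</h1><h2>Specs</h2><table><tr><td>18V</td></tr></table>"
          "<h2>Reviews</h2><ul><li>Great</li></ul>")
page_b = ("<h1>Zen Kettle</h1><h2>Details</h2><table><tr><td>1.7L</td></tr></table>"
          "<h2>Ratings</h2><ul><li>Fine</li></ul>")

fp_a = structural_fingerprint(page_a)
fp_b = structural_fingerprint(page_b)
print(fp_a)          # ('h1', 'h2', 'table', 'h2', 'ul')
print(fp_a == fp_b)  # True
```

Because the extractor discards text entirely, the two pages produce identical fingerprints even though they share almost no words. A production extractor would presumably also need to tolerate cosmetic layout differences.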
Structure As A Retrieval Dimension
The patent's load-bearing idea is to use document structure as a retrieval signal complementing text. Structurally-equivalent documents serve the same purpose; finding them expands the useful result set.
Templates Encode Purpose
When documents share structural patterns, they typically serve the same functional purpose. The structural pattern is a more stable signal of purpose than the specific text in any one instance.
- Structural Fingerprints — Heading sequences, layout patterns, section structures form fingerprints that identify document templates. Fingerprints are the substrate for similarity.
- Structural Clustering — Documents cluster by structural fingerprint. Clusters represent template types: product pages, recipes, news articles.
- Query Generation From Cluster — Within a cluster, query candidates expand retrieval to structurally-equivalent documents. Candidates explore the template's coverage.
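To make "fingerprints are the substrate for similarity" concrete, here is a small worked example. It assumes fingerprints are tag sequences and measures similarity as Jaccard overlap of their bigrams; that metric is an illustrative choice, not one the source attributes to the patent.

```python
def shingles(seq, n=2):
    """Set of contiguous n-grams from a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: ordered structural tags per page.
product_1 = ["h1", "h2", "table", "h2", "ul"]
product_2 = ["h1", "h2", "table", "h2", "ul", "h2"]
recipe = ["h1", "h2", "ul", "h2", "ol"]

sim_products = jaccard(shingles(product_1), shingles(product_2))
sim_mixed = jaccard(shingles(product_1), shingles(recipe))

# Text overlap between the two product pages, for contrast.
text_1 = set("acme cordless drill 18v battery".split())
text_2 = set("zen electric kettle 1.7l steel".split())
sim_text = jaccard(text_1, text_2)

print(round(sim_products, 2))  # 0.8
print(round(sim_mixed, 2))     # 0.33
print(sim_text)                # 0.0
```

The two product pages score 0.8 on structure and 0.0 on text, which is the gap the article describes: the pages are functional equivalents that text-only retrieval would not connect.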
Technical Foundation
The patent specifies the structural-fingerprint extractor, the clustering algorithm, the query-candidate generator, the intent-preservation scorer, and the retrieval-expansion logic.
- Structural Fingerprint Extractor — Per document, extracts heading sequence, layout grid, section pattern. The output is a vector encoding structural signal.
- Structural Clustering — Documents cluster by fingerprint similarity. Hierarchical or graph-based clustering produces template-type clusters.
- Query Class Mapper — Given the user's query, retrieved documents, and their structural cluster, identifies the active structural class for query generation.
- Candidate Query Generator — Within the structural cluster, derives query candidates that would retrieve structurally-similar documents. Uses template variation patterns.
- Intent Preservation Scorer — Per candidate, scores on intent preservation. Candidates that drift from the original intent are filtered.
- Retrieval Expansion — Top candidates retrieve in parallel with original query. Combined set goes to ranking.
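As one possible reading of "graph-based clustering", the sketch below links any two documents whose fingerprint similarity meets a threshold and takes the connected components as template clusters. The feature sets, the 0.5 threshold, and the `cluster_by_similarity` function are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_by_similarity(fingerprints: dict, threshold: float = 0.5) -> list:
    """Group doc ids whose fingerprints are linked by similarity >= threshold.

    Computes connected components of the similarity graph with union-find.
    """
    parent = {doc: doc for doc in fingerprints}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    docs = list(fingerprints)
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for doc in docs:
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical fingerprints expressed as sets of structural features.
fingerprints = {
    "drill": {"h1", "specs-table", "price", "reviews"},
    "kettle": {"h1", "specs-table", "price", "faq"},
    "soup": {"h1", "ingredients-list", "steps-list"},
    "cake": {"h1", "ingredients-list", "steps-list", "nutrition-table"},
}

clusters = cluster_by_similarity(fingerprints)
print(clusters)  # [['cake', 'soup'], ['drill', 'kettle']]
```

The pairwise comparison is quadratic in the number of documents, so this only illustrates the idea; the source notes that fingerprinting is precomputed offline, where a scalable variant would run.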
The Process
The pipeline runs in the query path. Structural fingerprinting is precomputed offline; query generation and expansion happen at query time within the latency budget.
- Receive Query And Initial Retrieval — Standard retrieval produces seed documents for the literal query. These feed structural analysis.
- Identify Structural Class — From the seed documents, identify the structural class (template type) the user is searching within.
- Generate Candidate Queries — Within the structural cluster, derive candidate queries that retrieve structurally-similar documents.
- Score Intent Preservation — Per candidate, score intent preservation. Filter candidates that drift.
- Run Parallel Retrieval — Top candidates retrieve in parallel with the original query. Each candidate produces its own result set.
- Merge And Rank — Combined result set goes to standard ranking. Deduplication handles overlap.
- Render SERP — Users see literal-match and structurally-similar documents in the ranked SERP.
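The parallel-retrieval and merge steps above can be sketched as follows. The `retrieve` function and its in-memory index are hypothetical stand-ins for a real retrieval backend, and the merge stops before ranking, which the source treats as a separate standard step.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory index standing in for a real retrieval backend (hypothetical).
TOY_INDEX = {
    "cordless drill": ["doc1", "doc2"],
    "cordless drill specs": ["doc2", "doc3"],
    "power drill comparison": ["doc4", "doc1"],
}


def retrieve(query: str) -> list:
    """Hypothetical retrieval call; returns doc ids in rank order."""
    return TOY_INDEX.get(query, [])


def expanded_retrieval(original: str, candidates: list) -> list:
    """Run the original and candidate queries in parallel, then merge.

    Duplicates are dropped, keeping the first occurrence, so results for
    the original query come first.
    """
    queries = [original] + candidates
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(retrieve, queries))  # keeps query order

    merged, seen = [], set()
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


merged = expanded_retrieval(
    "cordless drill",
    ["cordless drill specs", "power drill comparison"],
)
print(merged)  # ['doc1', 'doc2', 'doc3', 'doc4']
```

Each candidate produces its own result list, and deduplication handles the overlap (`doc2` and `doc1` each appear twice across the three lists but once in the merged set).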
Quality Control
Incorrect structural matches expand retrieval with irrelevant documents. The patent specifies safeguards.
- Fingerprint Stability — Structural fingerprints must be stable across minor layout variations. Fingerprint extraction is calibrated to ignore cosmetic differences.
- Cluster Coherence — Clusters must be coherent (members really share a template). Coherence is monitored; bad clusters are split or refined.
- Intent Preservation Strictness — Candidates must preserve intent strictly. Drift-prone candidates are filtered before retrieval.
- Bounded Expansion — Number of expansion candidates per query is bounded. Too many expansion candidates dilute the result set.
- Outcome Monitoring — Engagement on expansion-derived results vs original-query results is monitored. Persistent poor performance triggers parameter adjustment.
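Two of these safeguards, intent-preservation filtering and bounded expansion, are easy to illustrate together. The sketch below uses a deliberately crude scorer (the fraction of original query terms a candidate keeps); the scorer, the `min_score` threshold, and the `max_candidates` cap are assumptions, since the source does not say how the patent scores intent.

```python
def intent_score(original: str, candidate: str) -> float:
    """Fraction of the original query's terms kept by the candidate.

    A crude stand-in for an intent-preservation scorer.
    """
    original_terms = set(original.lower().split())
    candidate_terms = set(candidate.lower().split())
    if not original_terms:
        return 0.0
    return len(original_terms & candidate_terms) / len(original_terms)


def select_candidates(original, candidates, min_score=0.5, max_candidates=2):
    """Drop drifting candidates, then keep at most max_candidates of the rest."""
    scored = [(intent_score(original, c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_score]
    # sort() is stable, so candidates with equal scores keep their input order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:max_candidates]]


chosen = select_candidates(
    "cordless drill",
    ["cordless drill specs", "drill comparison", "cordless drill reviews", "kettle"],
)
print(chosen)  # ['cordless drill specs', 'cordless drill reviews']
```

Here "kettle" is filtered for drifting from the query, and "drill comparison" passes the filter but is cut by the cap, so at most two expansion queries reach retrieval.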
Real-World Application
Structural-similarity query expansion is relevant to how a search engine such as Google can retrieve template-equivalent content: product alternatives, recipe variants, news-style coverage of similar events. The same primitives apply to e-commerce and content recommendation.
- Template-based Similarity Dimension — Templates encode purpose. Structural similarity finds documents serving the same purpose with different content.
- Cluster-driven Generation Source — Query candidates derive from within the structural cluster. The cluster bounds what equivalents the system retrieves.
- Parallel Retrieval Pattern — Original and expanded queries retrieve in parallel. Combined ranking selects the best across both.
Why Consistent Templates Help Discoverability
Pages following recognized template patterns (well-structured product pages, well-structured recipe pages) cluster cleanly into template types and surface as structurally-equivalent alternatives in expanded retrieval.
Why Schema Markup Reinforces Template Signal
Structured data (Schema.org Product, Recipe, Article) gives the template detector clean signal. Pages with strong schema coverage cluster more reliably and surface in template-expansion retrievals more often.
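For readers unfamiliar with the markup in question, here is a minimal sketch of Schema.org `Recipe` structured data, built as JSON-LD with Python's standard `json` module. The recipe values are invented for illustration, and whether a given system consumes this markup as a template signal is the article's claim, not something the snippet demonstrates.

```python
import json

# Minimal Schema.org Recipe description; the values are illustrative only.
recipe_jsonld = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Tomato Soup",
    "recipeIngredient": ["4 tomatoes", "1 onion", "500 ml stock"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Chop the vegetables."},
        {"@type": "HowToStep", "text": "Simmer in stock for 20 minutes."},
    ],
}

# JSON-LD is embedded in a page inside a script tag of this type.
payload = json.dumps(recipe_jsonld, indent=2)
script_tag = '<script type="application/ld+json">\n' + payload + "\n</script>"
print(script_tag)
```

The markup states the ingredients-and-steps structure explicitly, so the page's type does not have to be inferred from headings and layout alone.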
What This Means for SEO
The patent uses document structure (templates, headings, layout) as a retrieval signal, finding structurally-equivalent documents and generating queries that retrieve them. SEO implication: consistent, recognizable template patterns and supporting schema help your pages cluster by purpose and surface as alternatives in expanded retrieval.
- Consistent Templates Aid Discoverability — Pages following recognized template patterns (well-structured product or recipe pages) cluster cleanly into template types and surface as structurally-equivalent alternatives in expanded retrieval. Use a consistent, purpose-matched template for each content type.
- Schema Reinforces Template Signal — Structured data (Schema.org Product, Recipe, Article) gives the template detector clean signal. Pages with strong schema coverage cluster more reliably and surface in template-expansion retrievals more often than markup-poor pages.
- Templates Encode Purpose — Shared structure signals shared functional purpose more stably than the specific text in any one page. Structuring a page to clearly express its purpose (specs for products, ingredients-and-steps for recipes) makes its purpose legible to the structural detector.
- Structurally-Equivalent Pages Compete Together — The system expands retrieval to structurally-similar documents serving the same purpose. Your page can surface for queries it did not literally match if it is the structural equivalent of a strong result. Matching the purpose-template widens reach.
- Headings And Layout Are Signals — The detector reads headings and layout, not just body text. Clear, conventional section headings that match the content type strengthen the structural fingerprint, helping the system place you in the right template cluster.
- Follow Established Patterns For Your Type — Recognized patterns cluster reliably; idiosyncratic layouts cluster poorly. Adopting the conventional structure for your content type (rather than a unique design) improves clustering and expansion-retrieval eligibility.
- Structure Complements Text Relevance — Structural similarity is a complement to text retrieval, not a replacement. Strong on-topic text plus a clean purpose-matched template together maximize both literal and structural-expansion visibility.