This patent generates query expansions by comparing the structural similarity of documents. It bridges template detection and query rewriting: pages that share structural patterns reveal candidate query expansions.
Patent Overview
- Inventor: Paul Haahr, others
- Assignee: Google LLC
- Filed: 2011
- Granted: 2016-09-06
The Challenge
Query expansion needs candidate terms. Term co-occurrence in documents is one source; structural similarity between documents is another, often richer source — documents sharing a template often share semantically related but lexically diverse content.
- Co-Occurrence Misses Structural Signal — Term co-occurrence captures lexical relationships. Structurally similar documents share semantic ones that lexical analysis misses.
- Templates Carry Meaning — Documents sharing structural templates (recipe pages, product pages, biographies) share semantic patterns that drive valid expansions.
- Structural Similarity Is Measurable — DOM structure, heading patterns, link patterns, table patterns — all quantifiable. Pairwise similarity is computable.
- Expansion Must Generalize — Per query, expansion candidates must generalize across documents. Single-document expansions are too narrow.
- Scale Demands Approximation — Pairwise structural similarity across billions of pages is infeasible. Cluster-based approximation required.
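One way to picture the scale problem is to avoid comparing every pair and compare only documents that land in a shared bucket. The sketch below uses MinHash-style bucketing, which is a swapped-in technique chosen for illustration and not something the source attributes to the patent. The function names, the feature strings, and `num_hashes` are all hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def min_hash(features, seed):
    """One MinHash value: the smallest salted digest over the feature set."""
    return min(hashlib.md5(f"{seed}:{f}".encode()).hexdigest() for f in features)


def candidate_pairs(docs, num_hashes=4):
    """Return only the document pairs that share at least one MinHash bucket.

    docs maps a document id to a set of structural feature strings
    (an assumed representation, not the patent's).
    """
    buckets = defaultdict(set)
    for doc_id, features in docs.items():
        if not features:
            continue  # nothing to hash for a document with no features
        for seed in range(num_hashes):
            buckets[(seed, min_hash(features, seed))].add(doc_id)

    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


docs = {
    "a": {"h1", "ul"},
    "b": {"h1", "ul"},
    "c": {"table", "nav"},
}
pairs = candidate_pairs(docs)
```

Documents with identical feature sets always share a bucket, and documents with no features in common do not, so the expensive similarity score only needs to run on the surviving pairs.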
Innovation
How The System Works
The system measures structural similarity between documents, clusters documents by structural pattern, identifies terms that recur across structurally similar documents, and produces these recurring terms as expansion candidates.
- Extract Structural Features — Per document, extract structural features: DOM patterns, heading distribution, link patterns, table structure, content-section signatures.
- Compute Pairwise Similarity — Per pair of documents in candidate sets, compute structural similarity score.
- Cluster By Structure — Group documents into structural clusters. Each cluster shares a structural template.
- Identify Cluster-Recurring Terms — Per cluster, identify terms recurring across cluster members. These are semantically related expansion candidates.
- Score Expansion Candidates — Per candidate, score by cluster size, term coherence, and topical alignment.
- Apply In Query Expansion — Per query, retrieve cluster-derived expansion candidates and use them in retrieval or refinement.
- Continuous Cluster Refresh — Per crawl, clusters refresh as documents evolve. Expansion candidates stay current.
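The first two steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: tag paths stand in for the "DOM patterns" the text mentions, and Jaccard overlap stands in for the structural similarity score. The source names neither choice, and `StructureExtractor`, `structural_features`, and `jaccard` are hypothetical names.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects tag-path features such as 'html/body/ul/li'."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.features = set()

    def handle_starttag(self, tag, attrs):
        self.features.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack and self.stack.pop() != tag:
                pass


def structural_features(html):
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return frozenset(parser.features)


def jaccard(a, b):
    """Share of features two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


recipe_a = "<html><body><h1>Soup</h1><ul><li>salt</li></ul></body></html>"
recipe_b = "<html><body><h1>Stew</h1><ul><li>beef</li><li>salt</li></ul></body></html>"
product = "<html><body><h1>Lamp</h1><table><tr><td>$9</td></tr></table></body></html>"

same_template = jaccard(structural_features(recipe_a), structural_features(recipe_b))
cross_template = jaccard(structural_features(recipe_a), structural_features(product))
```

The two recipe pages differ in text but share every tag path, so they score as structurally identical, while the product page overlaps only on the outer page skeleton.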
Structure Reveals Semantic Neighbors
The patent's load-bearing idea is that structural similarity between documents captures semantic relationships that lexical co-occurrence cannot. Documents sharing a template share semantic patterns that drive valid query expansions.
Templates Are Semantic Categories
Recipe pages share both structure and semantic patterns. Product pages share both. Biography pages share both. Structure becomes a proxy for semantic category.
- Structural Feature Extraction — DOM, headings, links, tables, content-section signatures. Multi-feature structural fingerprint.
- Structural Clustering — Documents cluster by structural similarity. Cluster membership signals semantic category.
- Cluster-Recurring Terms — Terms recurring across cluster members are semantically related expansion candidates.
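Clustering and term recurrence might look like the following sketch. The greedy single-pass clustering, the 0.7 threshold, and the `min_share` cutoff are assumptions made for illustration. The source says only that documents are clustered and that recurring terms become candidates.

```python
from collections import Counter


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_clusters(fingerprints, threshold=0.7):
    """Assign each document to the first cluster whose seed it resembles."""
    clusters = []  # list of (seed_fingerprint, [doc_ids])
    for doc_id, fingerprint in fingerprints.items():
        for seed, members in clusters:
            if jaccard(seed, fingerprint) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((fingerprint, [doc_id]))  # start a new cluster
    return [members for _, members in clusters]


def recurring_terms(members, texts, min_share=0.6):
    """Terms present in at least min_share of the cluster's documents.

    Whitespace tokenization and no stopword removal, to keep the sketch short.
    """
    doc_freq = Counter()
    for doc_id in members:
        doc_freq.update(set(texts[doc_id].lower().split()))
    cutoff = min_share * len(members)
    return {term for term, count in doc_freq.items() if count >= cutoff}


fingerprints = {
    "r1": frozenset({"h1", "ul", "ol"}),
    "r2": frozenset({"h1", "ul", "ol", "img"}),
    "p1": frozenset({"h1", "table", "button"}),
}
texts = {
    "r1": "simmer the broth then serve",
    "r2": "simmer onions and serve warm",
    "p1": "free shipping on orders",
}
clusters = greedy_clusters(fingerprints)
recipe_terms = recurring_terms(clusters[0], texts, min_share=1.0)
```

Here the two recipe-like fingerprints fall into one cluster and the terms they both contain surface as candidates, while the product-like page forms its own cluster.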
Technical Foundation
The patent specifies the structural feature extractor, similarity calculator, cluster builder, term recurrence analyzer, candidate scorer, and expansion integrator.
- Structural Feature Extractor — Per document, extracts DOM patterns, heading distribution, link patterns, table structure, content-section signatures.
- Similarity Calculator — Pairwise structural similarity score between candidate documents.
- Cluster Builder — Groups documents into structural clusters. Each cluster shares a template.
- Term Recurrence Analyzer — Per cluster, identifies terms recurring across cluster members.
- Candidate Scorer — Per candidate, scores by cluster size, term coherence, topical alignment.
- Expansion Integrator — Per query, retrieves expansion candidates and integrates with retrieval/refinement.
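The candidate scorer combines three signals, but the source does not say how. The sketch below assumes a simple product of a log-damped cluster-size signal, term coherence, and topical alignment. The formula, the `size_cap` parameter, and the `Candidate` fields are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    term: str
    cluster_size: int  # documents in the source cluster
    coherence: float   # share of cluster members containing the term, 0..1
    alignment: float   # topical match with the query, 0..1


def score(candidate, size_cap=1000):
    """Assumed scoring rule: every signal must be strong for a high score."""
    size_signal = min(
        1.0, math.log1p(candidate.cluster_size) / math.log1p(size_cap)
    )
    return size_signal * candidate.coherence * candidate.alignment


def rank(candidates, top_k=5):
    return sorted(candidates, key=score, reverse=True)[:top_k]


on_topic = Candidate("braise", cluster_size=1000, coherence=0.8, alignment=0.9)
off_topic = Candidate("checkout", cluster_size=1000, coherence=0.8, alignment=0.1)
ranked = rank([off_topic, on_topic])
```

A multiplicative rule means a term from a large, coherent cluster still ranks low when it is off-topic for the query, which matches the role the text gives to topical alignment.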
The Process
Structural analysis and clustering run offline. Expansion candidate retrieval runs at query time.
- Extract Features Offline — Per document at indexing, extract structural features.
- Compute Similarity — Pairwise similarity computed within candidate sets.
- Cluster Documents — Clusters built by structural similarity.
- Analyze Term Recurrence — Per cluster, recurring terms identified as candidates.
- Score Candidates — Per candidate, scoring runs.
- Cache Per Query — Per common query, expansion candidates cached.
- Apply At Query Time — Per query, cached candidates retrieved and used in retrieval/refinement.
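The offline and query-time split can be illustrated with a precomputed lookup table. The sketch assumes that a query is linked to a cluster whenever it shares a term with that cluster's recurring terms. That linking rule, along with the names `build_expansion_cache` and `expand_query`, is an assumption, since the source does not describe how queries map to clusters.

```python
def normalize(query):
    return query.lower().strip()


def build_expansion_cache(common_queries, cluster_terms, max_terms=3):
    """Offline step: precompute expansion terms for frequently seen queries.

    cluster_terms is a list of term lists, one per cluster, each assumed
    to be ordered from best to worst candidate.
    """
    cache = {}
    for query in common_queries:
        key = normalize(query)
        query_terms = set(key.split())
        expansions = []
        for terms in cluster_terms:
            if not query_terms & set(terms):
                continue  # this cluster shares no term with the query
            for term in terms:
                if term not in query_terms and term not in expansions:
                    expansions.append(term)
        cache[key] = expansions[:max_terms]
    return cache


def expand_query(query, cache):
    """Query-time step: a dictionary lookup, no clustering work."""
    key = normalize(query)
    return key.split() + cache.get(key, [])


cluster_terms = [
    ["recipe", "ingredients", "servings"],
    ["price", "shipping"],
]
cache = build_expansion_cache(["soup recipe"], cluster_terms)
expanded = expand_query("Soup Recipe", cache)
uncached = expand_query("unknown", cache)
```

Queries missing from the cache pass through unchanged, so the query path never waits on structural analysis.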
Quality Control
Structural clustering quality determines expansion quality. The patent specifies safeguards.
- Cluster-Coherence Validation — Cluster coherence validated. Clusters with low coherence filtered to reduce noisy expansions.
- Topical-Alignment Check — Expansion candidates checked for topical alignment with query. Off-topic expansions filtered.
- Cluster-Size Bounds — Cluster sizes bounded. Too-small clusters lack signal; too-large clusters lose specificity.
- Spam-Template Filter — Clusters dominated by spam templates filtered. Prevents spam-derived expansions.
- Continuous Refresh — Per crawl, clusters refresh as content evolves.
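The cluster-level safeguards above reduce to a filter that runs before any terms are mined. The thresholds below are placeholders, and the `Cluster` fields are hypothetical. The source gives no numeric bounds.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    doc_ids: list
    coherence: float   # 0..1, how tightly members match the template
    spam_share: float  # fraction of members flagged as spam


def keep_cluster(
    cluster,
    min_size=5,            # below this, too little signal
    max_size=50_000,       # above this, too little specificity
    min_coherence=0.5,
    max_spam_share=0.3,
):
    """Return True only when the cluster passes every safeguard."""
    size_ok = min_size <= len(cluster.doc_ids) <= max_size
    return (
        size_ok
        and cluster.coherence >= min_coherence
        and cluster.spam_share <= max_spam_share
    )


clusters = [
    Cluster(list(range(10)), coherence=0.8, spam_share=0.0),  # healthy
    Cluster([1, 2], coherence=0.9, spam_share=0.0),           # too small
    Cluster(list(range(10)), coherence=0.2, spam_share=0.0),  # incoherent
    Cluster(list(range(10)), coherence=0.8, spam_share=0.6),  # spam-heavy
]
kept = [c for c in clusters if keep_cluster(c)]
```

The topical-alignment check is not shown here because it applies per query and per candidate, not per cluster.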
Real-World Application
Structural-similarity expansion acts as a query-understanding signal. Template-derived semantic neighbors can inform query refinement, entity recognition, and content-type classification.
- Multi-feature Structural Fingerprint — DOM, headings, links, tables, content-section signatures combine into structural fingerprint.
- Cluster-based Analysis Granularity — Documents clustered by structural similarity. Per-cluster term recurrence yields candidates.
- Template-aware Semantic Insight — Templates carry semantic category meaning. Structurally similar documents share semantic patterns.
Why Template Consistency Helps Discovery
Well-templated content clusters with structurally similar, high-quality pages and shares in their semantic-neighbor pool. Consistent template adoption (recipe schema, product schema, FAQ schema) signals semantic category cleanly.
Why Structured Data Drives Expansion Inclusion
Schema.org markup and consistent DOM patterns are part of what structural-similarity analyzers read. Well-marked-up pages cluster reliably with their semantic neighbors.
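To make the structured-data point concrete, the sketch below reads the declared Schema.org type from a page's JSON-LD block so it could be added to a structural fingerprint. Treating the `@type` value as a feature is an assumption made for illustration, and `JsonLdTypes` and `schema_types` are hypothetical names. Only top-level `@type` values are read, and nested `@graph` structures are ignored.

```python
import json
from html.parser import HTMLParser


class JsonLdTypes(HTMLParser):
    """Collects @type values from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.types = set()

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self.in_jsonld:
            return
        self.in_jsonld = False
        try:
            payload = json.loads("".join(self.buffer))
        except json.JSONDecodeError:
            return  # malformed markup contributes no feature
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                declared = item["@type"]
                self.types.update(
                    declared if isinstance(declared, list) else [declared]
                )


def schema_types(html):
    parser = JsonLdTypes()
    parser.feed(html)
    parser.close()
    return parser.types


page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Recipe", "name": "Soup"}'
    "</script></head><body><h1>Soup</h1></body></html>"
)
found = schema_types(page)
```

A page that declares its type explicitly gives a clustering step one more unambiguous feature to match on, alongside its DOM patterns.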
What This Means for SEO
This patent generates query expansions by clustering documents that share structural templates and mining terms that recur across the cluster. SEO implication: consistent templates and structured data help your pages cluster with high-quality semantic neighbors, sharing in their expansion pool.
- Template Consistency Aids Discovery — Documents sharing a structural template cluster together and share a semantic-neighbor pool. Adopting consistent templates (recipe, product, FAQ structures) signals your semantic category cleanly and joins you to relevant clusters.
- Structured Data Drives Expansion Inclusion — Schema markup and consistent DOM patterns are part of what structural analyzers read. Well-marked-up pages cluster reliably with their semantic neighbors, sharing in the terms that drive query expansions toward them.
- Structure Is A Proxy For Semantic Category — Recipe pages, product pages, and biographies share both structure and meaning, so structure stands in for category. Matching the conventional structure of your content type helps the system place you in the right semantic cluster.
- Recurring Cluster Terms Become Your Expansions — Terms that recur across structurally similar pages become expansion candidates for queries. Naturally covering the vocabulary common to your content category aligns you with those generated expansions.
- Avoid Spam-Template Patterns — Clusters dominated by spam templates are filtered out. Using a structure associated with low-quality mass-produced pages risks being grouped with them rather than with quality neighbors.
- Coherent Clusters Produce Better Signal — Low-coherence clusters are filtered and overly large or tiny clusters are bounded. A clear, conventional, coherent structure helps you land in a high-signal cluster rather than a noisy one.
- Consistency Across A Template Compounds — Refreshed per crawl, clustering rewards sites that apply a clean template consistently. Uniform structure across a content type strengthens your membership in its semantic neighborhood.