Generates synthetic queries by reading the HTML structure of documents that already match a seed query, then validates them by performance, turning page structure into retrievable intent.
Patent Overview
- Inventor
- Steven D. Baker
- Assignee
- Google LLC
- Filed
- 2010-11-19
- Granted
- 2013-01-01
- Application Number
- US 12/950,910
The Challenge
Producing Synthetic Queries That Are Actually Useful
Many SEO and IR workflows need synthetic queries that retrieve documents like a known target. Hand-writing them does not scale. Generating them from word frequencies alone produces noise. The system needs a generator that reads document structure and turns it into queries that have predictive power against the live retrieval engine, then filters the candidates by how well they actually perform.
- Word-Frequency Generation Is Noisy — Picking high-TF or high-TF-IDF terms from a target document creates queries that may retrieve the target but lack semantic shape. The queries miss the structural cues that make retrieval focused.
- Structure Carries Intent — HTML tags and structured fragments (titles, headings, table headers, schema markup) tell you what the document considers its load-bearing content. Ignoring that structure throws away the strongest signal.
- Need Validation On Performance — Synthetic queries should be tested against the live engine. Queries that fail to retrieve target-like documents should be discarded rather than carried forward with a low confidence score.
- Site-Mate Pages Share Structure — Documents from the same website tend to share structural conventions. A template derived from one page generalizes to its site-mates, which is the leverage that makes the whole pipeline economical.
- Generic Templates Drown The Output — Without performance validation, the generator emits hundreds of plausible-looking queries per document and the downstream consumer has no way to pick the useful ones. The performance gate is what turns the output from candidates into queries.
Innovation
Templates From HTML Fragments, Then Test Them
The system identifies embedded coding fragments (HTML tags surrounding content) in a structured document, pairs each fragment with a seed query, and generates a query template that contains a generative rule. The template is then applied to other documents on the same site, producing candidate synthetic queries. Each candidate is run against the engine and its performance is measured before promotion.
- Identify Coding Fragments — Scan the document for HTML tags around content. Title tags, headings, structured-data attributes, and table cells are typical fragments. Each is a position from which terms can be drawn.
- Pair Fragments With Seed Query — Take a seed query that already retrieves the document. Pair each fragment with the seed to build template candidates. The pairing encodes which structural position the seed terms came from.
- Build Query Templates With Generative Rules — Each template encodes a generative rule: take terms appearing in this tag position, fill them in, and emit a new query. The rule abstracts the position-to-term mapping so it can apply across many documents.
- Apply Templates To Site-Mate Documents — Run each template against other documents on the same website. The terms that match the template structure produce candidate synthetic queries. Site-mates share structure, so the template fits multiple pages.
- Measure Performance — Issue each candidate against the live engine. Measure how well its result set aligns with the target intent. Performance metrics can include retrieval precision, target inclusion, and result-set diversity.
- Filter And Promote — Keep the synthetic queries that perform above a threshold; discard the rest. The promoted queries become inputs to downstream consumers such as synonym mining, query expansion, or recommendation systems.
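The first four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the tag set `FRAGMENT_TAGS`, the `{slot}` placeholder, and the helper names `extract_fragments`, `build_template`, and `apply_template` are all hypothetical, and the "generative rule" is reduced to a simple substring substitution.

```python
from html.parser import HTMLParser

# Tags treated as "embedded coding fragments" in this sketch (an assumption).
FRAGMENT_TAGS = {"title", "h1", "h2", "th"}


class FragmentExtractor(HTMLParser):
    """Collects the text found inside each tag of interest."""

    def __init__(self):
        super().__init__()
        self.fragments = []    # (tag, text) pairs in document order
        self._open_tag = None  # fragment tag currently being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAGMENT_TAGS:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fragments.append((tag, text))
            self._open_tag = None


def extract_fragments(html):
    parser = FragmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


def build_template(seed_query, fragments):
    """Return (tag, pattern) for the first fragment whose text occurs in the seed.

    The matched text becomes a "{slot}" placeholder and the remaining seed
    terms stay as literals. Returns None when no fragment matches.
    """
    seed = seed_query.lower()
    for tag, text in fragments:
        needle = text.lower()
        if needle in seed:
            return tag, seed.replace(needle, "{slot}", 1)
    return None


def apply_template(template, fragments):
    """Fill the slot from the first fragment that has the template's tag."""
    tag, pattern = template
    for frag_tag, text in fragments:
        if frag_tag == tag:
            return pattern.replace("{slot}", text.lower())
    return None


seed_doc = "<title>ThinkPad X1 | GadgetSite</title><h1>ThinkPad X1</h1>"
site_mate = "<title>Pixel 8 | GadgetSite</title><h1>Pixel 8</h1>"

template = build_template("thinkpad x1 review", extract_fragments(seed_doc))
print(template)                                               # ('h1', '{slot} review')
print(apply_template(template, extract_fragments(site_mate)))  # pixel 8 review
```

The candidate produced here is still unvalidated; it only becomes a synthetic query after it passes the performance test described in the last two steps.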
Structure Plus Performance Gates The Output
Other query-generation approaches either skip structure (and produce noise) or skip performance validation (and produce useless candidates). The patent's contribution is combining both gates, so that promoted queries are grounded in document structure and proven against the engine.
Templates Derive From Real Documents
The templates are not hand-authored; they are derived from documents that the system already knows are responsive to seed queries. The site itself authors the templates by being structurally consistent.
- Structural Source — HTML tags, schema attributes, table cells, and headings are the positions that templates target. These positions carry the strongest authorial signal about what the document is about.
- Performance Gate — Every candidate query is tested against the live engine. Queries that fail to retrieve target-like results are discarded, regardless of how plausible they look on paper.
Structure proposes. Performance disposes.
Technical Foundation
What The System Reads
The technique depends on pages from the same site sharing structural conventions, so a template derived from one document tends to fit its site-mates. The pipeline rests on four concepts:
- Embedded Coding Fragment — A piece of structured content surrounded by HTML tags (title, h1, td, schema attributes). Identifies the position from which terms should be drawn.
- Query Template — A generative rule with at least one variable slot, where the slot is filled from the matching fragment in a target document. Templates are reusable across documents.
- Site-Mate Generalization — The assumption that pages on the same website share enough structural convention that a template derived from one fits the others. This is what makes the pipeline produce many queries per seed.
- Performance Measure — An evaluation of how well the synthetic query retrieves documents similar to the seed document, used as the survival filter before promotion.
Key Insight: The performance gate is what separates this approach from naive structural generation. By requiring candidates to actually retrieve target-like documents, the pipeline filters out the templates that look structurally plausible but fail to capture the underlying intent. Structure is necessary; performance is what makes structure sufficient.
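One way to make the performance measure concrete is to combine two of the metrics named earlier, target inclusion and retrieval precision. The sketch below is an assumption for illustration only: the source does not specify a formula, and the function names, the `k` cutoff, and the `min_precision` threshold are hypothetical.

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are in the relevant set.

    Divides by the number of results actually returned (at most k),
    and returns 0.0 for an empty result list.
    """
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def passes_gate(results, relevant, target, k=10, min_precision=0.5):
    """A candidate survives only if the page it was generated from appears
    in the top k and enough of the top k is target-like."""
    return target in results[:k] and precision_at_k(results, relevant, k) >= min_precision


results = ["a", "b", "c", "d"]   # ranked result IDs for one candidate query
relevant = {"a", "c", "x"}       # documents judged similar to the seed document
print(precision_at_k(results, relevant, k=4))       # 0.5
print(passes_gate(results, relevant, target="a"))   # True
print(passes_gate(results, relevant, target="x"))   # False
```

Requiring both conditions mirrors the point above: a query that looks plausible but does not bring back its own source page, or brings it back among unrelated results, is dropped.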
The Process
End-To-End Pipeline
The full pipeline runs from a seed query and a responsive document to a set of validated synthetic queries that can drive downstream applications.
- Seed Query And Document — Begin with a seed query Q and a document D that Q retrieves. D is the structural source for templates.
- Fragment Extraction — Identify the embedded coding fragments in D: title, h1, h2, schema, table headers. Each fragment is a position that may contain seed terms.
- Template Construction — For each fragment containing seed terms, build a template that captures the position-to-term mapping. The template is parameterized so it can be applied to other documents.
- Site-Mate Application — Apply each template to other documents on the same website. The matching fragments produce candidate synthetic queries.
- Performance Test — Run each candidate against the live engine. Measure retrieval quality against the target intent.
- Promote Survivors — Candidates that pass the performance threshold become validated synthetic queries available to downstream systems.
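The last three stages (site-mate application, performance test, promotion) can be sketched end to end against a stand-in engine. Everything named here is hypothetical: `Template`, `generate_candidates`, `promote`, the `fake_search` stub, and the choice of reciprocal rank as the score are assumptions made for the example, and page fragments are assumed to be already extracted into a tag-to-text mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    tag: str      # fragment position the slot is filled from
    pattern: str  # query pattern containing one "{slot}" placeholder


def generate_candidates(template, site_pages):
    """Yield (url, query) for each site-mate page that has the template's tag."""
    for url, fragments in site_pages.items():
        text = fragments.get(template.tag)
        if text:
            yield url, template.pattern.replace("{slot}", text.lower())


def promote(template, site_pages, search, threshold=0.5, k=3):
    """Keep candidates whose source page ranks well enough in the results.

    Score is the reciprocal rank of the source page within the top k
    (1.0 for rank 1, 0.5 for rank 2, ...) and 0.0 if it is absent.
    """
    promoted = []
    for url, query in generate_candidates(template, site_pages):
        top = search(query)[:k]
        score = 1.0 / (top.index(url) + 1) if url in top else 0.0
        if score >= threshold:
            promoted.append((query, score))
    return promoted


# Stand-in for the live engine: query -> ranked list of URLs.
FAKE_INDEX = {
    "pixel 8 review": ["/phones/pixel-8", "/news/pixel-launch"],
    "galaxy s24 review": ["/news/samsung", "/blog/misc", "/phones/galaxy-s24"],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


site_pages = {
    "/phones/pixel-8": {"h1": "Pixel 8"},
    "/phones/galaxy-s24": {"h1": "Galaxy S24"},
    "/about": {"h1": "Welcome"},
}

promoted = promote(Template("h1", "{slot} review"), site_pages, fake_search)
print(promoted)  # [('pixel 8 review', 1.0)]
```

In this run the Galaxy candidate is generated but scores 1/3 because its page ranks third, and the "welcome review" candidate retrieves nothing, so only one of the three candidates is promoted.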
What This Means for SEO
The query-generation pipeline is an underappreciated reason why HTML structure matters for retrieval at scale. A system built this way reads your tags as well as your text, and that reading shapes which queries it considers your pages relevant for.
- Title And Heading Tags Drive Query Templates — Terms inside title, h1, and h2 tags are first-class candidates for filling query template slots. Generic, non-descriptive headings throw away this signal entirely and prevent your page from participating in the template generation pipeline.
- Structural Consistency Across A Site Helps — When pages of the same type on your site share the same structural pattern (e.g., product pages all use the same title and heading conventions), templates derived from one page generalize to the others. Inconsistency forces the engine to rebuild templates per page and limits site-mate generalization.
- Schema And Microdata Are Template Fuel — Structured data attributes (itemprop, JSON-LD types) are explicit fragment annotations that templates can target. A complete schema implementation is more readable to this pipeline than implicit text.
- Performance Is The Final Gate — Even with strong structure, the system discards candidate queries that fail the performance test. Structural optimization on its own does not survive the gate if the generated queries do not retrieve relevant pages, so invest in both structure and content quality.
- Table Headers Are Query Slots — Comparison tables and product listings often use th tags for column headers. These act as labeled positions that templates can target. Properly marked-up tables feed query templates the engine can apply across your site.
- Boilerplate Headers Hurt You — If every page on your site has the same h1 ("Welcome", "Home", the site name), a template that fills its slot from h1 produces the same query for every page. Site-mate generalization only helps when the tag position is shared and the text inside it differs from page to page.
- Anchor Text Inherits Some Of This — Internal anchor text that describes the page it points to may act as another labeled position. Templates built from anchors would produce queries that mirror your internal linking pattern.
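The boilerplate-heading problem is easy to check for on your own site. The snippet below is a hypothetical audit helper, not something described in the patent: it assumes you already have a mapping from URL to h1 text (for example from a crawl export) and flags h1 values shared by more than a chosen fraction of pages. The name `boilerplate_h1s` and the `max_share` default are illustrative choices.

```python
from collections import Counter


def boilerplate_h1s(h1_by_url, max_share=0.2):
    """Return {h1 text: page count} for h1 values that repeat across pages
    and are used by more than max_share of all pages."""
    counts = Counter(text.strip().lower() for text in h1_by_url.values())
    total = len(h1_by_url)
    return {
        h1: n
        for h1, n in counts.items()
        if n > 1 and n / total > max_share
    }


pages = {
    "/": "Welcome",
    "/about": "Welcome",
    "/contact": "welcome ",
    "/phones/pixel-8": "Pixel 8",
    "/phones/galaxy-s24": "Galaxy S24",
}
print(boilerplate_h1s(pages))  # {'welcome': 3}
```

Pages flagged this way would all fill an h1-based template slot with the same text, which is the case where site-mate generalization stops producing distinct queries.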