Generates a topical taxonomy on the fly from the phrases appearing in result documents, so the system can offer refinement suggestions and group results by sub-topic without a manually curated category structure.
Patent Overview
- Filed: 2004-07-26
- Granted: 2008-09-16
- Application Number: US 10/900,062
The Challenge
Search results often span multiple sub-topics under one parent topic. Users searching a broad term want to drill down, but the system has no built-in taxonomy that anticipates which sub-topics to offer. Manual taxonomies cannot keep up with the live state of the web.
- Manual Taxonomies Cannot Scale — Hand-curated subject categories cannot cover every topic the web touches, and they cannot stay current as topics emerge and evolve. The system needs a taxonomy that builds itself from the live result set.
- Broad Queries Need Sub-Topic Refinement — A query for 'machine learning' could mean theory, libraries, careers, courses, or news. Users want to refine. Without a taxonomy of refinements, the system can only return a single ranked list and force the user to scan it.
- Result Phrases Encode Latent Structure — The phrases that appear in top results reveal the sub-topics that matter. Clustering those phrases produces a data-driven taxonomy that reflects how content is actually organized.
- Refinement Suggestions Must Be Live — What is trending shifts continuously. A refinement set computed monthly misses emerging sub-topics. The taxonomy must be regenerated on demand against the current result corpus.
- Phrase Clusters Must Be Coherent — Raw phrase frequencies are too noisy. The system must cluster related phrases so each refinement label represents a coherent sub-topic, not an arbitrary collection of n-grams.
Innovation
How The System Works
The system described in the patent harvests phrases from the top-ranked result documents for a query, clusters semantically related phrases together, picks a representative label per cluster, and presents the cluster labels as refinement suggestions that drill down into the taxonomy.
- Retrieve Initial Result Set — Run the user's query through the standard ranking pipeline and retrieve the top-K results. These results are the corpus from which the taxonomy will be derived.
- Extract Candidate Phrases — From each result, extract phrases that appear with high relative frequency (terms that are more common in the result set than in the general web corpus). These are the phrases that distinguish this topic from others.
- Cluster Phrases By Co-Occurrence — Phrases that frequently appear together in the same documents are clustered. Each cluster represents a coherent sub-topic. Clustering uses co-occurrence statistics across the result set.
- Pick A Representative Label Per Cluster — For each cluster, the highest-frequency or most-canonical phrase is chosen as the label. The label is what the user sees as a refinement suggestion.
- Score Clusters For Display Priority — Clusters are ranked by relevance, coverage, and diversity. The most-useful refinements are surfaced first; weak or redundant clusters are hidden.
- Surface Refinements In The SERP — Selected cluster labels appear as refinement suggestions alongside the result list. The user can click a refinement to drill into that sub-topic.
- Re-Run On Refinement — When the user clicks a refinement, the system re-runs the query with the refinement appended, harvests new phrases, and generates a new taxonomy. The drill-down repeats at each level.
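The phrase-extraction step above can be sketched as follows. This is a minimal illustration, not the patent's method: the function names, the `min_ratio` threshold, the two-document minimum, and the 0.01 floor for phrases missing from the background table are all assumptions, and `background_freq` stands in for whatever general-corpus statistics a real system would hold.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_candidate_phrases(result_docs, background_freq, min_ratio=3.0, max_n=2):
    """Keep phrases whose document frequency in the result set is at least
    min_ratio times their background document frequency (hypothetical test)."""
    doc_count = Counter()
    for doc in result_docs:
        tokens = doc.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        doc_count.update(seen)  # count each phrase once per document

    total = len(result_docs)
    candidates = {}
    for phrase, count in doc_count.items():
        in_result = count / total
        # Assumed floor: phrases absent from the background table get 0.01
        # so the ratio stays finite.
        background = background_freq.get(phrase, 0.01)
        ratio = in_result / background
        # Require at least two documents so one-off phrases are ignored.
        if ratio >= min_ratio and count >= 2:
            candidates[phrase] = ratio
    return candidates
```

A phrase that appears in every result but is rare in the background table scores highest, which matches the idea that distinguishing phrases are the ones skewed toward the result set.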
Taxonomy Emerges From Results
The patent's load-bearing idea is to derive the taxonomy from the result set itself rather than imposing one from above. The taxonomy follows the live data wherever it goes, capturing emerging sub-topics that no manual classification would have anticipated.
Data-Driven Categories
Manual taxonomies impose a fixed structure on a fluid web. Phrase-derived taxonomies discover the structure latent in the current results. The shift from imposed to discovered structure is the patent's key move.
- Phrases As Structure — The phrases that distinguish a topic also reveal its internal structure. By harvesting them and clustering by co-occurrence, the system reads the topic's natural shape.
- Cluster Labels As Refinements — Each cluster's representative phrase becomes a refinement label. Users see refinements that reflect the actual sub-topics in the current result set, not a pre-defined menu.
- Recursive Drill-Down — When the user refines, the system regenerates the taxonomy for the narrower result set. The taxonomy stays accurate at every level of drill-down because it is computed anew each time.
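The recursive drill-down can be illustrated with a toy loop. Everything here is hypothetical: `search` is a stand-in for the real ranking pipeline, `toy_suggest` replaces the full phrase-clustering step with a simple term count, and `refine` assumes a refinement is applied by appending its unseen terms to the query.

```python
from collections import Counter

def search(corpus, query):
    """Toy retrieval: return documents containing every query term."""
    terms = query.lower().split()
    return [d for d in corpus if all(t in d.lower().split() for t in terms)]

def refine(query, label):
    """Append the clicked label to the query, skipping terms already present."""
    have = query.lower().split()
    extra = [t for t in label.lower().split() if t not in have]
    return " ".join([query] + extra)

def toy_suggest(results, query, top_n=3):
    """Stand-in for taxonomy generation: most common non-query terms."""
    have = set(query.lower().split())
    counts = Counter(
        t for d in results for t in set(d.lower().split()) if t not in have
    )
    return [t for t, _ in counts.most_common(top_n)]

def drill_down(corpus, query, clicks, suggest):
    """Follow a sequence of clicked refinements, regenerating suggestions
    from the narrower result set at every level."""
    trail = []
    for label in [None] + list(clicks):
        if label is not None:
            query = refine(query, label)
        results = search(corpus, query)
        trail.append((query, len(results), suggest(results, query)))
    return trail
```

Each entry in the returned trail records the query at that level, how many results it matched, and the suggestions computed from those results alone, so the suggestions at a deeper level never include sub-topics that the narrower result set no longer contains.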
Technical Foundation
The patent specifies the phrase-extraction infrastructure, the clustering algorithms, and the refinement-ranking model.
- Phrase Extraction Pipeline — Phrases are extracted from result documents using statistical tests that compare in-result frequency to general-corpus frequency. Phrases with strong skew are kept as candidates.
- Co-Occurrence Matrix — A phrase-by-document co-occurrence matrix is built for the result set. The matrix is the input to the clustering step and captures the latent sub-topic structure.
- Clustering Algorithm — Hierarchical or graph-based clustering groups phrases by co-occurrence similarity. The algorithm produces a tree of clusters that can be cut at any level for taxonomy generation.
- Cluster Labeling — For each cluster, the highest-frequency or most-discriminating phrase becomes the label. Tie-breaking uses canonicalization rules to pick consistent labels across queries.
- Cluster Quality Scoring — Clusters are scored on cohesion (internal phrase relatedness), distinctness (separation from other clusters), and coverage (fraction of result documents the cluster touches). High-quality clusters are surfaced as refinements.
- Caching For Common Queries — Common queries have cached taxonomies that refresh on a schedule. Long-tail queries compute the taxonomy on demand. Caching keeps the latency budget manageable.
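The co-occurrence, clustering, and labeling components above can be sketched together. This is one possible graph-based variant under stated assumptions: Jaccard overlap of document sets as the similarity measure, a fixed `min_similarity` threshold, connected components as clusters, and an alphabetical tie-break standing in for the canonicalization rules, which the text does not detail.

```python
from itertools import combinations

def phrase_doc_sets(result_docs, phrases):
    """Map each phrase to the set of document indices containing it
    (a sparse view of the phrase-by-document matrix)."""
    padded = [f" {doc.lower()} " for doc in result_docs]
    return {
        p: {i for i, doc in enumerate(padded) if f" {p} " in doc}
        for p in phrases
    }

def jaccard(a, b):
    """Overlap of two document sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_phrases(doc_sets, min_similarity=0.5):
    """Link phrases whose document sets overlap strongly, then take
    connected components with union-find."""
    parent = {p: p for p in doc_sets}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(doc_sets, 2):
        if jaccard(doc_sets[a], doc_sets[b]) >= min_similarity:
            parent[find(a)] = find(b)

    clusters = {}
    for p in doc_sets:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())

def label_cluster(cluster, doc_sets):
    """Pick the phrase found in the most documents; ties break alphabetically
    (an assumed stand-in for canonicalization)."""
    return min(cluster, key=lambda p: (-len(doc_sets[p]), p))
```

Connected components give a flat clustering rather than the tree the text mentions; a hierarchical method would be needed to cut the taxonomy at different depths.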
The Process
The taxonomy pipeline runs in the query path, with caching for common queries. For cached queries the added latency is small because the heavy lifting happens once per cache cycle; long-tail queries pay the full computation cost on demand. Either way, the response includes refinements alongside the results.
- Receive Query And Retrieve Results — Standard query processing retrieves the top-K results from the ranking pipeline. These results are passed to the taxonomy generator.
- Check Cache — If the taxonomy for this query is cached and still fresh, use it directly. Common queries hit the cache; long-tail queries fall through to compute.
- Extract Phrases — From the result set, extract candidate phrases using the relative-frequency test. The result is a list of distinguishing phrases for this topic.
- Build Co-Occurrence Matrix — Construct the phrase-document co-occurrence matrix. The matrix is the input to clustering.
- Run Clustering — Apply the clustering algorithm to produce a hierarchy of phrase clusters. Cut at the appropriate level for the desired number of refinements.
- Score And Rank Refinements — Compute cluster quality scores. Sort by score. Pick the top-N as the displayed refinements.
- Render In SERP And Cache — The selected refinements are rendered alongside the result list. The computed taxonomy is cached for the next user issuing the same query.
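The cache-check and cache-fill steps of this flow might look like the sketch below. The class and function names, the time-to-live policy, and the representation of a taxonomy as a list of `(label, score)` pairs are assumptions made for illustration; `build_taxonomy` is a hypothetical callable covering the extract, cluster, and score steps.

```python
import time

class TaxonomyCache:
    """In-memory cache with a per-entry time-to-live (TTL)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        stored_at, taxonomy = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[query]  # stale: force a recompute
            return None
        return taxonomy

    def put(self, query, taxonomy):
        self._entries[query] = (self.clock(), taxonomy)

def refinements_for(query, results, cache, build_taxonomy, top_n=5):
    """Return up to top_n refinement labels, using the cache when fresh."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    taxonomy = cache.get(key)
    if taxonomy is None:
        taxonomy = build_taxonomy(results)  # hypothetical: extract, cluster, score
        cache.put(key, taxonomy)
    ranked = sorted(taxonomy, key=lambda item: item[1], reverse=True)
    return [label for label, _score in ranked[:top_n]]
```

Injecting the clock makes the expiry behavior testable without sleeping; a production cache would also need size limits and the result-shift invalidation that a pure TTL does not provide.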
Quality Control
Taxonomy generation produces variable quality across queries. The patent specifies safeguards that keep low-quality taxonomies from being displayed.
- Minimum Cluster Quality Threshold — Clusters below a minimum quality score are not shown. If no clusters pass the threshold, the system displays no refinements rather than surfacing weak ones that would confuse users.
- Diversity Enforcement — Selected clusters must be sufficiently distinct from each other. Redundant refinements offering effectively the same drill-down are filtered so the displayed set covers diverse sub-topics.
- Phrase Filter For Spam — Spammy or boilerplate phrases that appear frequently in low-quality content are filtered out of the candidate phrase set before clustering. This prevents spam patterns from influencing the taxonomy.
- Cache Freshness Monitoring — Cached taxonomies are revalidated periodically. If the underlying result set has shifted substantially, the cache is invalidated and a fresh taxonomy is computed.
- Manual Override For Sensitive Queries — Some queries (medical, legal, financial) require curated refinement sets rather than algorithmic ones. The pipeline can substitute curated taxonomies for these classes.
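The quality threshold and diversity safeguards can be sketched as a greedy selection. The scoring formula is an assumption (an equal-weight average of cohesion, distinctness, and coverage), as are the `min_score` and `max_overlap` values and the cluster representation as a dict with `label`, `docs`, and a precomputed `cohesion` value.

```python
def jaccard(a, b):
    """Overlap of two document sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score_cluster(cluster, others, total_docs):
    """Average cohesion, distinctness, and coverage. Equal weights are an
    assumption; the text names the factors but gives no formula."""
    coverage = len(cluster["docs"]) / total_docs
    overlap = max((jaccard(cluster["docs"], o["docs"]) for o in others), default=0.0)
    distinctness = 1.0 - overlap
    return (cluster["cohesion"] + distinctness + coverage) / 3.0

def select_refinements(clusters, total_docs, min_score=0.4, max_overlap=0.5, top_n=5):
    """Return labels of the best clusters that pass the quality threshold
    and do not overlap too much with an already chosen cluster."""
    scored = []
    for c in clusters:
        others = [o for o in clusters if o is not c]
        scored.append((score_cluster(c, others, total_docs), c))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen = []
    for score, c in scored:
        if score < min_score:
            break  # everything after this is weaker still
        if all(jaccard(c["docs"], k["docs"]) <= max_overlap for k in chosen):
            chosen.append(c)
        if len(chosen) == top_n:
            break
    return [c["label"] for c in chosen]
```

When no cluster clears `min_score` the function returns an empty list, which corresponds to showing no refinements rather than weak ones.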
Real-World Application
Phrase-derived taxonomy generation of this kind is most visible in modules such as Google's 'Related searches' and 'People also ask'. Similar primitives appear to inform category-suggestion features elsewhere, including at Bing and Wikipedia.
- Per-query Taxonomy Granularity — Each query produces its own taxonomy, customized to the result set rather than pulled from a global category tree. Granularity is set by the data itself.
- Recursive Drill-Down Depth — Each refinement triggers a fresh taxonomy generation, so users can drill multiple levels deep with the taxonomy regenerating at each step to reflect the narrower context.
- Live Update Cadence — Cached taxonomies refresh on schedule; long-tail queries generate fresh taxonomies on demand. The taxonomy stays close to the current state of the web.
Why Related Searches Matter For SEO
The 'Related searches' phrases are the refinement labels generated by this patent's primitives. Owning the phrase that becomes a refinement label captures a second wave of traffic from the same root query. The SEO practice of mining refinements for content gaps traces back to this mechanism.
Why Phrase Coverage Trumps Keyword Density
The patent makes phrase coverage (covering many distinct sub-topic phrases) the structural lever for visibility, not keyword repetition. Content that hits the canonical phrase set for a topic appears in more taxonomy branches and earns more refinement-driven traffic.
What This Means for SEO
Search systems can build taxonomies on the fly from result phrases, so the phrases you use, and the way you cluster them, shape how your content slots into machine-built categories.
- Phrase Coverage Drives Category Inclusion — Pages that consistently use the canonical phrase set for a topic get clustered into the right machine-generated category. Audit the phrases your competitors share: the ones you are missing are the gates into the taxonomy node.
- Sub-Topic Phrases Open New Doorways — When the system auto-generates subcategories, pages with the right sub-topic phrase become the canonical entry for that subcategory. Sub-topic phrase coverage is how you win narrow-but-traffic-rich niches.
- Refined Queries Are Discovery Surfaces — Auto-generated taxonomy labels become refinement suggestions in the SERP. Owning the phrase that becomes a refinement label puts you in front of a second wave of traffic from the same root query.
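The phrase audit suggested above can be approximated with a small script. This is a rough sketch under assumptions: the candidate `phrases` list is supplied by you, matching is plain case-insensitive substring search, and `min_share` (the fraction of competitor pages that must use a phrase) is an illustrative parameter.

```python
def missing_shared_phrases(own_page, competitor_pages, phrases, min_share=0.5):
    """List candidate phrases used by at least min_share of competitor pages
    but absent from own_page, most widely shared first."""
    if not competitor_pages:
        return []
    own = own_page.lower()
    pages = [page.lower() for page in competitor_pages]
    gaps = []
    for phrase in phrases:
        p = phrase.lower()
        share = sum(p in page for page in pages) / len(pages)
        if share >= min_share and p not in own:
            gaps.append((phrase, share))
    # Sort by share descending, then alphabetically for stable output.
    return sorted(gaps, key=lambda g: (-g[1], g[0]))
```

The output is a list of `(phrase, share)` pairs, so a phrase used by every competitor but missing from your page appears first with a share of 1.0.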