Selects passages for generative search responses based on diversity and completeness criteria, ensuring the AI Overview or SGE answer is comprehensive and non-redundant rather than echoing the same point from multiple sources.
Patent Overview
- Inventor: Nitin Gupta
- Assignee: Google LLC
- Filed: 2024-08-23
- Granted: 2026-05-08 (published application)
- Application Number: US 18/812,540
The Challenge
Naive Retrieval For Generative Search Produces Redundant Answers
When a generative model writes a long-form answer (AI Overviews, Search Generative Experience), the input passages it conditions on shape the answer's quality. If the retrieved passages all repeat the same fact from different sources, the synthesized answer becomes repetitive and incomplete. If they cover the topic from multiple angles, the answer is comprehensive. The retrieval stage must control passage diversity explicitly, not just relevance.
- Relevance Alone Selects Similar Passages — Top-k retrieval by relevance score gravitates to passages that all express the same dominant point. The generative model then has nothing new to add and the answer is redundant.
- Diversity Without Relevance Is Off-Topic — Selecting passages purely for diversity without a relevance floor pulls in tangential content. The answer wanders away from the user's actual question.
- Completeness Needs Cross-Source Coverage — A comprehensive answer covers multiple aspects of the question. The retrieval has to actively ensure coverage across the aspects, not assume top relevance scores will produce it.
- Inclusion Criteria Must Be Tunable — Different query types need different diversity/completeness balances. Factual queries need tight relevance; exploratory queries need broader coverage. The criteria must be query-aware.
- Token Budget Caps Passage Count — The generative model has a finite context window. Only so many passages can be included. The selection has to pick the highest-value subset, not just top-k by relevance.
Innovation
Select For Relevance Plus Diversity Plus Completeness
The system identifies the most relevant passage in each top-ranking document for the query, then selects, from those per-document passages, the ones that meet the inclusion criteria. The criteria combine a minimum relevance threshold with maximization of diversity against passages already selected. The selected passage set is then sent to the generative model as conditioning input.
- Run Standard Retrieval — Use existing search ranking to identify the top-ranking documents for the query. This is the candidate document pool.
- Identify Most-Relevant Passage Per Document — Within each top-ranking document, extract the single most relevant passage. The passage is the document's best contribution to answering the query.
- Apply Relevance Threshold — Filter the most-relevant passages by a minimum relevance score. Passages below threshold are rejected outright; they would dilute the answer.
- Compute Diversity Against Selected Set — For each candidate passage, compute its diversity against passages already in the selected set. Diversity can be measured via embedding distance, lexical overlap, or claim-level comparison.
- Maximize Diversity At Inclusion — Greedily add passages that maximize diversity gain while still meeting the relevance threshold. Skip passages that are near-duplicates of already-selected content.
- Cap By Token Budget — Stop adding passages when the token budget for the generative model's context window is reached. Prioritize the highest-diversity-gain additions until the budget runs out.
- Feed Selected Passages To Generator — Send the selected passage set as conditioning input to the generative model. The model synthesizes the long-form answer from the curated, diverse evidence.
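The steps above can be sketched as a small greedy loop. This is a minimal illustration under stated assumptions, not the patent's implementation: `Passage`, `jaccard`, and `select_passages` are hypothetical names, similarity is plain word overlap (one of the options the text mentions), tokens are approximated by whitespace-separated words, and the thresholds are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str       # source document the passage was extracted from
    text: str
    relevance: float  # query-passage relevance, assumed to lie in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Lexical overlap between two texts: shared words / all words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_passages(candidates, floor=0.5, min_gain=0.3, token_budget=60):
    """Greedy selection: relevance floor, then diversity gain, then budget."""
    # Candidates are assumed to be one passage per document already.
    pool = [p for p in candidates if p.relevance >= floor]
    selected, used = [], 0

    def gain(p):
        # Assumption: with nothing selected yet, the gain is maximal (1.0).
        return 1.0 - max((jaccard(p.text, s.text) for s in selected), default=0.0)

    while pool:
        # Highest diversity gain wins; ties are broken by relevance.
        best = max(pool, key=lambda p: (gain(p), p.relevance))
        if gain(best) < min_gain:
            break  # every remaining candidate is a near-duplicate
        pool.remove(best)
        cost = len(best.text.split())  # crude token estimate
        if used + cost > token_budget:
            continue  # does not fit; a shorter candidate still might
        selected.append(best)
        used += cost
    return selected


candidates = [
    Passage("a", "Solar panels convert sunlight into electricity using photovoltaic cells", 0.95),
    Passage("b", "Solar panels convert sunlight into electricity using photovoltaic cells efficiently", 0.93),
    Passage("c", "Installation costs vary by roof size and local labor rates", 0.70),
    Passage("d", "Bananas are rich in potassium", 0.10),
]
picked = select_passages(candidates)
print([p.doc_id for p in picked])  # prints ['a', 'c']
```

Passage `b` has the second-highest relevance but is skipped as a near-duplicate of `a`, while `d` never enters the pool because it fails the relevance floor.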
Diversity And Completeness As First-Class Retrieval Goals
The patent reframes retrieval for generative search. Where classical retrieval optimizes only for relevance, generative-search retrieval optimizes for relevance plus diversity plus completeness simultaneously. The reframing is what produces well-rounded AI Overviews instead of repetitive ones.
Three Criteria Together
Relevance keeps the answer on-topic. Diversity prevents redundancy. Completeness ensures coverage across the question's aspects. All three are enforced at the retrieval stage.
- Relevance Floor — Every selected passage must clear a minimum relevance score. Diversity does not override relevance; it is layered on top.
- Diversity Maximization — Greedy selection maximizes diversity gain at each step. Near-duplicate passages get skipped even if relevance is high.
- Completeness Coverage — The selection actively covers multiple aspects of the query. Coverage gaps are detected and filled where possible.
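One way to picture the completeness criterion is a coverage check over query aspects. The sketch below is hypothetical: the summary does not say how aspects are represented, so here each aspect is a hand-written keyword set, and `covered_aspects` and `fill_gaps` are illustrative names. The reserve passages are assumed to have already cleared the relevance floor.

```python
def covered_aspects(passages, aspects):
    """Return the names of aspects mentioned by at least one passage.

    `aspects` maps an aspect name to keywords that signal it (a stand-in
    for whatever aspect model a real system would use).
    """
    covered = set()
    for text in passages:
        words = set(text.lower().split())
        for name, keywords in aspects.items():
            if words & keywords:
                covered.add(name)
    return covered


def fill_gaps(selected, reserve, aspects):
    """Add reserve passages that cover aspects the selected set still misses."""
    result = list(selected)
    missing = set(aspects) - covered_aspects(result, aspects)
    for text in reserve:
        newly = covered_aspects([text], aspects) & missing
        if newly:
            result.append(text)
            missing -= newly
        if not missing:
            break
    return result, missing


aspects = {
    "cost": {"price", "cost", "costs"},
    "lifespan": {"years", "lifespan"},
    "maintenance": {"cleaning", "maintenance"},
}
selected = ["Panel costs depend on system size"]
reserve = ["Typical cost is falling every year", "Most panels last 25 years"]

result, missing = fill_gaps(selected, reserve, aspects)
print(result)   # the lifespan passage is added, the second cost passage is not
print(missing)  # prints {'maintenance'}
```

The gap that no reserve passage can fill is reported back instead of being papered over, which matches the "filled where possible" wording.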
Generative search retrieval is curated, not just ranked.
Technical Foundation
What The Selector Computes
Per-passage decisions consider relevance to the query and diversity relative to passages already selected.
- Most-Relevant Passage Per Document — From each top-ranking document, the single passage that scores highest against the query. One passage per document keeps the per-source contribution bounded.
- Relevance Score — Standard query-passage relevance, often from a learned ranking model. Must exceed a configured floor for the passage to be considered.
- Diversity Score — Derived from the pairwise similarity between the candidate passage and each already-selected passage. Lower similarity means higher diversity, which is preferred.
- Inclusion Criteria — The composite decision rule: relevance threshold AND diversity gain above per-step minimum AND remaining token budget. Tunable per query type.
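A minimal sketch of two of the pieces listed above: picking one passage per document and tuning the inclusion criteria per query type. Everything named here is an assumption for illustration: `term_overlap` is a toy stand-in for a learned ranking model, and the preset numbers in `CRITERIA_BY_QUERY_TYPE` are invented, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InclusionCriteria:
    relevance_floor: float
    min_gain: float
    token_budget: int


# Hypothetical presets: tighter relevance for factual queries,
# broader coverage for exploratory ones.
CRITERIA_BY_QUERY_TYPE = {
    "factual": InclusionCriteria(relevance_floor=0.7, min_gain=0.2, token_budget=300),
    "exploratory": InclusionCriteria(relevance_floor=0.4, min_gain=0.4, token_budget=800),
}


def term_overlap(query: str, passage: str) -> float:
    """Toy relevance: fraction of query words that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def best_passage_per_document(query, documents):
    """Map each doc id to the (passage, score) pair with the highest relevance."""
    best = {}
    for doc_id, passages in documents.items():
        if not passages:
            continue
        top = max(passages, key=lambda text: term_overlap(query, text))
        best[doc_id] = (top, term_overlap(query, top))
    return best


def eligible(best, criteria):
    """Doc ids whose best passage clears the relevance floor."""
    return [d for d, (_, score) in best.items() if score >= criteria.relevance_floor]


query = "solar panel lifespan"
documents = {
    "doc1": ["Our company was founded in 1999",
             "The typical solar panel lifespan is 25 years"],
    "doc2": ["A solar panel needs little upkeep",
             "Contact us for a quote"],
}
best = best_passage_per_document(query, documents)
print(eligible(best, CRITERIA_BY_QUERY_TYPE["factual"]))      # prints ['doc1']
print(eligible(best, CRITERIA_BY_QUERY_TYPE["exploratory"]))  # prints ['doc1', 'doc2']
```

The same candidate set yields a narrower pool under the factual preset and a wider one under the exploratory preset, which is the sense in which the criteria are query-aware.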
Quality Metrics
- Diversity Gain — One minus the candidate's highest similarity to any already-selected passage, so the gain is high only when even the closest selected passage is dissimilar. Drives greedy selection at each step.
div_gain(p, S) = 1 - max( sim(p, s) for s in S )
- Composite Inclusion Score — Both conditions must hold. The patent's contribution is enforcing both simultaneously rather than picking top-k by relevance alone.
include(p) = relevance(p) >= floor AND div_gain(p, S) >= min_gain
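The two formulas can be checked on hand-made numbers. In this sketch `sim` is assumed to be cosine similarity over embedding vectors, the 2-D vectors are invented for the example, the `floor` and `min_gain` defaults are arbitrary, and treating the gain as 1.0 when nothing is selected yet is an assumption, since the formula's `max` is undefined for an empty set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def div_gain(p, selected):
    """1 - max similarity to any selected vector; 1.0 if none is selected."""
    return 1.0 - max((cosine(p, s) for s in selected), default=0.0)


def include(relevance, p, selected, floor=0.5, min_gain=0.25):
    """Both conditions must hold: relevance floor AND minimum diversity gain."""
    return relevance >= floor and div_gain(p, selected) >= min_gain


selected = [(1.0, 0.0)]   # embedding of a passage already chosen
near_dup = (4.0, 3.0)     # cosine 0.8 to the selected passage
new_angle = (0.0, 1.0)    # orthogonal to the selected passage

print(include(0.9, near_dup, selected))   # False: gain of about 0.2 is below min_gain
print(include(0.9, new_angle, selected))  # True: relevant and maximally different
print(include(0.3, new_angle, selected))  # False: diverse but below the relevance floor
```

The third case shows why diversity is layered on top of relevance rather than traded against it: a highly distinct passage is still rejected when it fails the floor.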
Key Insight: The diversity-aware selection is what makes AI Overviews readable. Without it, the synthesized answer would echo the same point three times from different sources because top relevance scores cluster on the most common phrasing of an answer. The diversity layer surfaces nuance, edge cases, and complementary aspects that a pure relevance ranker would never include.
The Process
End-To-End Generative Search Retrieval
The selection runs after standard retrieval and before generative synthesis.
- Standard Retrieval — Run the query through normal ranking to produce top-k documents.
- Per-Document Passage Extraction — From each top document, identify the single most-relevant passage. Discard the rest of the document for this synthesis.
- Relevance Filtering — Drop passages below the relevance floor.
- Greedy Diverse Selection — Iteratively add the passage that maximizes diversity gain while clearing the relevance floor. Stop when the token budget is exhausted or no remaining passage meets the criteria.
- Generative Synthesis — Pass the selected passage set as context to the generative model. The model synthesizes the long-form answer from the diverse evidence.
- Cite And Surface — Each selected passage's source document is cited in the rendered answer, preserving attribution to the underlying sources.
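The last two steps, handing passages to the generator and citing their sources, can be sketched as follows. The prompt layout, the `[n]` marker convention, the function names, and the example.com URLs are all hypothetical; no model is called, and the `answer` string is a hand-written stand-in for generated text.

```python
import re


def build_context(query, selected):
    """Format selected passages as numbered evidence and keep a citation map.

    `selected` is a list of (source_url, passage_text) pairs already chosen
    by the selection stage.
    """
    lines = [f"Question: {query}", "Evidence:"]
    citations = {}
    for n, (url, text) in enumerate(selected, start=1):
        lines.append(f"[{n}] {text}")
        citations[n] = url
    lines.append("Answer using only the evidence above and cite sources as [n].")
    return "\n".join(lines), citations


def cited_sources(answer, citations):
    """Return source URLs for the [n] markers in an answer, in first-use order."""
    seen = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        url = citations.get(int(marker))
        if url and url not in seen:
            seen.append(url)
    return seen


selected = [
    ("https://example.com/a", "Most panels last about 25 years"),
    ("https://example.org/b", "Panels need little upkeep beyond cleaning"),
]
prompt, citations = build_context("how long do solar panels last", selected)
print(prompt)

answer = "Panels last about 25 years [1] and need little upkeep [2][1]."
print(cited_sources(answer, citations))
# prints ['https://example.com/a', 'https://example.org/b']
```

Keeping the citation map alongside the prompt is what lets each statement in the rendered answer be traced back to the document its passage came from.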
What This Means for SEO
This is the most recent Gupta patent and one of the most consequential for the current SEO era. It describes how a system such as AI Overviews or Search Generative Experience can pick which pages get cited in the synthesized answer.
- One Passage Per Page Is The Selection Unit — AI Overviews pull one most-relevant passage per document. Your page's best passage determines whether you're included, not the page as a whole. Every page should have a single clearly-delineated passage that is the canonical answer to its target query.
- Diversity Beats Sameness — Pages that say the same thing as competing pages compete on relevance score alone. Pages that add a distinct angle, perspective, or claim get selected for the diversity dimension even if their relevance is slightly lower.
- Cover Aspects Other Pages Miss — The selection actively wants to cover multiple aspects of a question. Content that addresses an under-served aspect of a topic (edge case, counterargument, specific scenario) is structurally advantaged for AI Overview inclusion.
- Above-The-Fold Answer Format Wins — The most-relevant passage extractor goes to the top of the document first. Burying your best answer mid-page reduces the chance of being picked. Front-load the canonical answer.
- Multiple Top-Ranking Pages Get Cited Together — AI Overviews cite multiple sources, not just rank-1. If you can crack into the top-ranking document set for a query, your passage has a real chance of being selected even from a non-rank-1 position.
- Distinct Phrasing Helps Diversity Score — When your page expresses the answer in distinct phrasing from competitors (without changing the underlying facts), you improve your diversity gain. Echoing competitor wording lowers your odds.
- Comprehensive Coverage Loses To Focused Coverage — Long do-everything pages have weaker per-aspect passages because each aspect is diluted across the page. Tighter pages with focused coverage of one aspect produce stronger per-passage relevance and clearer diversity signals.
- Citation Visibility Compounds With AI Overview Inclusion — Being cited in an AI Overview surfaces your brand alongside the answer, even when the user does not click through. Optimizing for AI Overview inclusion is brand surface, not just CTR.