Ranks documents by the information they add over what the user has already seen, not by absolute relevance alone, so the SERP becomes a sequence of complementary perspectives rather than a list of mostly redundant top-relevance results.
Patent Overview
- Inventors
- Victor Carbune, Pedro Gonnet Anders
- Filed
- 2018-10-18
- Published
- 2020-04-23 (as WO 2020/081082 A1)
- Application Number
- PCT/US2019/056654
The Challenge
Standard ranking returns the top-relevance documents, which often overlap heavily in content. A user who reads the first result gains less from the second because both say similar things. Ranking by information gain instead orders results so that each one adds something the earlier ones did not.
- Top-Relevance Results Often Duplicate — For many queries, the top ten results say similar things in slightly different words. The user does not gain much from result two if result one already answered the query.
- Real Value Comes From New Information — A page that adds a distinct angle, dataset, or perspective is more valuable to a user who has already seen the top result. Pure relevance ranking misses this.
- Information Gain Is Context-Dependent — Whether a page adds information depends on what the user has already consumed. The system needs to model the user's information state to compute gain.
- Gain Estimation Must Be Cheap — Computing gain for every candidate against every prior result is expensive. The system needs efficient estimators that scale to web traffic.
- Gain And Relevance Must Combine — Pure gain optimization could surface off-topic novelty. The system must combine information gain with relevance so results are both new and relevant.
Innovation
How The System Works
The system estimates each candidate document's information gain relative to a model of what the user has already seen, combines gain with relevance into a composite score, and ranks results by the composite so the top-of-SERP fills with complementary rather than duplicative content.
- Retrieve Candidate Set — Standard retrieval produces top candidates by relevance. The candidate set is the starting point for information-gain reranking.
- Model User Information State — Start with an empty information state. As each result is added, update the state to reflect what the user would now know.
- Estimate Gain Per Candidate — For each unselected candidate, estimate how much new information it adds beyond the current state. Estimation uses semantic similarity, entity coverage, and content novelty.
- Combine Gain With Relevance — Composite score is a weighted combination of gain and relevance. Weights are calibrated per query type.
- Pick Next Result By Composite Score — Greedy selection picks the highest-composite-scoring candidate. The chosen candidate is added to the result list and to the information state.
- Repeat Until SERP Is Filled — Iterate the selection until the SERP slot count is met. Each iteration uses the updated information state, so subsequent picks complement prior ones.
- Render Diversified SERP — The final ranked list goes to the renderer. Users see a SERP where each result adds genuinely new information.
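The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `Candidate` type, the `facts` set used as a stand-in for page content, and the fixed `gain_weight` of 0.5 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    relevance: float      # assumed to be normalized to [0, 1]
    facts: frozenset      # hypothetical stand-in for what the page covers


def information_gain(candidate, seen_facts):
    """Fraction of the candidate's facts the user has not seen yet."""
    if not candidate.facts:
        return 0.0
    return len(candidate.facts - seen_facts) / len(candidate.facts)


def greedy_rerank(candidates, slots, gain_weight=0.5):
    """Fill the SERP one slot at a time by composite score."""
    remaining = list(candidates)
    seen_facts = set()    # the information state starts empty
    ranked = []
    while remaining and len(ranked) < slots:
        best = max(
            remaining,
            key=lambda c: gain_weight * information_gain(c, seen_facts)
            + (1.0 - gain_weight) * c.relevance,
        )
        ranked.append(best)
        remaining.remove(best)
        seen_facts |= best.facts   # the user now knows these facts
    return ranked


candidates = [
    Candidate("a", 0.95, frozenset({"f1", "f2", "f3"})),
    Candidate("b", 0.93, frozenset({"f1", "f2", "f3"})),  # near-duplicate of "a"
    Candidate("c", 0.80, frozenset({"f4", "f5"})),        # a different angle
]
print([c.doc_id for c in greedy_rerank(candidates, slots=3)])
# prints ['a', 'c', 'b']
```

In this toy run the near-duplicate "b" drops below the less relevant but novel "c" once "a" has been added to the information state.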
Greedy Information-Gain Reranking
The patent's load-bearing pattern is greedy selection driven by information gain. Each result is picked to maximize the user's incremental knowledge given everything already shown.
Maximize Marginal Value
Each SERP slot should add as much value as possible given prior slots. The greedy gain rule operationalizes this principle directly.
- Information State Modeling — The user's evolving information state captures what they know after each result. Subsequent picks complement, not duplicate.
- Gain Estimation — Semantic similarity, entity coverage, and content novelty produce a gain score per candidate. Cheaper estimators enable real-time reranking.
- Combined Composite Score — Gain combines with relevance via weighted sum. Pure gain would surface off-topic novelty; pure relevance would surface duplicates. The weighted sum keeps results both on-topic and non-redundant.
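One way to write the greedy rule down, using notation that is assumed here and not taken from the patent text: let $C$ be the candidate set for query $q$, $S_t$ the information state after $t$ results have been placed, and $\lambda \in [0, 1]$ the gain weight.

```latex
\mathrm{score}(d \mid S_t) = \lambda \,\mathrm{gain}(d \mid S_t) + (1 - \lambda)\,\mathrm{rel}(d, q)

d_{t+1} = \arg\max_{d \,\in\, C \setminus \{d_1, \dots, d_t\}} \mathrm{score}(d \mid S_t),
\qquad
S_{t+1} = \mathrm{update}(S_t, d_{t+1}), \quad S_0 = \varnothing
```

Setting $\lambda = 0$ recovers plain relevance ranking, and $\lambda = 1$ ranks by novelty alone.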
Technical Foundation
The patent specifies the candidate retrieval, the information state model, the gain estimator, the composite scoring, and the greedy selection.
- Candidate Retrieval — Standard text-relevance retrieval produces top candidates. Candidate count is configurable; typical values are 50 to 200.
- Information State Representation — Information state is a structured representation of content covered so far: entities mentioned, claims made, perspectives represented. The state updates incrementally as results are added.
- Gain Estimator — Computes how much a candidate adds beyond the current state. Uses embedding similarity, entity-coverage differences, and topical-novelty signals.
- Composite Score Function — Weighted sum of gain and relevance. Weights are tuned per query type so news weighs gain less and reference weighs gain more.
- Greedy Selection Loop — Iteratively picks the highest-composite candidate, updates the information state, and repeats. Each iteration scans the remaining candidates once, so filling k slots from n candidates takes on the order of k × n gain evaluations.
- Caching For Common Queries — For common queries, partial result lists with stable composites are cached. Serving these head queries from cache leaves more of the latency budget for long-tail queries, which must be reranked live.
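A rough sketch of how the information state, the gain estimator, and the per-query-type composite might fit together. Everything named here is an assumption made for illustration: the `Document` and `InformationState` types, the equal 0.5/0.5 blend of entity novelty and embedding novelty, and the weight table are not specified in the source.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    embedding: list    # dense vector from some upstream encoder (assumed)
    entities: set      # entities mentioned on the page


@dataclass
class InformationState:
    entities: set = field(default_factory=set)
    embeddings: list = field(default_factory=list)

    def add(self, doc):
        """Incrementally record what a newly shown result covers."""
        self.entities |= doc.entities
        self.embeddings.append(doc.embedding)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_gain(doc, state):
    """Average of entity novelty and embedding novelty, each in [0, 1]."""
    if doc.entities:
        entity_novelty = len(doc.entities - state.entities) / len(doc.entities)
    else:
        entity_novelty = 0.0
    max_sim = max((cosine(doc.embedding, e) for e in state.embeddings), default=0.0)
    embedding_novelty = 1.0 - min(1.0, max(0.0, max_sim))
    return 0.5 * entity_novelty + 0.5 * embedding_novelty


# Hypothetical weights: news weighs gain less, reference weighs it more.
GAIN_WEIGHT = {"news": 0.2, "reference": 0.6}


def composite(doc, relevance, state, query_type):
    w = GAIN_WEIGHT.get(query_type, 0.4)
    return w * estimate_gain(doc, state) + (1.0 - w) * relevance


state = InformationState()
state.add(Document("a", [1.0, 0.0], {"x", "y"}))
duplicate = Document("b", [1.0, 0.0], {"x", "y"})
fresh = Document("c", [0.0, 1.0], {"z"})
print(estimate_gain(duplicate, state))  # prints 0.0
print(estimate_gain(fresh, state))      # prints 1.0
```

Taking the maximum similarity against shown results, not the mean, means a single near-identical result is enough to drive embedding novelty toward zero.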
The Process
The pipeline runs in the SERP composition path. The added latency over standard ranking is small because reranking operates on the already-retrieved candidate set.
- Retrieve Candidates — Standard retrieval produces top-K candidates. K is sized for downstream reranking efficiency.
- Initialize State — Start with an empty information state. Because every candidate has maximal gain against an empty state, the first selection is the highest-relevance candidate.
- Add First Result To State — Update the information state to reflect what the user would know after reading the first result.
- Compute Gain For Remaining — Estimate gain for each remaining candidate relative to the current state. Apply composite scoring.
- Pick Next And Update State — Select the highest-composite candidate. Add it to the result list and update the information state.
- Iterate Until SERP Filled — Repeat selection until the SERP slots are filled. Each iteration uses the up-to-date state.
- Render Final List — Composed list of complementary results renders in the SERP. Users see diverse perspectives at the top.
Quality Control
Information-gain reranking can over-diversify or surface off-topic novelty. The patent specifies safeguards.
- Relevance Floor — A minimum relevance threshold is enforced. Candidates below the floor cannot rank even if their information gain is high.
- Gain Estimator Calibration — Gain estimates are calibrated against user-judged diversity. Wrong calibration would over- or under-diversify.
- Per-Query-Type Weighting — Gain weight varies per query type. News queries value freshness over diversity; reference queries value diversity. Each query type is calibrated separately.
- Anti-Spam In Diversity Selection — Off-topic novelty (spam claiming to be a different angle) is filtered before reranking. Diversity is real only when candidates are all relevant.
- Position Awareness — Slot 1 prioritizes relevance more than slot 5. The composite weighting can vary by position so the SERP starts with the strongest result.
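Two of these safeguards, the relevance floor and position-aware weighting, are easy to illustrate. The sketch below is an assumption-laden toy: the floor of 0.3, the linear weight schedule in `gain_weight_for_slot`, and the dictionary-based candidates are invented for the example and do not come from the patent.

```python
def gain_weight_for_slot(slot, base=0.2, step=0.1, cap=0.6):
    """Hypothetical schedule: slot 0 leans on relevance, later slots on gain."""
    return min(cap, base + step * slot)


def apply_relevance_floor(candidates, floor=0.3):
    """Drop candidates below the floor before any gain is computed."""
    return [c for c in candidates if c["relevance"] >= floor]


def rerank_with_safeguards(candidates, slots, floor=0.3):
    remaining = apply_relevance_floor(candidates, floor)
    seen = set()
    ranked = []
    while remaining and len(ranked) < slots:
        w = gain_weight_for_slot(len(ranked))

        def score(c):
            topics = c["topics"]
            gain = len(topics - seen) / len(topics) if topics else 0.0
            return w * gain + (1.0 - w) * c["relevance"]

        best = max(remaining, key=score)
        ranked.append(best["id"])
        remaining.remove(best)
        seen |= best["topics"]
    return ranked


candidates = [
    {"id": "guide", "relevance": 0.9, "topics": {"basics", "setup"}},
    {"id": "guide-copy", "relevance": 0.88, "topics": {"basics", "setup"}},
    {"id": "benchmarks", "relevance": 0.6, "topics": {"performance"}},
    {"id": "spam", "relevance": 0.1, "topics": {"unrelated"}},
]
print(rerank_with_safeguards(candidates, slots=3))
# prints ['guide', 'benchmarks', 'guide-copy']
```

The "spam" entry has the most novel topic set but never competes, because the floor removes it before gain is considered.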
Real-World Application
Information-gain reranking influences modern Google SERPs through visible diversity in the top results, the 'different angles' rendering in some SERP features, and the multi-perspective answers in AI Overviews.
- Greedy Selection Algorithm — Iterative greedy selection picks each next result to maximize marginal information gain.
- Per-position Weight Variation — Gain weight varies by SERP position. Top slots emphasize relevance; lower slots emphasize diversity.
- Composite Score Form — Final score is a weighted combination of gain and relevance, calibrated per query type.
Why Differentiation Beats Coverage
Information-gain ranking penalizes pages that duplicate top-result content and rewards pages that add unique perspectives, original data, or expert angles. SEO that reads the SERP first and writes what is missing wins on this signal.
Why Original Data Earns Late-SERP Visibility
The further down the SERP a page sits, the harder it must work on information gain. Pages with original data, surveys, expert quotes, or first-hand evidence break through where pure-synthesis content cannot.
What This Means for SEO
When the engine ranks results by information gain combined with topical match, rather than by topical match alone, your job shifts from covering the query to delivering something the user did not already see in the results above you.
- Differentiation Beats Coverage — A page that repeats what the top three results already said adds little or no information gain. Read the SERP first, then write what is missing, not what is already there.
- Unique Data Is A Ranking Lever — Original surveys, internal data, expert quotes, and first-hand evidence raise your information gain score. Pages with proprietary inputs rank above pages that merely synthesize public ones.
- Late-SERP Content Has The Highest Bar — The further down the result list, the more new information a page must supply to justify the click. Pages that struggle to break into the top ten usually fail this test, not the relevance test.