Refines PageRank by weighting each link according to the probability that a real user would actually click it. The probability is derived from anchor text, link position, font, surrounding context, and historical click data, rather than treating every outbound link as equally probable.
Patent Overview
- Filed: 2004-05-25
- Granted: 2010-05-11
- Application Number: US 10/853,179
The Challenge
Original PageRank assumes every link on a page is equally likely to be followed. Real users do not behave that way. They click prominent in-body links and ignore footer boilerplate. Modeling links uniformly leaves a lot of signal on the table and lets manipulators exploit the gap.
- Uniform Link Probability Is Wrong In Practice — A user on a content page does not roll a die over every outbound link with equal probability. They follow the in-body link in the first paragraph far more often than the small grey link in the footer. PageRank's uniform assumption ignores this completely.
- Boilerplate Links Inflate Rank Unfairly — Site-wide footer and navigation links appear on every page of a domain. Under uniform PageRank their targets accumulate enormous incoming-link counts even though almost nobody clicks those links. The signal is structurally inflated.
- Manipulation Targets The Uniform Assumption — Paid-link networks and link-farm patterns exploit the assumption by burying links in dense outbound link blocks. Every link in the block gets the same uniform weight, even though the target link is the only one anyone is meant to follow.
- Anchor Text Carries Strong Click Signal — Descriptive anchor text predicts whether a link gets clicked. A link reading 'Wikipedia article on quantum mechanics' is more clickable than 'click here' on the same page. Uniform-link PageRank discards this signal entirely.
- Position And Visual Style Predict Behavior — Link font size, color, position on the page, and surrounding context all correlate with click probability. None of these signals reach PageRank in its original form. The reasonable surfer model folds them all in.
Innovation
How The System Works
Each link gets a per-link weight predicting the probability a real user would click it. The weight is computed from features of the link itself, the source page, and observed user behavior. PageRank's uniform 1/N becomes a learned distribution over outbound links.
- Extract Per-Link Features — For every link on every page, the system extracts features: anchor text, font size, position on the page, surrounding text, link target type, whether the source-target pair has been seen before. Features are computed at index time.
- Train A Click-Probability Model — A machine-learning model is trained against actual user behavior (toolbar data, query logs, or other observed clicks) to predict the probability that a user on the source page clicks each link. Features are the inputs, observed clicks are the targets.
- Score Every Link With Its Click Probability — At indexing time, every link on every page is scored with its learned click probability. The output is a per-link weight in the range zero to one. The set of weights for one page sums to less than one because some users do not click anything.
- Replace Uniform Weights In PageRank — When computing PageRank, the uniform 1/N distribution over outbound links is replaced with the learned weight distribution. A high-click in-body link contributes more rank than a low-click footer link.
- Iterate To A New Steady State — The modified PageRank iteration converges to a different steady state, one that weights pages by realistic surfer behavior rather than by uniform link counts. High-rank pages are now those that real users would plausibly reach.
- Combine With Other Ranking Signals — The reasonable surfer score is one input to the larger ranking pipeline. It is combined with text-match, freshness, and other signals to produce the final query-time ranking.
- Update Features As Page Layouts Change — When source pages are recrawled, link features are re-extracted. If a previously prominent link is moved to the footer, its weight drops on the next iteration. Rank flow reflects the current layout, not a frozen snapshot.
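As a minimal sketch of the weight replacement described in the steps above, the following compares the uniform 1/N split with a click-weighted split for a single page. The link names and click probabilities are invented for illustration and are not taken from the patent; the rescaling branch is a hypothetical safeguard, not a documented step.

```python
def uniform_weights(links):
    """Classic PageRank: every outbound link gets 1/N."""
    n = len(links)
    return {link: 1.0 / n for link in links}


def click_weights(click_probs):
    """Reasonable-surfer style: each link keeps its predicted click probability.

    The values may sum to less than one; the shortfall is the share of
    users who click nothing on the page.
    """
    total = sum(click_probs.values())
    if total > 1.0:
        # Hypothetical safeguard: rescale if the model over-predicts.
        return {link: p / total for link, p in click_probs.items()}
    return dict(click_probs)


# Hypothetical predicted click probabilities for one page.
page_links = {"in_body_link": 0.45, "sidebar_link": 0.10, "footer_link": 0.02}

uniform = uniform_weights(page_links)
learned = click_weights(page_links)
no_click_share = 1.0 - sum(learned.values())
```

Under the uniform split each of the three links receives one third of the page's outgoing rank. Under the learned split the in-body link receives most of it, and the remainder not assigned to any link is the no-click share.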
Click Probability Is The New Link Weight
The single load-bearing idea: a link is not a binary edge in a graph but an edge weighted by how likely it is to be traversed. Replacing the uniform weight with a learned probability turns PageRank into a behavioral model.
From Topology To Behavior
Original PageRank cared about the shape of the link graph. The reasonable surfer model cares about how real users move through that shape. Every other property of the patent flows from this shift.
- Link Features Predict Behavior — Anchor text, position, font, surrounding context, link history. Each feature carries information about click probability. The model learns to combine them into a single weight per link.
- Behavior Data Trains The Model — Toolbar data, query logs, and observed click streams provide ground truth. The model fits the link-feature mapping to actual user behavior at scale, so the predictions track reality rather than intuition.
- Manipulation Cost Rises — Inflating rank requires not just earning a link but earning a link a real user would click. Anchor text, position, and prominence are all visible to users, so a placed link aimed at gaming the model is also visible to readers.
Technical Foundation
The patent specifies the feature extraction, the learned model architecture, and the integration with the existing PageRank pipeline. Each component is engineered to scale across the full web crawl.
- Link Feature Vector — For each link, a vector of features is computed at index time: anchor text tokens, font size, font weight, link position (top, body, footer, sidebar), surrounding text words, target URL features, source-target history.
- Click-Probability Classifier — A learned model (logistic regression in early versions, more sophisticated models later) maps the feature vector to a probability between zero and one. The model is trained against observed click data from logged user sessions.
- Per-Page Weight Distribution — The weights for the outbound links on a page do not sum to one; they sum to the fraction of users who click any link at all. The remainder is treated as a uniform jump, similar to PageRank's damping mass.
- Modified PageRank Iteration — The iteration equation becomes r_new = d times (W times r) plus (1 - d) times u, where W is the learned weight matrix instead of the uniform-normalized link matrix M. Convergence behavior is similar to original PageRank.
- Feature Re-extraction On Recrawl — When source pages are refreshed, features are recomputed, weights are updated, and the iteration is rerun. The system stays in sync with current page layouts.
- Per-Feature Importance Auditing — Because the model is learned, individual features can be evaluated for their contribution. Position dominates anchor text on some link types; on others the relationship is reversed. The patent contemplates this kind of diagnostic.
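To make the feature vector and classifier concrete, here is a small sketch of a logistic model that maps a few link features to a click probability. The `LinkFeatures` fields, the coefficient values, and the position scores are all hypothetical; a real model would learn its coefficients from click logs and would use far more features.

```python
import math
from dataclasses import dataclass

# Hypothetical coefficients; a real model would learn these from click logs.
POSITION_SCORE = {"body": 1.5, "top": 0.8, "sidebar": -0.5, "footer": -2.0}
BIAS = -2.0
W_FONT = 0.05    # per pixel of font size
W_ANCHOR = 0.3   # per anchor token, capped at five tokens below


@dataclass
class LinkFeatures:
    anchor_tokens: int    # number of words in the anchor text
    font_size_px: float
    position: str         # "top", "body", "sidebar", or "footer"


def click_probability(f: LinkFeatures) -> float:
    """Logistic model: sigmoid of a weighted sum of link features."""
    z = (
        BIAS
        + POSITION_SCORE.get(f.position, 0.0)
        + W_FONT * f.font_size_px
        + W_ANCHOR * min(f.anchor_tokens, 5)
    )
    return 1.0 / (1.0 + math.exp(-z))


body_link = LinkFeatures(anchor_tokens=5, font_size_px=16, position="body")
footer_link = LinkFeatures(anchor_tokens=2, font_size_px=10, position="footer")

p_body = click_probability(body_link)
p_footer = click_probability(footer_link)
```

Each link is scored independently here, so the scores for one page are not guaranteed to sum to at most one. A per-page normalization or calibration step would be needed to turn them into the weight distribution the list describes.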
The Process
Production reasonable-surfer ranking adds a feature-extraction and weight-prediction stage to the existing PageRank pipeline. The iteration cost rises modestly; the model training is offline.
- Crawl And Parse Pages — Pages are fetched and parsed. Each link is identified along with its surrounding HTML and visual context. The parser preserves the information needed for feature extraction.
- Extract Per-Link Features — For every link, the feature vector is computed. Anchor text, position, font, neighbors, target type. Features are written alongside the link graph.
- Score Links With The Trained Model — The pre-trained click-probability model is applied to each feature vector. Each link receives a scalar weight. The weights for one source page sum to at most one, with any shortfall representing users who click nothing.
- Build The Weighted Link Matrix — The classical link matrix is replaced with a matrix of learned weights. Each entry W[j][i] is the probability that a user on page i clicks the link to page j, so W times r sums the rank flowing into each page.
- Iterate The Modified PageRank Equation — Run r_new = d times (W times r) plus (1 - d) times u until convergence. The math is identical to PageRank; only the matrix is different.
- Publish The Rank Vector — The converged rank is written to the index. The query path reads it as a static feature for every candidate document, same as classical PageRank.
- Retrain Periodically — As user behavior shifts, the click-probability model drifts out of date. The patent describes periodic retraining on fresh click data to keep predictions calibrated.
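The matrix-building and iteration steps can be sketched as a short power iteration. This is an illustrative reading of r_new = d times (W times r) plus (1 - d) times u, not the patent's implementation: the weights are stored as a dictionary keyed by (source, target) rather than as a dense matrix, u is taken to be uniform, and rank that a page passes to no link (its no-click share) is spread uniformly, which is one interpretation of the "uniform jump". The function name, graph, and weights are hypothetical.

```python
def reasonable_surfer_rank(weights, n_pages, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for r_new = d * (W r) + (1 - d) * u with uniform u.

    weights maps (source, target) -> click probability. The weights leaving
    any one source are assumed to sum to at most one.
    """
    out_sum = [0.0] * n_pages
    for (src, _dst), w in weights.items():
        out_sum[src] += w

    rank = [1.0 / n_pages] * n_pages
    for _ in range(max_iter):
        # Rank that follows no link because users on a page click nothing.
        leaked = sum(rank[i] * (1.0 - out_sum[i]) for i in range(n_pages))
        base = (1.0 - damping) / n_pages + damping * leaked / n_pages
        new_rank = [base] * n_pages
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * w * rank[src]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page graph: page 0 links to page 1 prominently
# and to page 2 from its footer; pages 1 and 2 each link back to page 0.
weights = {
    (0, 1): 0.60,
    (0, 2): 0.05,
    (1, 0): 0.50,
    (2, 0): 0.50,
}
rank = reasonable_surfer_rank(weights, n_pages=3)
```

Pages 1 and 2 have the same single inbound source, so any difference in their rank comes only from the click weight on the link that reaches them.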
Quality Control
The model is only as good as its training data and its features. The patent describes specific defenses against feature drift, click-stream manipulation, and model degradation.
- Click Data Cleaning — Bot traffic, automated clicks, and other non-human signal must be filtered before training. The patent describes heuristics to identify and discard suspicious sessions so the model learns from real users.
- Position Bias Correction — Users click higher-positioned links more often regardless of relevance. The model is calibrated to separate position effect from content effect so a footer link with strong anchor text is not undervalued.
- Feature Sanity Bounds — Extracted features are clipped to reasonable ranges to prevent outliers from dominating model output. A page with a one-million-pixel font size does not produce an absurd click probability.
- Model Stability Monitoring — The distribution of predicted weights is monitored across crawl refreshes. Sudden shifts trigger investigation, since they often indicate a feature pipeline regression rather than a real change in user behavior.
- Cross-Domain Generalization — Models are evaluated on held-out domains to ensure they generalize beyond the training set. A model that only works on the sites it was trained on would fail when applied to the open web.
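Two of the checks above, feature clipping and stability monitoring, are simple enough to sketch. The bounds, the feature names, and the drift threshold below are hypothetical placeholders; a production system would derive them from observed data and would likely compare full distributions rather than means.

```python
from statistics import mean

# Hypothetical bounds; real limits would come from observed feature ranges.
FEATURE_BOUNDS = {"font_size_px": (6.0, 72.0), "anchor_tokens": (0, 20)}


def clip_features(features):
    """Clamp each known feature into its allowed range."""
    clipped = dict(features)
    for name, (low, high) in FEATURE_BOUNDS.items():
        if name in clipped:
            clipped[name] = max(low, min(high, clipped[name]))
    return clipped


def weights_shifted(previous, current, max_mean_change=0.05):
    """Flag a crawl refresh whose mean predicted weight moved too far."""
    return abs(mean(current) - mean(previous)) > max_mean_change


# An absurd font size is clamped before it reaches the model.
raw = {"font_size_px": 1_000_000.0, "anchor_tokens": 3}
safe = clip_features(raw)

# Small movement between refreshes passes; a large jump is flagged.
stable = weights_shifted([0.10, 0.20, 0.30], [0.11, 0.19, 0.31])
regressed = weights_shifted([0.10, 0.20, 0.30], [0.60, 0.70, 0.80])
```

A flagged shift only signals that something changed. Deciding whether the cause is a pipeline regression or a genuine change in user behavior still requires investigation.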
Real-World Application
The reasonable surfer model is widely believed to have become a core layer in Google's link analysis stack. Its influence is most visible in how much anchor text and link position have come to matter in SEO outcomes.
- 5-10x Effective Link Value Ratio — A prominent in-body editorial link can carry many times the weight of a footer or sidebar link from the same source. The exact multiplier depends on features but the order of magnitude is well-attested.
- 0 Effective Weight Of Pure Boilerplate — Site-wide footer links that nobody clicks effectively receive zero weight under the reasonable surfer model. They no longer inflate rank for their targets, removing a class of low-effort link manipulation.
- 100% Coverage Of The Link Graph — The model is applied to every link on every crawled page. There is no sampling or opt-in. Every link in the index has a learned weight.
Anchor Text Becomes A Ranking Lever
Before reasonable surfer, anchor text mattered mostly as a keyword signal. After, it matters as both a keyword signal and a click-probability input. SEO practice around anchor diversity, naturalness, and descriptiveness traces directly to the weight this patent gives anchor text as a click signal.
Link Placement Matters More Than Link Count
A single editorial link in the first paragraph of an authoritative article outperforms dozens of links from footer modules. The reasonable surfer is the reason. Modern link-building strategy that prioritizes placement over volume is a direct consequence.
What This Means for SEO
The reasonable surfer model weights links by the probability a real user would click them, so where a link sits on the page matters as much as that it exists.
- Link Position Modulates Authority Flow — A footer link transfers less authority than an in-body editorial link. Audit not just inbound links, but where they sit on the referring page.
- Anchor Text Plus Surrounding Context Matters — The model considers the anchor and the words around it. A descriptive anchor in an informative paragraph beats a generic "click here" anchor every time.
- Link Density Hurts Per-Link Value — A page with one link to you transfers more authority than a page with twenty links of which one is to you. Pursue editorial mentions in low-link-density contexts.