Incremental update scheme for the hyperlink database that backs web-scale link analysis, supporting fast updates as crawls discover new links without rebuilding the full graph, with fault tolerance for the distributed storage layer.
Patent Overview
- Inventor
- Marc Najork, others
- Assignee
- Microsoft Corporation
- Filed
- 2006-04-21
- Granted
- 2007-10-25 (published application)
- Application Number
- US 11/408,852
The Challenge
The Challenge
The web's hyperlink graph is the foundation for PageRank and every link-based ranking signal. The graph has billions of edges and grows continuously as the crawler discovers new content. Maintaining it requires incremental updates that scale without rebuilding the full graph on every crawl pass.
- Full Rebuild Does Not Scale — Web-scale link graphs have billions of edges. Rebuilding the graph on every crawl pass takes hours; the system cannot pause link-dependent ranking that long.
- Crawls Discover New Links Continuously — Every crawl pass discovers new pages and new outbound links. The database must accept these as deltas without full reorganization.
- Distributed Storage Requires Coordination — Link data spans many storage nodes. Updates must coordinate across nodes without performance regressions.
- Reads Continue During Updates — PageRank computation and ranking queries hit the link database constantly. Updates cannot block reads or the system stalls.
- Faults Must Not Corrupt The Graph — Hardware failures, network partitions, and software bugs happen at scale. The update scheme must handle faults without data corruption or inconsistency.
Innovation
How The System Works
The patent defines an incremental update scheme that accepts link deltas (additions, deletions, modifications) from crawls, applies them to the distributed graph store without blocking reads, replicates updates across storage nodes for fault tolerance, and recovers from faults via versioned snapshots and incremental replay.
- Crawl Produces Link Deltas — Each crawl pass produces a stream of link deltas: new edges, deleted edges, target changes. Deltas batch into update transactions.
- Apply To Distributed Storage — Deltas distribute to the storage nodes responsible for the affected edges. Sharding by source URL or hash routes updates efficiently.
- Maintain Read Availability — Updates apply via versioned writes: new version of the affected region replaces the old, with reads continuing against the prior version until the new version is durable.
- Replicate For Fault Tolerance — Each storage node replicates to peers. Replication ensures durability against single-node failures and allows read load balancing.
- Snapshot Periodically — Periodic snapshots of the link database capture full state. Snapshots are the recovery points for fault-tolerant rebuild.
- Recover From Faults — On fault detection (corrupted shard, failed node), the system rebuilds the affected region from snapshots plus incremental replay. Service continues with degraded capacity during recovery.
- Update Consumers Continuously — PageRank computation, link-based ranking, and other consumers read the link database continuously. Incremental updates are visible within bounded latency.
Incremental Updates At Web Scale
The patent's load-bearing idea is to make the hyperlink database an incrementally-updated distributed store rather than a periodically-rebuilt monolith. Incremental updates plus distributed replication plus snapshot recovery handle web-scale evolution.
Maintain, Do Not Rebuild
Full rebuilds are too expensive at web scale. Maintaining the graph incrementally, with periodic snapshots for recovery, keeps the database continuously fresh without prolonged downtime.
- Incremental Delta Application — Crawl deltas apply incrementally. Most pages change rarely; updating only the affected regions scales much better than full rebuilds.
- Replicated Distributed Storage — Link data shards across storage nodes with replication. Single-node failures do not lose data; reads load-balance across replicas.
- Snapshot-Based Recovery — Periodic snapshots plus incremental replay support fault recovery. Service continues with degraded capacity during rebuild.
Technical Foundation
Technical Foundation
The patent specifies the link delta format, the distributed storage layer, the versioned write protocol, the replication scheme, the snapshot system, and the recovery procedures.
- Link Delta Format — Standardized delta format: source URL, target URL, operation (add, delete, modify), timestamp. Deltas batch into update transactions.
- Distributed Storage Layer — Sharded by source URL or hash. Each shard stores its assigned subset of the link graph. Storage is keyed for fast lookup and update.
- Versioned Write Protocol — Writes create new versions of affected regions. Reads continue against prior versions until new versions become durable. No blocking.
- Replication Scheme — Each shard replicates to peers. Replication factor handles single-node failures with no service impact.
- Snapshot System — Periodic full snapshots of all shards. Snapshots compress and store durably. They are the recovery points.
- Recovery Procedures — On fault detection, affected shards rebuild from snapshots plus incremental replay. Recovery happens in parallel across shards for speed.
The Process
The Process
The pipeline runs continuously, with crawl deltas flowing in and consumers reading out. The system maintains a continuously-updated link graph without rebuild downtime.
- Crawler Produces Deltas — Each crawl batch produces link deltas. Deltas stream to the update pipeline.
- Route To Storage Nodes — Deltas route to storage nodes responsible for the affected edges. Sharding by source URL distributes load.
- Apply Versioned Writes — Each storage node applies deltas via versioned writes. New version becomes available when durable.
- Replicate To Peers — Each shard's writes replicate to peer nodes. Replication ensures durability.
- Consumers Read Latest Version — PageRank and ranking consumers read the latest durable version. Reads load-balance across replicas.
- Snapshot Periodically — Periodic full snapshots run as background tasks. Snapshots store durably for recovery.
- Recover On Fault — Fault detection triggers shard rebuild from snapshot plus incremental replay. Service continues with degraded capacity during rebuild.
Quality Control
Quality Control
Inconsistent or corrupt link data poisons every link-based signal. The patent specifies safeguards.
- Versioned Write Consistency — Versioned writes ensure readers see consistent state. Half-applied updates do not leak to consumers.
- Replication Health Monitoring — Per shard, replica health monitors continuously. Failed replicas trigger replacement before they cause data loss.
- Snapshot Validation — Each snapshot validates for completeness and consistency. Bad snapshots are not used for recovery.
- Recovery Testing — Periodic recovery drills test the snapshot-plus-replay pipeline. Drills catch issues before they matter in production.
- Cross-Shard Consistency — Cross-shard consistency checks detect anomalies (broken edge references, dangling pointers). Detected issues trigger investigation.
Real-World Application
Incremental hyperlink-database architecture is foundational to every web-scale search engine. The primitives in this patent inform how engines maintain billions-of-edges link graphs continuously, without rebuild downtime that would freeze ranking.
- Incremental Update Model — Crawl deltas apply incrementally. Full rebuilds are avoided.
- Distributed-replicated Storage Architecture — Sharded distributed storage with replication. Scales to web-size graphs with fault tolerance.
- Snapshot-recoverable Recovery Model — Periodic snapshots plus incremental replay support fault recovery without data loss.
Why Crawl Latency Affects Link Signal Visibility
New links visible in the crawl propagate to the link database with bounded latency. Sites that earn links quickly see the link signal contribute to ranking within hours to days, not months. Crawl frequency on the linking site matters for how fast new link signal reaches ranking.
Why Sustained Link Acquisition Beats Bursts
The incremental update scheme processes link deltas continuously. Sustained slow link acquisition produces continuous positive signal contribution; bursts of link acquisition look anomalous in the temporal data and trigger spam-pattern detection (per other Najork patents).
<\/section>What This Means for SEO
What This Means for SEO
This patent maintains the web's link graph through incremental delta updates rather than full rebuilds, with replication and snapshot recovery. SEO implication: new links propagate into ranking within hours to days once crawled, and sustained acquisition reads as healthy while bursts look anomalous.
- New Links Propagate Quickly Once Crawled — Crawl deltas apply incrementally with bounded latency, so earned links can contribute to ranking within hours to days. Link building has a faster feedback loop than a periodic-rebuild model would allow.
- Crawl Frequency Of The Linking Site Matters — A new link's signal reaches ranking only after the linking page is crawled. Links from frequently-crawled, active sites register faster than links from rarely-crawled pages.
- Sustained Acquisition Reads As Healthy — Continuous delta processing means slow, steady link earning produces continuous positive signal. A consistent acquisition curve is the natural pattern the system expects.
- Bursts Look Anomalous — Sudden bursts of link acquisition stand out in the temporal data and can trigger spam-pattern detection in related systems. Avoid concentrated link spikes without a genuine triggering event.
- The Link Graph Is Always Current — Maintained incrementally rather than rebuilt, the graph stays continuously fresh with no downtime. There is no stale window to exploit; link changes are reflected on an ongoing basis.
- Lost Links Are Processed Too — Deltas include deletions, so links you lose are removed from the graph. Link signal is not permanent; maintaining live, relevant links matters as much as acquiring them.
- Scale Does Not Slow Updates — Sharded, replicated storage handles billions of edges without rebuild downtime. The system keeps up with the web's growth, so do not assume large scale delays your link signals.