Stores phrase posting lists across tiers and shards so the most-commonly-queried phrases live on fast in-memory tiers while rare phrases live on slower disk tiers, letting the index scale to billions of phrases without blowing the latency budget.
Patent Overview
- Filed
- 2007-03-30
- Granted
- 2010-04-06
- Application Number
- US 11/694,750
The Challenge
The Challenge
A phrase-based index has orders of magnitude more entries than a word-based index. Storing all phrase posting lists in memory is infeasible, but storing them all on disk produces unacceptable query latency. The system needs a tiered storage architecture that puts hot phrases where they can be read fast.
- Phrase Index Is Vastly Larger Than Word Index — Where a word index has millions of entries, a phrase index has hundreds of millions to billions. The size difference makes naive storage strategies impossible at production scale.
- Query Latency Demands Memory Access For Hot Items — Sub-100ms search latency requires that frequently queried phrases be read from memory, not disk. A purely disk-based index cannot meet the latency budget no matter how it is optimized.
- Phrase Query Frequency Follows Power Law — Most phrase queries hit a small set of hot phrases; the long tail is enormous but rarely accessed. Storage cost can be saved by treating hot and cold phrases differently.
- Sharding Distributes Load Across Servers — Even hot-tier memory must be sharded across many servers to handle traffic. The sharding scheme must support efficient routing so queries hit the right shard quickly.
- Updates Must Not Block Reads — The index updates continuously as the crawl refreshes. The architecture must allow incremental updates without blocking query traffic, even on heavily-loaded shards.
Innovation
How The System Works
The patent splits the phrase index into multiple tiers (memory, SSD, disk) based on access frequency, shards each tier across many servers, and routes queries through a hierarchical dispatcher that tries hot tiers first and falls through to slower tiers only for cold-phrase queries.
- Classify Phrases By Frequency Tier — Each phrase is assigned to a frequency tier based on its observed query rate: hot, warm, cold. Hot phrases are queried frequently; cold phrases are rarely queried. Classification updates as access patterns shift.
- Place Tier Storage Appropriately — Hot tier lives in RAM across many memory servers. Warm tier lives on fast SSDs. Cold tier lives on slower spinning disks or compressed archival storage. Each tier has its own cost-performance profile.
- Shard Each Tier Across Servers — Within a tier, phrases are sharded across many servers using a hash of the phrase as the shard key. Sharding distributes both storage and query load evenly across the fleet.
- Route Queries Hierarchically — Query dispatcher first tries the hot-tier shard for the phrase. If found, the posting list is returned immediately. If not, the dispatcher tries warm, then cold. Most queries terminate at the hot tier.
- Process Posting Lists In Parallel — When a query involves multiple phrases, each phrase's posting list is fetched in parallel from its respective shard. The intersection happens in the dispatcher after all lists arrive.
- Handle Updates Incrementally — When the crawl produces new posting list entries, updates are written incrementally to the appropriate tier and shard. Reads continue uninterrupted while writes propagate.
- Re-Tier As Access Patterns Shift — Phrases can move between tiers as their access frequency changes. A newly-trending phrase moves up to the hot tier; a fading phrase falls to cold. Re-tiering happens on a periodic schedule.
Tier By Access Frequency, Shard By Hash
The two-axis decomposition (tier vertically by frequency, shard horizontally by hash) is what makes the phrase index practical at web scale. Either dimension alone falls short; together they yield a storage architecture that scales while preserving latency.
Hot Data In Fast Memory, Cold Data On Cheap Disk
The price-performance gradient across memory, SSD, and disk is steep. By matching phrase access patterns to storage tier, the system buys the right performance at the right cost.
- Tiered Storage — Three tiers (hot RAM, warm SSD, cold disk) match the power-law distribution of phrase query frequency. Hot phrases live where access is fastest; cold phrases live where storage is cheapest.
- Hash Sharding — Within each tier, phrases are sharded by hash across many servers. Load distributes evenly and any single server failure affects only its shard, not the whole tier.
- Hierarchical Routing — Queries try hot first, then warm, then cold. Most terminate at hot. The hierarchical fallback keeps the average latency low while preserving worst-case correctness.
Technical Foundation
Technical Foundation
The patent specifies the tier-classification model, the sharding scheme, the query dispatcher logic, and the incremental update protocol.
- Frequency-Tier Classifier — Each phrase has a measured query rate. Thresholds split phrases into hot, warm, and cold tiers. Re-classification happens periodically as observed rates shift.
- Shard Hash Function — Phrases are sharded via consistent hashing so adding or removing servers requires minimal data movement. The hash distributes load uniformly across the shard fleet.
- Query Dispatcher — The dispatcher accepts incoming queries, decomposes them into phrase lookups, routes each lookup to the appropriate tier-shard, and assembles the responses. Routing happens in microseconds.
- Posting List Format — Posting lists are stored in compact compressed form. Each entry records a document ID and position information. Compression schemes differ per tier to trade decoding cost against storage size.
- Incremental Update Protocol — Updates are append-only within each tier-shard. Periodic compaction merges updates into the main posting lists. Queries during compaction read from both the main list and the pending updates.
- Replication For Availability — Each shard is replicated across multiple physical servers so single-server failures do not cause unavailability. The dispatcher routes around failed replicas automatically.
The Process
The Process
The architecture runs as a production service handling billions of queries per day. Each query traverses the tiered, sharded index in milliseconds, even when phrases are rarely accessed.
- Receive Query — A query enters the dispatcher. The dispatcher parses the query into a set of phrases that need posting-list lookups.
- Compute Tier And Shard Per Phrase — For each phrase, the dispatcher computes the current tier (from the tier-classification table) and the shard (from the hash). Routing is O(1) per phrase.
- Fetch Posting Lists In Parallel — The dispatcher sends parallel requests to all relevant tier-shards. Each shard responds with the requested posting list or a not-found signal.
- Fall Through For Misses — If a phrase missed the hot tier, the dispatcher retries against warm, then cold. Most queries do not need fallback; misses are tracked for re-tiering.
- Intersect And Score Documents — Once all posting lists arrive, the dispatcher intersects them, scores candidate documents, and returns the top results to the caller.
- Log Access Metrics — Per-phrase access frequencies are logged. The metrics feed the next re-tiering pass.
- Periodic Re-Tier And Compact — Periodic background jobs re-tier phrases based on updated access metrics and compact incremental updates into main posting lists. Both run without interrupting query traffic.
Quality Control
Quality Control
Tiered storage and hierarchical routing introduce many failure modes. The patent specifies safeguards that keep the architecture reliable in production.
- Replica Health Monitoring — Each shard's replicas are continuously health-checked. Failed replicas are removed from the dispatcher's routing table within seconds. Queries route to healthy replicas automatically.
- Tier Cache Consistency — When a phrase is promoted to a hotter tier, the cooler tiers may still hold stale versions. The protocol ensures readers always see the freshest version regardless of which tier serves the request.
- Re-Tier Stability — Phrases near a tier boundary could oscillate. Hysteresis margins prevent oscillation, so phrases only move tier when their access frequency clearly belongs to the destination tier.
- Update Visibility Lag Bounds — Writes propagate to all replicas within bounded time. Queries during the propagation window may see slightly stale results, but the lag is monitored and bounded by SLA.
- Capacity Headroom Management — Each tier maintains capacity headroom so unexpected access spikes do not overflow into the next tier uncontrollably. Automatic scaling triggers when headroom drops below thresholds.
Real-World Application
Tiered, sharded posting list architecture is the storage substrate that made Google's phrase-based index practical. Its primitives generalize to any large-scale search system facing similar power-law access patterns.
- 3-tier Storage Hierarchy — Hot RAM, warm SSD, cold disk. The three-tier hierarchy matches the price-performance gradient of available storage and the access pattern of real queries.
- Hash-sharded Distribution Method — Within each tier, phrases shard across many servers via consistent hashing. Load distributes uniformly and rebalances incrementally when servers are added or removed.
- <100ms Query Latency — Even queries that traverse multiple tier fallbacks complete in under 100ms because the hot tier handles the overwhelming majority of phrase lookups directly from RAM.
Why Phrase Coverage Drives Indexing
Because the index is structured around phrases (not just words), content that uses many canonical phrases for its topic is densely represented in the posting lists. This compounds the visibility advantage covered in the Phrase-Based Indexing patent: phrase-rich content gets indexed phrase-rich.
Why Index Latency Affects Crawl Strategy
The architecture's incremental update protocol means freshly-crawled pages enter the hot tier with some lag. Sites that update frequently and earn their phrases tend to stay in the hot tier; sites that publish rarely fall to the cold tier and refresh slowly.
<\/section>What This Means for SEO
What This Means for SEO
Phrase-based indexing means the engine is indexing your content by multi-word phrases, not just individual words.
- Phrase Coverage Replaces Keyword Density — You compete on phrase coverage, not keyword frequency. A piece that covers many canonical phrases for its topic outranks one that repeats a single keyword.
- Co-Occurring Phrases Signal Topical Depth — Phrases that experts on a topic naturally use together signal depth when they co-occur on your page. Compile a phrase glossary for each cluster and check coverage.
- Long-Tail Phrases Are Cheap Real Estate — Specific multi-word phrases face less competition than head terms. A page that nails one long-tail phrase per section captures traffic the head-term competitors miss.