Generates a vector embedding for a query on the fly when no cached embedding exists, then uses it to retrieve documents and content cards by similarity in a shared embedding space.
Patent Overview
- Inventor: Steven D. Baker
- Assignee: Google LLC
- Filed: 2018-10-16
- Granted: 2022-04-05
- Application Number: US 16/162,290
The Challenge
Embedding-Based Retrieval Needs A Cache And A Fallback
Vector-based retrieval is powerful but expensive. A production system needs to cache the embeddings of frequent queries while still serving queries that have no precomputed embedding. The fallback must produce a usable embedding fast enough to keep latency acceptable, and the embedding space must be shared across documents, entities, queries, and interests so that similarity is meaningful across all object types.
- Pre-Embedding Every Query Is Impossible — The space of possible queries is open-ended. A precomputed cache cannot cover everything, especially for long-tail and emerging queries.
- Bare Keyword Match Misses Semantic Neighbors — If a query has no embedding and the system falls back to keyword retrieval, it loses access to documents that are semantically related but lexically different. The fallback must be vector-aware.
- Need On-Demand Embedding Generation — The system needs to detect cache misses, generate an embedding for the query terms on the fly, and continue the similarity-based retrieval pipeline as if the embedding had been cached.
- Shared Space Across Object Types — For similarity retrieval to work across queries, documents, entities, and interests, all object types must live in the same vector space. Object-specific embeddings are not sufficient.
- Latency Budget Constrains Generation — On-demand embedding generation must fit within the query-serving latency budget. Heavyweight models are not feasible at runtime; lighter, query-tuned models are required.
Innovation
Generate Embeddings On Cache Miss, Then Match
When a query arrives whose embedding is not stored, the system generates an embedding for its terms. It then identifies web documents whose embeddings are similar to the generated embedding and provides the associated content cards in the user's feed. The cache-plus-fallback combination makes vector retrieval feasible for an open-ended query space.
- Receive Query — The query comprises one or more terms supplied by the user or the feed runtime. The system does not assume a cached embedding exists.
- Look Up Stored Embedding — Check whether an embedding for this query is already cached in memory. A cache hit returns the stored embedding immediately and skips generation.
- Generate If Missing — If no stored embedding exists, generate a new embedding for the query terms using the embedding model. The generation runs in line with query serving.
- Find Similar Document Embeddings — Identify web documents whose embeddings are similar to the generated query embedding under the chosen similarity metric. The retrieval uses approximate nearest-neighbor algorithms for scale.
- Find Similar Entity Embeddings — In parallel, identify entities whose embeddings are similar to the query embedding. Entity matches feed entity-rich card formats.
- Provide Content Cards In Feed — Surface content cards associated with the similar documents and entities in the user's content feed. Cards are ranked by similarity and other quality signals.
- Cache Newly Generated Embedding — Optionally cache the freshly generated embedding so subsequent queries with the same terms hit the cache. The cache grows organically with usage patterns.
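The lookup, generate-if-missing, and cache-the-result steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `toy_embed` is a hypothetical hash-based stand-in for a real embedding model, and `EmbeddingCache` and its whitespace and case normalization are assumptions made for the example.

```python
import hashlib
import math

DIM = 8  # toy dimensionality; production embeddings are far larger


def toy_embed(terms):
    """Hypothetical stand-in for the embedding model: hashes each term
    into a fixed-size vector, sums the results, and L2-normalizes."""
    vec = [0.0] * DIM
    for term in terms:
        digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class EmbeddingCache:
    """Returns a stored query embedding, generating one on a cache miss."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        # Assumed normalization: lowercase and collapse whitespace.
        key = " ".join(query.lower().split())
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._embed_fn(key.split())
        self._store[key] = embedding  # the optional caching step
        return embedding


cache = EmbeddingCache(toy_embed)
first = cache.get("vector  retrieval patents")   # miss: generated on demand
second = cache.get("Vector retrieval patents")   # hit: same normalized key
print(cache.hits, cache.misses)  # 1 1
```

The caller receives an embedding either way, so the downstream similarity search does not need to know whether the vector was cached or freshly generated.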
Embeddings As The Unified Retrieval Substrate
The patent positions embeddings not as a research curiosity but as the substrate of a production search-and-feed service. Queries, documents, entities, and user interests all live in the same vector space, and similarity is the universal retrieval operation. The cache-plus-generate architecture is what makes this practical.
One Geometry, Many Objects
Documents, entities, queries, and interests are all represented as n-dimensional vectors that can be compared to each other using the same similarity functions. The shared geometry is what makes cross-object retrieval coherent.
- Entity Embeddings — Each known entity has its own embedding, computed from its associated documents and structured data. Entities are first-class objects in the retrieval space.
- Document Embeddings — Each web document has an embedding computed from its content, used to retrieve similar documents and similar entities. Document embeddings are typically computed offline and indexed for similarity search.
- Query And Interest Embeddings — Queries and user interests are embedded into the same space, enabling personalized retrieval to be expressed as a single similarity search. Query embeddings are cached when possible, generated on demand when not.
Search becomes a geometry problem. Documents and entities are points; queries and interests are points; the answer is the nearest neighbors.
Technical Foundation
The Shared Embedding Space
Everything lives in n-dimensional vector space and is compared with cosine similarity, dot product, or Euclidean distance. The choice of metric depends on the kind of similarity desired.
- Query Embedding — An n-dimensional vector for one or more query terms, generated on demand when no cached version exists. The embedding model can be a transformer, a learned bilinear model, or a hybrid.
- Document Embedding — An n-dimensional vector for each indexed web document, computed offline and stored. Computed from the document's content, structure, and possibly external signals.
- Entity Embedding — An n-dimensional vector for each known entity, computed from the entity's associated documents and knowledge-graph attributes. Used for entity-level similarity retrieval.
- Similarity Function — A standard vector similarity (cosine, dot, Euclidean) used to find documents close to the query embedding. The choice influences both retrieval quality and computational cost.
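To make the metric choice concrete, here is a small sketch comparing the three similarity functions named above on toy two-dimensional vectors. The vectors and document names are invented for illustration; the point is only that the same query can rank candidates differently depending on the metric.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
docs = {
    "aligned_short": [0.5, 0.0],  # same direction as the query, small magnitude
    "offaxis_long": [3.0, 3.0],   # 45 degrees off the query, large magnitude
}

# Higher is better for cosine and dot product; lower is better for distance.
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_euclidean = min(docs, key=lambda name: euclidean(query, docs[name]))

print(best_by_cosine)     # aligned_short
print(best_by_dot)        # offaxis_long
print(best_by_euclidean)  # aligned_short
```

Cosine ignores magnitude and rewards direction, while the dot product lets a large-magnitude vector win even when it points away from the query. If all vectors are normalized to unit length, cosine and dot product produce the same ranking.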
Quality Metrics
- Cosine Similarity — Robust to vector magnitude, useful when only direction matters. The most common choice for embedding retrieval.
  cos(A, B) = (A · B) / (|A| * |B|)
- Dot Product — Useful when vector magnitude carries meaning (e.g., confidence). Less common than cosine for query-document matching.
  dot(A, B) = A · B
- Cache Hit Rate — The Golden Embeddings approach makes the system robust to low hit rates by generating embeddings on demand. Higher cache hit rates reduce average serving latency.
  hit_rate = |cache hits| / |total queries|
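The link between hit rate and average latency can be worked through with a weighted average. The sketch below is illustrative only: the query counts and the per-hit and per-miss latencies are invented numbers, not figures from the patent, and `expected_latency_ms` is a hypothetical helper name.

```python
def hit_rate(cache_hits, total_queries):
    """hit_rate = |cache hits| / |total queries|, guarding against zero traffic."""
    if total_queries == 0:
        return 0.0
    return cache_hits / total_queries


def expected_latency_ms(rate, hit_ms, miss_ms):
    """Average embedding-stage latency as a weighted mix of hits and misses."""
    return rate * hit_ms + (1.0 - rate) * miss_ms


# Assumed figures for illustration: 800 hits out of 1000 queries,
# 1 ms to read a cached embedding, 20 ms to generate one on demand.
rate = hit_rate(cache_hits=800, total_queries=1000)
avg = expected_latency_ms(rate, hit_ms=1.0, miss_ms=20.0)

print(rate)           # 0.8
print(round(avg, 2))  # 4.8
```

Under these assumed numbers, the misses account for most of the average cost even though they are only a fifth of the traffic, which is why the generation model has to be light.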
Key Insight: The unification of objects in a shared embedding space is what makes personalization and retrieval the same problem. Without a shared space, personalization needs a separate matching layer. With one, the same nearest-neighbor query finds documents for an explicit query and content for an inferred interest, just by changing the source vector.
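A minimal sketch of that idea, assuming a toy three-dimensional space and invented item names: one `nearest` function serves both an explicit query and an inferred interest, and only the source vector changes. A brute-force scan stands in for a real nearest-neighbor index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nearest(source_vector, index, k=2):
    """Brute-force top-k by cosine similarity over a shared index."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(source_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Documents and entities share one index because they share one space.
index = {
    "doc:espresso-guide": [0.9, 0.1, 0.0],
    "doc:trail-running": [0.0, 0.2, 0.9],
    "entity:coffee": [1.0, 0.0, 0.1],
    "entity:marathon": [0.1, 0.1, 1.0],
}

explicit_query = [0.8, 0.2, 0.0]     # stands in for a typed query embedding
inferred_interest = [0.0, 0.1, 0.8]  # stands in for a user-interest embedding

search_results = nearest(explicit_query, index)
feed_results = nearest(inferred_interest, index)
print(search_results)
print(feed_results)
```

Both calls go through the same code path. The first returns the coffee-related document and entity, the second the running-related ones, with no separate personalization layer involved.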
The Process
Runtime Query Path
The query path is short by design because embedding generation must fit within the query-serving latency budget.
- Query Arrives — A user query reaches the embedding pipeline either directly from search or via the feed runtime.
- Cache Probe — Probe the embedding cache for the query's terms. A cache hit returns the stored embedding immediately.
- On-Demand Generation — A cache miss triggers embedding generation using the runtime model. The generated embedding is used immediately and optionally cached for future use.
- Approximate Nearest Neighbor Search — Run an approximate nearest-neighbor search against the document and entity embedding indexes. Return the top-k candidates by similarity.
- Card Construction — For each candidate document or entity, construct a content card with appropriate metadata. Cards include source attribution and entity-rich elements.
- Rank And Render — Rank the cards by combined similarity and quality signals. Render the top cards to the user's feed surface.
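The last three steps (candidate search, card construction, ranking) might look roughly like the sketch below. It makes several assumptions that the source does not specify: a brute-force scan replaces the approximate nearest-neighbor index, `Card` and `build_feed` are hypothetical names, and the combined score is a simple weighted sum with an arbitrary `quality_weight`.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Card:
    title: str
    source: str        # source attribution shown on the card
    similarity: float
    quality: float

    def score(self, quality_weight):
        # Assumed blend: linear mix of similarity and a quality signal.
        return (1.0 - quality_weight) * self.similarity + quality_weight * self.quality


def build_feed(query_embedding, candidates, k=2, quality_weight=0.3):
    # Step 1: top-k candidates by similarity (an ANN index would do this at scale).
    scored = [(cosine(query_embedding, c["embedding"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Step 2: construct one card per surviving candidate.
    cards = [Card(c["title"], c["source"], sim, c["quality"]) for sim, c in scored[:k]]
    # Step 3: rank cards by the combined similarity and quality score.
    cards.sort(key=lambda card: card.score(quality_weight), reverse=True)
    return cards


candidates = [
    {"title": "Close but thin", "source": "a.example", "embedding": [1.0, 0.1], "quality": 0.2},
    {"title": "Near and strong", "source": "b.example", "embedding": [0.9, 0.4], "quality": 0.9},
    {"title": "Off topic", "source": "c.example", "embedding": [0.0, 1.0], "quality": 1.0},
]

feed = build_feed([1.0, 0.0], candidates)
print([card.title for card in feed])  # ['Near and strong', 'Close but thin']
```

In this toy run the off-topic candidate never becomes a card despite its high quality value, because it is filtered out at the similarity stage. Among the survivors, the quality signal is enough to lift the slightly less similar document to the top.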
What This Means for SEO
Vector retrieval is now production infrastructure, not just an experimental technique. Knowing that the system can place your documents in a shared space with queries, entities, and user interests reframes how you should think about topical authority, content coherence, and entity association.
- Semantic Similarity Beats Keyword Match — Documents that are semantically close to a query but lexically different are retrievable. Content can compete for queries it does not literally contain, as long as its embedding lands close to the query embedding.
- Entity Embeddings Compound With Content — Strongly associating your content with a recognized entity puts you in the entity's neighborhood in the shared space. Queries that match the entity, even loosely, surface your content because document and entity proximity is measured in the same space.
- Personalization Is Geometric — User interests are vectors. Your content is selected for a user's feed when your document embedding sits close to the user's interest embedding. Topical clarity, not surface keyword density, drives that proximity.
- Document-Level Coherence Matters More Than Ever — Embeddings reward coherent documents whose content forms a tight semantic cluster. Off-topic drift dilutes the vector and pushes you away from the entities and interests you want to be close to.
- Pages About Multiple Entities Sit Between Their Embeddings — A page that covers multiple entities will have an embedding somewhere between them in the shared space. This is fine for hub pages but problematic for pages that should claim a single entity neighborhood.
- Structured Data Sharpens The Embedding — Explicit entity markup (schema, structured data) gives the embedding model unambiguous signals about which entities your page is about. This sharpens your position in the vector space and tightens your proximity to the right neighbors.
- Long-Form Coherent Content Beats Thin Pages — Embeddings reward documents that have enough content to establish a clear semantic position. Very short pages have noisy embeddings that drift across topics. Length plus coherence is the winning combination.
- Cross-Language Embedding Spaces Exist Too — Multilingual embedding models put translated concepts near each other in the shared space. Localized content can inherit topical authority from its English counterpart through this proximity, complementing what the cross-language related-term patent describes.
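The earlier point about pages that cover multiple entities can be illustrated with a toy calculation. The sketch assumes a page embedding is the mean of its section vectors, which is a common simplification and not something the patent states; the entity and section vectors are invented.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Assumed pooling: the page embedding is the average of its section vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


entity_a = [1.0, 0.0]
entity_b = [0.0, 1.0]

# A focused page: every section is about entity A.
focused_page = mean_vector([[0.9, 0.1], [1.0, 0.0]])
# A mixed page: one section about entity A, one about entity B.
mixed_page = mean_vector([[0.9, 0.1], [0.1, 0.9]])

print(round(cosine(focused_page, entity_a), 3))
print(round(cosine(mixed_page, entity_a), 3))
print(round(cosine(mixed_page, entity_b), 3))
```

Under this assumption the mixed page lands equidistant from both entities and is noticeably farther from entity A than the focused page is, which matches the hub-page trade-off described above.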