The patent describes two-stage decoding: the model first predicts a short sequence of discrete latent variables, then decodes the output tokens in parallel, conditioned on those latents. This is the kind of production-latency technique that makes LLM-served search features such as AI Overviews and Gemini responses economically viable at Google query scale.
Patent Overview
- Inventors: Lukasz Kaiser, Ashish Vaswani, Noam Shazeer, Aurko Roy, Samy Bengio, Niki Parmar
- Assignee: Google LLC
- Filed: 2019-02-11
- Granted: EP granted; published 2020-11-04
The Challenge
Autoregressive Transformer decoding generates one token at a time. Each token must wait for all previous tokens to be sampled before it can be computed. This sequential bottleneck caps throughput and inflates per-query latency, making naive Transformer decoding too expensive to serve at web-search scale.
- Sequential Decoding Is The Bottleneck — Each token must wait for the previous token, so parallel hardware is underutilized during inference.
- Latency Scales With Output Length — Per output, longer answers cost linearly more wall-clock time. Long-form answers become uneconomic.
- Non-Autoregressive Alone Drops Quality — Per output, fully parallel decoding without conditioning loses coherence. Quality degrades sharply.
- Continuous Latents Are Hard To Learn — During training, continuous latent spaces tend to collapse or stop carrying the conditioning signal, so the latent must be discrete and informative.
- Serving Cost Blocks LLM Search — Per query, full autoregressive cost makes per-query LLM generation uneconomic at search volume.
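The latency argument above comes down to counting dependent decoding steps. The following is a minimal sketch of that arithmetic, not a measurement. It assumes a hypothetical compression factor of 8 output tokens per latent (the source does not state a ratio) and counts each parallel pass as a single step.

```python
import math


def sequential_steps_autoregressive(output_len: int) -> int:
    """Fully autoregressive decoding: one dependent step per output token."""
    return output_len


def sequential_steps_latent(output_len: int, compression: int,
                            refine_passes: int = 0) -> int:
    """Latent-first decoding: one dependent step per latent, one parallel
    token pass, plus any optional refinement passes.

    `compression` (output tokens per latent) is an illustrative assumption.
    """
    latent_len = math.ceil(output_len / compression)
    return latent_len + 1 + refine_passes


if __name__ == "__main__":
    for n in (32, 128, 512):
        print(n,
              sequential_steps_autoregressive(n),
              sequential_steps_latent(n, compression=8))
```

Under these assumptions the number of dependent steps still grows with output length, but at roughly one eighth of the autoregressive rate. Real wall-clock savings depend on model sizes and hardware, which this count ignores.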
Innovation
How The System Works
The system splits decoding into two stages. First, an autoregressive sub-model predicts a short sequence of discrete latent variables that compactly describes the target output. Second, a decoder consumes those latents and emits the full output sequence in parallel. The short latent sequence preserves coherence; the parallel emission preserves speed.
- Encode Source — Per input, encoder produces source representations.
- Learn Discrete Latent Codebook — Per training, target sequences map to short discrete latent sequences via vector quantization.
- Predict Latents Autoregressively — Per query, a small autoregressive model emits the short latent sequence.
- Condition Decoder On Latents — Per token position, decoder reads source plus latents as conditioning.
- Decode Output In Parallel — Per output, all tokens are emitted in one or a few non-autoregressive passes.
- Refine If Needed — Per output, optional iterative refinement passes correct residual errors.
- Return Final Sequence — Per query, full output delivered with near-autoregressive quality.
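The control flow of the steps above can be sketched as follows. This is an illustrative skeleton only: `decode_two_stage`, `toy_next_latent`, and `toy_decode_position` are hypothetical names, and the two toy functions are arithmetic stand-ins for trained models. The point is the dependency structure: each latent depends on the latents before it, while each output position depends only on the source and the finished latent sequence.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures for the two learned components.
LatentFn = Callable[[Sequence[int], Sequence[int]], int]
TokenFn = Callable[[Sequence[int], Sequence[int], int], int]


def decode_two_stage(source: Sequence[int],
                     next_latent: LatentFn,
                     decode_position: TokenFn,
                     latent_len: int,
                     output_len: int) -> Tuple[List[int], List[int]]:
    # Stage 1: short autoregressive loop. Each latent sees the latent prefix.
    latents: List[int] = []
    for _ in range(latent_len):
        latents.append(next_latent(source, latents))

    # Stage 2: every position reads only (source, latents, position), never
    # another output token, so a real system can compute them in parallel.
    tokens = [decode_position(source, latents, i) for i in range(output_len)]
    return latents, tokens


def toy_next_latent(source: Sequence[int], prefix: Sequence[int]) -> int:
    """Stand-in for the latent predictor: a code in a 16-entry codebook."""
    return (sum(source) + 7 * len(prefix) + sum(prefix)) % 16


def toy_decode_position(source: Sequence[int], latents: Sequence[int],
                        position: int) -> int:
    """Stand-in for the parallel decoder: two output positions per latent."""
    latent = latents[min(position // 2, len(latents) - 1)]
    return (31 * latent + position) % 100


latents, tokens = decode_two_stage([3, 1, 4], toy_next_latent,
                                   toy_decode_position,
                                   latent_len=4, output_len=8)
```

Because stage 2 never reads another output token, the list comprehension could be replaced by a single batched forward pass without changing the result.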
Latent Plan, Then Parallel Fill
The load-bearing idea is that the model commits to a compressed plan before generating tokens. The plan is short enough to decode sequentially without latency pain. The tokens are conditioned tightly enough on the plan to be emitted in parallel without losing coherence.
Compress The Plan, Parallelize The Prose
For each output, the discrete latents carry the structural commitment, and the tokens are a parallel realization of that commitment.
- Discrete Latent Plan — Short codebook sequence captures output shape.
- Parallel Token Emission — Output tokens decoded in one pass.
- Near-Autoregressive Quality — Plan conditioning preserves coherence.
Technical Foundation
The patent specifies an encoder, a discrete latent quantizer, a latent predictor, a non-autoregressive decoder, an optional refiner, and a training objective that ties these components together.
- Source Encoder — Per input, Transformer encoder produces contextual representations.
- Discrete Latent Quantizer — Per training, vector quantization maps targets to a discrete codebook.
- Latent Predictor — Per query, small autoregressive model emits the latent sequence.
- Non-Autoregressive Decoder — Per output, parallel decoder conditioned on source plus latents.
- Optional Refiner — Per output, iterative pass corrects residual errors.
- Joint Training Objective — Per training, reconstruction plus latent prediction losses are optimized together.
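The quantizer component can be illustrated with a nearest-neighbour codebook lookup, the core operation of vector quantization. This is a minimal sketch under stated assumptions: the mean pooling used to shorten the sequence, the `group_size` parameter, and the function names are all hypothetical choices for illustration, and the source does not say how targets are compressed before quantization.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `vector`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(vector, codebook[k]))


def encode_targets(target_vectors: Sequence[Vector],
                   codebook: Sequence[Vector],
                   group_size: int) -> List[int]:
    """Map a target sequence to a shorter sequence of discrete codes.

    Assumption: each group of `group_size` target vectors is mean-pooled
    into one vector before the codebook lookup.
    """
    codes: List[int] = []
    for start in range(0, len(target_vectors), group_size):
        group = target_vectors[start:start + group_size]
        pooled = [sum(dim) / len(group) for dim in zip(*group)]
        codes.append(quantize(pooled, codebook))
    return codes
```

For example, with a three-entry codebook and `group_size=2`, four target vectors become two codes. The codebook size and the group size are the two knobs that trade expressiveness against how hard the latent sequence is to predict.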
The Process
Training learns the codebook and the two predictors jointly. Inference splits cleanly into the short autoregressive latent pass and the parallel token pass.
- Tokenize Source — Per input, source converted to tokens.
- Encode Source — Per input, encoder produces representations.
- Predict Latent Sequence — Per query, latent predictor emits short discrete codes.
- Expand To Output Length — Per output, decoder positions are conditioned on latents.
- Emit Tokens In Parallel — Per output, all tokens decoded in one pass.
- Refine — Per output, optional refinement pass.
- Return Output — Per query, final sequence returned with low latency.
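The optional refinement step in the process above can be sketched as a bounded loop over whole-sequence passes. This is a hedged illustration, not the patent's procedure: `refine` and `drop_repeats` are hypothetical names, and `drop_repeats` is a toy stand-in for a learned refinement pass. It removes immediate repetitions, a typical error in parallel decoding.

```python
from typing import Callable, List, Tuple

RefinePass = Callable[[List[str]], List[str]]


def refine(draft: List[str], refine_pass: RefinePass,
           max_passes: int) -> Tuple[List[str], int]:
    """Apply up to `max_passes` refinement passes.

    Stops early once a pass leaves the draft unchanged. Returns the final
    sequence and the number of passes actually run.
    """
    passes_used = 0
    for _ in range(max_passes):
        updated = refine_pass(draft)
        passes_used += 1
        if updated == draft:
            break
        draft = updated
    return draft, passes_used


def drop_repeats(tokens: List[str]) -> List[str]:
    """Toy refinement pass: drop any token equal to its left neighbour."""
    return [tok for i, tok in enumerate(tokens)
            if i == 0 or tok != tokens[i - 1]]


final, used = refine(["the", "the", "cat", "cat", "cat", "sat"],
                     drop_repeats, max_passes=3)
```

Setting `max_passes` to 0 skips refinement entirely, which is the lowest-latency configuration. Each additional pass costs one more parallel step.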
Quality Control
The two-stage split risks losing quality versus the fully autoregressive baseline. The patent specifies safeguards that keep the gap small.
- Codebook Capacity Tuning — Per training, codebook size balances expressiveness and prediction difficulty.
- Latent Length Schedule — Per output, latent sequence length tuned for compression versus fidelity.
- Iterative Refinement — Per output, refinement passes recover from parallel decoding errors.
- Joint Loss Balancing — Per training, latent prediction and token reconstruction losses are weighted.
- Quality Versus Latency Gating — Per deployment, refinement count gates the latency-quality tradeoff.
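The loss balancing mentioned above can be sketched as a weighted sum of two cross-entropy terms. This is a simplified illustration under assumptions: `latent_weight` and the function names are hypothetical, the inputs are already-normalized probability lists, and quantizer-specific terms that real discrete-latent training typically adds are omitted.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def mean_cross_entropy(all_probs: Sequence[Sequence[float]],
                       targets: Sequence[int]) -> float:
    losses = [cross_entropy(p, t) for p, t in zip(all_probs, targets)]
    return sum(losses) / len(losses)


def joint_loss(token_probs: Sequence[Sequence[float]],
               token_targets: Sequence[int],
               latent_probs: Sequence[Sequence[float]],
               latent_targets: Sequence[int],
               latent_weight: float = 1.0) -> float:
    """Token reconstruction loss plus weighted latent prediction loss."""
    reconstruction = mean_cross_entropy(token_probs, token_targets)
    latent_prediction = mean_cross_entropy(latent_probs, latent_targets)
    return reconstruction + latent_weight * latent_prediction
```

A larger `latent_weight` pushes training toward latents that are easy to predict at inference time; a smaller one favours latents that reconstruct the output well. The right balance is an empirical question the source does not quantify.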
Real-World Application
Fast decoding via discrete latents is the production-latency primitive underneath LLM-served search features. Without this class of technique, per-query Transformer generation at search volume is uneconomic. The pattern shows up across translation, summarization, and the LLM-served answer surfaces Google now ships.
- Compression Stage: Short latent plan — Discrete codes capture output structure.
- Decoding Stage: Parallel emission — Output tokens generated in one pass.
- Quality Achievement: Near-AR quality — Latent conditioning preserves coherence.
Why Per-Query LLM Answers Became Tractable
Per query, fully autoregressive generation at Google search volume is cost-prohibitive. Fast-decoding techniques like this one make LLM-served answers tractable to serve.
Why Plan-Then-Fill Beats Plan-Free Parallel
Per output, plan-free non-autoregressive decoding loses coherence. Conditioning parallel decoding on a discrete latent plan preserves output quality.
What This Means for SEO
Fast decoding is the unglamorous infrastructure decision that turned LLM-served search from a research demo into a routine SERP outcome. The lossy compression that makes it fast also shapes which sources and which content structures survive intact through the pipeline.
- LLM-Served Answers Are Now Routine — Fast decoding makes AI Overviews, SGE, and Gemini responses economically viable at Google's query volume. Without this class of technique, per-query LLM generation at search scale would be cost-prohibitive. SEO strategy must treat LLM-served answers as a routine, not occasional, SERP outcome.
- The Latent Plan Rewards Structured Content — The discrete latent stage commits to a high-level answer structure before tokens are generated. Content with clear headings, summaries, and FAQ-style organization maps more cleanly into that plan than sprawling unfocused prose. Outline-shaped writing aligns with how the system commits to an answer structure.
- Lossy Compression Punishes Sprawl — The short discrete latent sequence is a lossy compression of the target. Content that is summarizable, self-contained, and tightly organized survives that compression cleanly. Dense, meandering, hedged content loses signal in the squeeze.
- Citation Routing Depends On Plan Sources — Which sources the AI Overview cites depends on which sources the latent plan and decoder draw from when constructing the answer. Being a clean, citeable, structurally legible source raises the odds of being the source the plan reaches for.
- There Is A Ceiling On Response Richness — The latency-quality tradeoff of fast decoding caps how rich a single served response can be. Concise, definitive, well-cited answers fit the budget better than dense exhaustive prose. Aim to be the source that fills the budget cleanly, not the longest source on the topic.
- Outline-First Content Aligns With The Mechanism — The plan-then-fill pattern resembles drafting an outline and then writing to it. Content organized that way, with explicit outline-level structure and clean section commitments, aligns with how the model commits to and then realizes an answer.
- Production Latency Sets The Quality Bar — Fast decoding is what brought frontier LLMs into per-query search serving. The systems judging and quoting your content are running at production latency, not research bench speed. The quality bar for being chosen as the cited source rises accordingly.