The patent replaces convolution with self-attention as the primary building block of vision models, so each spatial location learns a content-dependent receptive field. It is the architectural ancestor of Vision Transformers and the attention-only stacks Google now uses for Image Search, Lens, multimodal grounding, and video understanding.
Patent Overview
- Inventors: Jonathon Shlens, Ashish Vaswani, Prajit Ramachandran, Niki Parmar, Aravind Srinivas, Irwan Bello
- Assignee: Google LLC
- Filed: 2025-06-02
- Granted: Published 2025-09-18
The Challenge
Convolutional neural networks built the first generation of computer vision. Convolutions assume locality and translation equivariance: a fixed kernel slides over the image and aggregates only nearby pixels. That bias is efficient but blind to long-range relationships. A logo in one corner cannot inform classification of a product in the opposite corner without stacking many layers, and even then the receptive field is the same for every image. Vision needed a primitive that could attend globally and adapt its receptive field to image content.
- Convolutions Are Local By Construction — Per layer, a convolution kernel aggregates only a fixed spatial neighborhood. Long-range dependencies require deep stacks.
- Receptive Fields Are Fixed, Not Content-Aware — Per kernel, the receptive field is the same regardless of what is in the image. The model cannot widen its view when the scene demands it.
- Inductive Bias Limits Ceiling — Per architecture, translation equivariance helps on small datasets but caps accuracy on large datasets and complex scenes.
- Vision And Language Used Different Primitives — Per stack, language used attention while vision used convolution. The split blocked unified multimodal architectures.
- Global Context Was Bolted On — Per network, global pooling or non-local blocks were add-ons rather than the core primitive. The patent makes attention the primitive itself.
Innovation
How The System Works
The system replaces each convolution layer with a stand-alone self-attention layer that operates over spatial positions. Per pixel, the model computes a query from that pixel's features and keys and values from a local or global memory block, then takes an attention-weighted sum of the values to produce a content-dependent output. The receptive field is learned per input rather than fixed per kernel.
- Tokenize Spatial Positions — Per image, each spatial location becomes a token with channel features.
- Project Queries, Keys, Values — Per position, learned projections produce Q, K, V vectors.
- Define Attention Memory Block — Per position, a memory neighborhood (local window or full image) supplies the keys and values.
- Compute Attention Weights — Per query, a softmax over the query-key dot products yields content-dependent attention weights.
- Aggregate Values — Per position, the output is a weighted sum of V vectors. The receptive field is learned, not fixed.
- Add Relative Position Encodings — Per attention head, relative position information preserves spatial structure without locking in a fixed kernel.
- Stack Into A Full Vision Backbone — Per network, attention layers replace convolution layers throughout the stem and body, yielding a fully attentional vision model.
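The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patent's implementation: it uses a single head, omits relative position encodings and any score scaling, clips the window at the image border, and every name in it (`local_self_attention`, `w_q`, `w_k`, `w_v`, `window`) is a hypothetical choice made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_self_attention(fmap, w_q, w_k, w_v, window=3):
    """Single-head self-attention over an H x W x C feature map (nested lists).

    Each position attends to a window x window neighborhood, clipped at the
    image border. Illustrative assumption, not the patent's exact layer.
    """
    height, width = len(fmap), len(fmap[0])
    radius = window // 2
    out = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # The query comes from the position itself.
            query = matvec(w_q, fmap[i][j])
            # Memory block: neighbors inside the window and inside the image.
            memory = [fmap[a][b]
                      for a in range(max(0, i - radius), min(height, i + radius + 1))
                      for b in range(max(0, j - radius), min(width, j + radius + 1))]
            keys = [matvec(w_k, x) for x in memory]
            values = [matvec(w_v, x) for x in memory]
            # Content-dependent weights: softmax over query-key dot products.
            scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
            weights = softmax(scores)
            # Output is the attention-weighted sum of the value vectors.
            out[i][j] = [sum(w * v[c] for w, v in zip(weights, values))
                         for c in range(len(values[0]))]
    return out


def random_matrix(rng, rows, cols):
    """Small random projection matrix standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
channels, depth = 4, 4
fmap = [[[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(5)]
        for _ in range(5)]
w_q, w_k, w_v = (random_matrix(rng, depth, channels) for _ in range(3))
out = local_self_attention(fmap, w_q, w_k, w_v, window=3)
print(len(out), len(out[0]), len(out[0][0]))  # 5 5 4
```

Setting `window` large enough to cover the whole feature map turns the same function into global attention, which is the local-versus-full-image choice described in the memory block step.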
Attention Is The Vision Primitive
The load-bearing idea is that self-attention can stand alone as the core vision operator. Not as a global-context add-on on top of convolutions, but as a complete replacement that learns content-dependent receptive fields end to end.
Content-Dependent Receptive Fields
Per spatial location, the model decides what to attend to based on image content. A texture-heavy patch attends locally; a scene with a logo and a caption attends across the whole image.
- Attention Replaces Convolution — Per layer, self-attention is the primary operator. Convolution is no longer required.
- Learned Receptive Fields — Per input, the receptive field adapts to content rather than staying fixed.
- Unified With Language Stacks — Per architecture, vision and language now share the same primitive. Multimodal grounding becomes natural.
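A toy comparison may make the contrast concrete. The sketch below uses made-up two-channel features and a hypothetical helper `attention_weights`; it only shows that the same query weights its neighbors differently depending on what they contain, whereas a convolution kernel applies one fixed set of weights everywhere.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Weights a query assigns to each key: softmax of the dot products."""
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])


# A convolution kernel uses these same weights on every neighborhood.
conv_weights = [0.25, 0.5, 0.25]

# Two hypothetical 3-position neighborhoods (illustrative numbers only).
query = [1.0, 0.0]
patch_a = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # first neighbor matches the query
patch_b = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]  # last neighbor matches the query

weights_a = attention_weights(query, patch_a)
weights_b = attention_weights(query, patch_b)

print("conv (fixed):  ", conv_weights)
print("attention on a:", [round(w, 3) for w in weights_a])  # [0.787, 0.107, 0.107]
print("attention on b:", [round(w, 3) for w in weights_b])  # [0.107, 0.107, 0.787]
```

The attention weights move to wherever the matching content sits, which is the sense in which the receptive field is learned per input.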
Technical Foundation
The patent specifies the spatial tokenizer, attention block, memory neighborhood, relative position encoding, multi-head structure, and the full attentional backbone.
- Spatial Tokenizer — Per image, channel features at each spatial location act as tokens.
- Self-Attention Block — Per token, Q, K, V projections plus softmax aggregation.
- Memory Neighborhood — Per position, a local window or full-image memory supplies keys and values.
- Relative Position Encoding — Per head, spatial offsets enter the attention score to preserve geometry.
- Multi-Head Attention — Per layer, multiple heads attend to different content patterns in parallel.
- Fully Attentional Backbone — Per network, attention layers compose the entire stem and body. Convolution is eliminated.
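One common way to write a single attention head with relative position encodings is shown below. The notation is an assumption introduced here for illustration, not taken from the patent text: $x_{ij}$ is the feature vector at position $(i, j)$, $\mathcal{N}_k(i, j)$ is the memory neighborhood of spatial extent $k$ around it, $W_Q$, $W_K$, $W_V$ are the learned projections, and $r_{a-i,\,b-j}$ is a learned embedding of the row and column offset between the query position and the key position.

```latex
\begin{aligned}
q_{ij} &= W_Q\, x_{ij}, \qquad k_{ab} = W_K\, x_{ab}, \qquad v_{ab} = W_V\, x_{ab}, \\
y_{ij} &= \sum_{(a,b) \in \mathcal{N}_k(i,j)}
          \operatorname{softmax}_{ab}\!\left( q_{ij}^{\top} k_{ab} + q_{ij}^{\top} r_{a-i,\,b-j} \right) v_{ab}.
\end{aligned}
```

The first term in the softmax scores content similarity, and the second lets the spatial offset enter the score. With multiple heads, each head would use its own projections on a slice of the channels, and the per-head outputs $y_{ij}$ would be concatenated.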
The Process
Training and inference follow a standard supervised pipeline with attention layers as the primitive.
- Define Attention Backbone — Per network, stack stand-alone attention layers with relative position encodings.
- Initialize Parameters — Per layer, Q, K, V projections and position encodings are initialized.
- Train On Labeled Imagery — Per batch, classification or detection loss backpropagates through the attention stack.
- Tune Memory Neighborhood — Per layer, the local window size or global span is tuned for compute and accuracy.
- Validate — Per epoch, validation accuracy on classification, detection, or segmentation is tracked.
- Deploy Backbone — Per task, the trained attentional backbone is reused across classification, detection, and segmentation heads.
- Iterate Across Generations — Per generation, the same primitive scales to larger models, higher resolutions, and video.
Quality Control
Stand-alone attention has more flexibility than convolution and therefore needs explicit safeguards on locality, position, and compute.
- Memory Block Sizing — Per layer, the attention neighborhood is sized to balance accuracy and compute.
- Relative Position Discipline — Per head, relative position encodings prevent the model from losing spatial structure.
- Multi-Head Diversity — Per layer, heads are regularized to attend to distinct content patterns.
- Backbone Reuse Across Tasks — Per backbone, the same network is validated on classification, detection, and segmentation to confirm generality.
- Continuous Benchmarking — Per generation, results are compared against convolutional baselines to confirm the attention-only stack matches or beats them.
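The compute side of memory block sizing can be estimated with simple arithmetic. The sketch below counts query-key dot products for one single-head layer under stated assumptions: every position sees a full window (border clipping is ignored), and the 56 x 56 feature map size is an arbitrary example rather than a figure from the patent. The function name `attention_score_count` is hypothetical.

```python
def attention_score_count(height, width, window=None):
    """Query-key dot products in one single-head attention layer.

    window=None means every position attends to the full image.
    Assumes every position sees a full window (border clipping ignored).
    """
    positions = height * width
    memory = positions if window is None else window * window
    return positions * memory


for window in (3, 7, 11, None):
    label = "global" if window is None else f"{window}x{window}"
    print(label, attention_score_count(56, 56, window))
# 3x3 28224
# 7x7 153664
# 11x11 379456
# global 9834496
```

Local windows grow the cost linearly in the number of positions, while global attention grows it quadratically, which is why the neighborhood size is treated as a tuning knob.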
Real-World Application
Fully attentional vision is the architecture that powers modern Google image and video understanding. Image Search, Google Lens, image-pack ranking, object detection in Shopping, and multimodal grounding for Gemini all sit on attention-only vision stacks descended from this patent.
- Attention-only Backbone — Convolution is replaced as the primary operator.
- Content-aware Receptive Field — Each location attends based on image content.
- Unified Across Tasks — One backbone serves classification, detection, and segmentation.
Why Attention Won Vision
Per scaling regime, attention beat convolution once data and compute were large enough. Convolution's inductive bias helped at small scale but capped accuracy at large scale. Attention's flexibility won out as data and model sizes grew.
Why The Stack Is Now Unified
Per modality, attention is the common primitive. Vision, language, audio, and video all share the same operator. Multimodal grounding, the thing that lets Google connect an image to a query to a caption to a page, follows directly.
What This Means for SEO
Google's vision stack reads images holistically. Attention-based vision relates every region of an image to every other region and weighs them by content, so the meaning of a page's imagery is parsed with far more nuance than convolutional models allowed. That changes how images, alt text, and visual assets should be treated.
- Image Understanding Is Now Holistic — Attention vision relates the logo, the product, the caption, and the background of an image to each other. The image's meaning emerges from those relationships, not from any single region. Generic stock photography with no internal relationships reads as low-signal.
- Visual Search And Lens Read Content, Not Pixels — Google Lens, Image Search, and image-pack ranking sit on attention vision. The model is identifying what is in the image and what those things mean together. Product photos, screenshots, and infographics with clear subjects get ranked on substance.
- Alt Text And Captions Are Grounding Signal — Multimodal grounding ties image attention to surrounding text. Alt text, captions, and the paragraph nearest the image enter the same representation space. They are not optional metadata; they are how the image's meaning gets anchored.
- Image-Rich Pages Get Richer Understanding — Pages with original product imagery, charts, and process diagrams give the attention model more to parse. Convolutional models extracted limited features from such images. Attention models extract structured meaning, and that meaning supports ranking.
- Content-Dependent Receptive Fields Punish Generic Imagery — Stand-alone attention adapts what it looks at to what is in the image. A stock photo with no distinct content gives the model nothing distinctive to attend to. Imagery with clear subjects, brand assets, and context-rich composition wins.
- Brand And Product Detection Is Now Default — Object detection over attention backbones is part of the same architecture as classification. Logos, packaging, and product shapes in images are parsed and associated with the page. Visible brand assets in imagery contribute to entity association.
- One Vision Stack Across Every Task — The same attention backbone serves classification, detection, segmentation, and video. Signal from one task informs others. An image identified as a product in detection also lifts classification, captioning, and multimodal grounding for that page.