Ashish Vaswani's HaloNet patent. Replaces global self-attention with blocked local attention plus a haloed neighborhood, making attention-based vision backbones tractable at production scale and unlocking Transformer ranking for images and long documents.
Patent Overview
- Inventors
- Ashish Vaswani, Niki Parmar, Prajit Ramachandran, Aravind Srinivas, Blake Hechtman, Jonathon Shlens
- Assignee
- Google LLC
- Filed
- 2021-06-14
- Published
- 2021-12-16
The Challenge
Global self-attention compares every position to every other position. For an image of H by W pixels, the number of comparisons grows as (H × W)², so doubling each side of the image multiplies the cost by 16. The cost becomes prohibitive well before image resolution reaches practical sizes. Convolutions scale linearly, but they lack the content-adaptive routing that attention provides. The problem is keeping attention's expressiveness while paying a linear, not quadratic, bill.
- Global Attention Is Quadratic In Sequence Length — Per layer, cost grows with the square of the number of positions. Images and long documents become infeasible.
- Convolutions Scale But Lack Content Routing — Per filter, weights are fixed once trained and applied identically at every position. Convolution cost is linear in input size, but the filters cannot route information based on what the content actually is.
- Naive Local Windows Lose Cross-Block Context — Per block, isolated windows cannot see neighbors. Objects and phrases that cross window boundaries get cut in half.
- Vision Backbones Need Both Locality And Scale — Per backbone, the system needs receptive fields that grow with depth, like CNNs, without paying global attention cost.
- Production Inference Demands Linear Cost — Per query at serving time, cost must stay linear in input size for ranking and visual search to be economically viable.
Innovation
How The System Works
The system partitions the input into non-overlapping blocks. Each query position attends only within its block plus a surrounding halo of neighbor pixels. The halo is the overlap zone that prevents block boundaries from cutting context. Stacking these layers gives growing receptive fields at linear cost, producing a vision backbone (HaloNet) where blocked local attention replaces convolution.
- Partition Input Into Blocks — Per input, divide the feature map into non-overlapping query blocks of size b by b.
- Define A Halo Around Each Block — Per block, extend the memory region by h pixels on each side. The halo is the context that surrounds the block.
- Compute Local Self-Attention — Per query position, attend to keys and values inside the block-plus-halo neighborhood only.
- Add Relative Position Encodings — Per attention head, use relative position biases so spatial structure is preserved inside each neighborhood.
- Stack Layers For Growing Receptive Field — Per layer, receptive field grows. Deeper layers see effectively wider context through composition.
- Downsample Between Stages — Per stage, strided attention or pooling reduces resolution and grows semantic abstraction, like a CNN backbone.
- Produce Backbone Features — Per input, the network outputs a feature pyramid usable for classification, detection, or ranking heads.
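The partition, halo, and local attention steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the patent's implementation: the function name `halo_attention`, the projection matrices `wq`, `wk`, `wv`, the zero padding at image borders, and the masking of padded positions are all assumptions made for the example, and relative position terms are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def halo_attention(x, wq, wk, wv, block=4, halo=1):
    """Single-head blocked local self-attention with a halo (illustrative).

    x: (H, W, C) feature map; H and W must be multiples of `block`.
    wq, wk, wv: (C, D) projection matrices (hypothetical names).
    Returns an (H, W, D) array.
    """
    H, W, _ = x.shape
    assert H % block == 0 and W % block == 0, "feature map must tile into blocks"
    D = wq.shape[1]
    win = block + 2 * halo                      # side of the block-plus-halo window

    q = x @ wq                                  # queries come from the unpadded map
    xp = np.pad(x, ((halo, halo), (halo, halo), (0, 0)))  # zero-pad the borders
    k, v = xp @ wk, xp @ wv                     # keys and values cover the halo
    valid = np.pad(np.ones((H, W), dtype=bool), halo)     # False on padded cells

    out = np.empty((H, W, D))
    for i in range(0, H, block):
        for j in range(0, W, block):
            qb = q[i:i + block, j:j + block].reshape(-1, D)   # (block*block, D)
            kb = k[i:i + win, j:j + win].reshape(-1, D)       # (win*win, D)
            vb = v[i:i + win, j:j + win].reshape(-1, D)
            mb = valid[i:i + win, j:j + win].reshape(-1)
            logits = qb @ kb.T / np.sqrt(D)
            logits = np.where(mb[None, :], logits, -np.inf)   # ignore padding
            out[i:i + block, j:j + block] = (softmax(logits) @ vb).reshape(block, block, D)
    return out
```

Each query only ever sees `win * win` keys, regardless of how large the feature map is. Setting `block` to the full map size with `halo=0` collapses the sketch back to ordinary global attention, which is a convenient sanity check.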
Blocks Plus Halos Make Attention Linear
The patent's load-bearing idea is the haloed block. Pure block attention is too brittle at boundaries. A halo of overlapping context restores cross-block information flow while keeping cost linear in input size, not quadratic.
Locality With Overlap
Per query, attention is local. Per block, the halo overlaps with neighbors so context leaks across boundaries instead of being cut.
- Blocked Attention — Per block, attention is contained to a local neighborhood.
- Halo Overlap — Per boundary, halos carry context across block edges.
- Linear Cost — Per layer, compute grows linearly with input size, not quadratically.
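The linear-versus-quadratic claim can be checked with simple counting. The sketch below counts query-key comparisons per layer, assuming a square block of side `block` and a halo of `halo` pixels on every side; the function names and the sizes used (block 8, halo 3) are illustrative choices, not values taken from the patent.

```python
def global_attention_pairs(height, width):
    # Every position attends to every position: (H * W) ** 2 comparisons.
    n = height * width
    return n * n

def halo_attention_pairs(height, width, block, halo):
    # Every position attends to one block-plus-halo window of fixed size,
    # so the count is (H * W) * (block + 2 * halo) ** 2.
    n = height * width
    window = (block + 2 * halo) ** 2
    return n * window

for side in (32, 64, 128):
    g = global_attention_pairs(side, side)
    loc = halo_attention_pairs(side, side, block=8, halo=3)
    print(f"{side}x{side}: global={g:,} local={loc:,} ratio={g / loc:.1f}")
```

Doubling the side of the feature map quadruples the number of positions. The local count grows by the same factor of 4, while the global count grows by 16, so the gap widens with resolution.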
Technical Foundation
The patent specifies the block partitioner, halo extender, local attention module, relative position encoder, stage downsampler, and backbone composer.
- Block Partitioner — Per input, splits the feature map into non-overlapping query blocks.
- Halo Extender — Per block, gathers neighboring pixels into the memory region.
- Local Attention Module — Per query, attends inside block-plus-halo only.
- Relative Position Encoder — Per head, encodes spatial offsets so geometry is preserved.
- Stage Downsampler — Per stage, reduces resolution and grows channel depth.
- Backbone Composer — Per architecture, stacks stages into a vision backbone (HaloNet).
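Of the components listed above, the relative position encoder is the least obvious to picture. One possible reading is sketched below: a learned table holds one value per spatial offset between a query in the block and a key in the block-plus-halo window, and the looked-up values are added to the attention logits before the softmax. The factorisation into separate row and column tables, and the names `relative_offsets`, `relative_bias`, `table_y`, and `table_x`, are assumptions for illustration only.

```python
import numpy as np

def relative_offsets(block, halo):
    """Offsets (key minus query) along one axis, shape (block, block + 2*halo)."""
    q = np.arange(block)                    # query coordinates inside the block
    k = np.arange(-halo, block + halo)      # key coordinates across block plus halo
    return k[None, :] - q[:, None]

def relative_bias(table_y, table_x, block, halo):
    """Additive bias of shape (block*block, win*win), with win = block + 2*halo.

    table_y, table_x: 1-D arrays of length 2 * (block + halo) - 1, one entry
    per possible offset along that axis (hypothetical parameterisation).
    """
    win = block + 2 * halo
    idx = relative_offsets(block, halo) + (block + halo - 1)  # shift to 0-based
    by = table_y[idx]                       # (block, win) row-offset terms
    bx = table_x[idx]                       # (block, win) column-offset terms
    # bias[qy, qx, ky, kx] = by[qy, ky] + bx[qx, kx]
    bias = by[:, None, :, None] + bx[None, :, None, :]
    return bias.reshape(block * block, win * win)
```

Offsets along each axis run from `-(block + halo - 1)` to `block + halo - 1`, which is why each table needs `2 * (block + halo) - 1` entries to cover the full block-plus-halo range. Because the lookup depends only on the offset, two query-key pairs with the same displacement share the same bias wherever they sit in the block.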
The Process
Training and inference exploit local attention's linear cost to scale vision Transformers to ImageNet-class accuracy and beyond.
- Configure Block And Halo Size — Per stage, pick b and h to balance receptive field against compute.
- Initialize Backbone — Per architecture, initialize attention weights and relative position tables.
- Train On Labeled Images — Per epoch, optimize classification or pretext loss end to end.
- Validate Across Resolutions — Per resolution, confirm linear cost and accuracy scaling.
- Deploy Backbone — Per task, attach detection, segmentation, or ranking heads.
- Serve At Production Cost — Per inference, linear cost makes serving feasible at scale.
- Iterate On Block Geometry — Per generation, refine block and halo sizing for better accuracy-cost trade-offs.
Quality Control
Local attention requires care at boundaries and across stages. The patent specifies safeguards.
- Halo Sizing — Per stage, halo must be large enough to bridge block boundaries without bloating cost.
- Relative Position Consistency — Per head, relative position tables cover the full block-plus-halo range.
- Receptive Field Audit — Per depth, verify effective receptive field grows enough for the task.
- Cross-Stage Alignment — Per stage transition, block geometry aligns with downsampling factor.
- Accuracy-Cost Sweep — Per configuration, sweep block and halo sizes to find efficient operating points.
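Several of these safeguards reduce to arithmetic on the stage configuration. The sketch below is one hedged way to express the cross-stage alignment check and a receptive field audit; `StageConfig`, `check_alignment`, and `min_receptive_field` are hypothetical names, and the receptive field figure is a conservative lower bound (each layer lets a query reach at least `halo` positions further on each side), not an exact measurement.

```python
from typing import List, NamedTuple

class StageConfig(NamedTuple):
    feature_size: int   # side length of the (square) feature map at this stage
    block: int          # query block side
    halo: int           # halo width on each side of the block
    layers: int         # number of local attention layers in the stage

def check_alignment(stages: List[StageConfig]) -> List[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems = []
    for n, s in enumerate(stages):
        if s.feature_size % s.block != 0:
            problems.append(
                f"stage {n}: feature size {s.feature_size} "
                f"is not divisible by block size {s.block}"
            )
        if s.halo < 1:
            problems.append(f"stage {n}: halo must be at least 1 to bridge block edges")
    return problems

def min_receptive_field(stage: StageConfig) -> int:
    """Lower bound on receptive field width after all layers of one stage.

    Every query sees at least `halo` positions beyond itself on each side,
    so stacking layers widens the field by at least 2 * halo per layer.
    The result is capped at the feature map size.
    """
    return min(stage.feature_size, 1 + 2 * stage.halo * stage.layers)
```

A sweep over block and halo sizes can call `check_alignment` first to discard configurations that do not tile cleanly after downsampling, then compare `min_receptive_field` against the object scale the task needs.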
Real-World Application
HaloNet demonstrates that attention-based vision backbones can match and beat ConvNets at comparable compute. Closely related windowed local-attention designs appear in the Swin Transformer family and in long-context language models, and the same primitive is a natural fit for multimodal ranking at Google.
- Scaling Property: Linear cost — Compute grows with input size, not its square.
- Core Primitive: Block + halo — Local attention with overlapping context.
- Application Surface: Vision + language — Same primitive scales images and long documents.
Why Local Attention Won The Scaling Race
Per generation, models that scale linearly out-iterate models that scale quadratically. Local attention is the primitive that made Transformer vision and long-context language feasible at production cost.
Why The Halo Is The Clever Bit
Per boundary, naive block attention loses information. The halo is the small overlap that preserves cross-block context for nearly free. It is the design choice that makes the whole approach work.
What This Means for SEO
HaloNet is the scaling primitive that lets Google apply Transformer-based understanding to images and long documents at production cost. It is the reason visual search, image-rich SERP features, and long-form ranking can use attention-based models in the first place.
- Long Documents Now Get Attention Treatment — Local attention scales linearly, so production ranking can run Transformer-style understanding across full long-form pages. Depth and structure across a long article now influence ranking instead of being truncated.
- Visual Search Runs On Vision Transformers — Google Lens, image packs, and shopping visual search use vision Transformer backbones built on this local-attention primitive. Image quality, composition, and on-image text are read by attention-based models, not legacy CNN classifiers.
- Local Windows Mean Nearby Context Counts — Each query position attends within its block plus halo. For SEO, this means surrounding paragraphs, nearby headings, and adjacent on-page elements directly shape how a sentence is interpreted. Context must be local and coherent, not just present somewhere on the page.
- Halos Reward Section Cohesion — The halo is overlap context across block boundaries. Content where sections flow into each other coherently lets signal leak in the right direction. Disjointed sections, by contrast, get cut at boundaries even with overlap.
- Image Neighborhoods Feed Cleaner Local Windows — Vision Transformers process images block by block with haloed context. Images surrounded by relevant alt text, captions, and adjacent body copy feed cleaner local windows. Stock images dropped into unrelated sections produce noisy windows.
- Production Ranking Is Transformer-Native — Without local attention, attention-based ranking would be infeasible at Google scale. This patent is the production mechanism that makes it tractable. Assume the system reading your content is a real Transformer at full capacity, not a downgraded approximation.
- Visual Assets Are Now First-Class Ranking Inputs — Product imagery, infographics, video thumbnails, and diagrams are read by vision Transformers at production scale. Visual content quality, clarity, and alignment with surrounding text feed the same ranking pipeline as words. Visual asset SEO is no longer a side concern.