A single neural network trained jointly across multiple tasks and multiple modalities including images, audio, and text. The architectural ancestor of MUM and Gemini, and the foundation for unified multimodal search.
Patent Overview
- Inventors: Noam Shazeer, Ashish Vaswani, Aidan Gomez, Lukasz Kaiser, Niki Parmar, Jakob Uszkoreit, Llion Jones, Illia Polosukhin
- Assignee: Google LLC
- Filed: 2019-11-19
- Granted: Published 2020-03-19
The Challenge
Conventional machine learning systems train one model per task and one model per modality. Speech models do not help translation models; image models do not help parsing models. The challenge: build a single model that handles many tasks across many modalities and lets each task benefit from the others.
- One Model Per Task Is Wasteful — Per task, a separate model duplicates representation work that other tasks already solved.
- Modalities Are Siloed — Per modality, separate pipelines prevent any transfer between image, audio, and text.
- Low-Data Tasks Underperform — Per task, scarce training data limits quality with no way to borrow from data-rich tasks.
- Heterogeneous Inputs And Outputs — Per task, inputs and outputs differ in shape, scale, and vocabulary. A single body must accept all.
- Capacity Versus Specialization — Per task, dedicating capacity to specialization is expensive; sharing capacity risks interference.
Innovation
How The System Works
The system defines modality-specific input and output nets that translate each modality into a shared representation space, then routes everything through a single shared body composed of convolutional, attention, and sparsely-gated mixture-of-experts blocks. The same shared parameters serve every task.
- Define Modality Nets — Per modality, an input net maps raw audio, image, or text into a shared embedding.
- Define Shared Body — Per architecture, a single body of conv, attention, and MoE blocks processes all modalities.
- Define Output Modality Nets — Per modality, an output net maps shared representations back into task-specific outputs.
- Mix Block Types — Per layer, conv handles local structure, attention handles long-range, MoE provides sparse capacity.
- Joint Multi-Task Training — Per training step, examples from all tasks update the same shared body parameters.
- Cross-Task Transfer — Per task, gradients from data-rich tasks improve representations used by data-scarce tasks.
- Single Model Serves All — Per inference, the same weights handle ImageNet, speech, translation, captioning, and parsing.
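The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the class name `MultiModelSketch`, the helper functions, the feature sizes, and the single linear layer standing in for the conv, attention, and MoE blocks are all assumptions made for the example.

```python
import random

DIM = 4  # width of the shared representation (illustrative value)

def make_linear(n_in, n_out, rng):
    """Return a random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

class MultiModelSketch:
    """Hypothetical toy: per-modality adapters around one shared body."""

    def __init__(self, input_dims, output_dims, seed=0):
        rng = random.Random(seed)
        # One input net per modality: raw features -> shared space.
        self.input_nets = {m: make_linear(d, DIM, rng) for m, d in input_dims.items()}
        # One shared body reused by every task. A single layer stands in
        # for the stack of conv, attention, and MoE blocks.
        self.body = make_linear(DIM, DIM, rng)
        # One output net per output modality: shared space -> task output.
        self.output_nets = {m: make_linear(DIM, d, rng) for m, d in output_dims.items()}

    def forward(self, in_modality, out_modality, raw):
        shared = apply_linear(self.input_nets[in_modality], raw)
        hidden = [max(0.0, h) for h in apply_linear(self.body, shared)]  # ReLU
        return apply_linear(self.output_nets[out_modality], hidden)

model = MultiModelSketch(
    input_dims={"image": 6, "audio": 3, "text": 5},
    output_dims={"class": 10, "text": 5},
)
logits = model.forward("image", "class", [0.1] * 6)     # classification-style output
caption = model.forward("image", "text", [0.1] * 6)     # captioning-style output
transcript = model.forward("audio", "text", [0.2] * 3)  # speech-style output
```

All three calls pass through the same `model.body` weights; only the adapters on either side change with the modality.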
One Network, Many Tasks, Many Modalities
The patent's load-bearing idea is that a single shared body with modality-specific input and output nets can be trained jointly on tasks as different as image classification and machine translation, and that joint training transfers benefit between them.
Unified Representation Space
Each modality's input net produces tokens in a shared space, and each task's output net reads from that same space. The body in between is modality-agnostic.
- Modality Nets — Per modality, input and output adapters.
- Mixed Block Body — Per layer, conv plus attention plus MoE.
- Joint Training Transfer — Per task, gradients flow through shared body.
Technical Foundation
The patent specifies modality input nets, modality output nets, a shared body, mixed block types, joint training, and the multi-task loss.
- Modality Input Nets — Per modality, raw input is encoded into the shared space.
- Modality Output Nets — Per modality, shared representations decode into task outputs.
- Shared Body — Per model, one body processes all tasks and modalities.
- Mixed Block Types — Per layer, conv plus attention plus sparsely-gated MoE.
- Joint Multi-Task Loss — Per training step, losses from all tasks combine.
- Sparsely-Gated Experts — Per token, MoE gating activates specialized experts without ballooning compute.
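To make the sparse-gating idea concrete, here is a rough sketch of top-k expert routing for a single token. It assumes the gate logits are already given; in a real layer they would come from a learned gating network, which this summary does not detail. The function names and the toy experts are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k):
    """Keep the k largest logits, softmax over them, and zero out the rest."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    probs = softmax([gate_logits[i] for i in top])
    gates = [0.0] * len(gate_logits)
    for i, p in zip(top, probs):
        gates[i] = p
    return gates

def moe_layer(token, experts, gate_logits, k=2):
    """Combine the outputs of the k selected experts, weighted by their gates."""
    gates = top_k_gate(gate_logits, k)
    out = [0.0] * len(token)
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # unselected experts are never evaluated
        out = [o + gate * v for o, v in zip(out, expert(token))]
    return out, gates

# Four toy experts that each scale the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, gates = moe_layer([1.0, -1.0], experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

With `k=2`, only two of the four experts run for this token, so compute per token stays roughly constant as the number of experts grows.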
The Process
Training mixes examples from every task; inference routes inputs through the appropriate modality net but always through the same body.
- Define Task Suite — Per deployment, tasks across modalities are selected.
- Instantiate Modality Nets — Per modality, input and output adapters are built.
- Build Shared Body — Per architecture, conv plus attention plus MoE blocks are stacked.
- Mix Training Batches — Per batch, examples from multiple tasks are mixed.
- Backprop Through Body — Per step, gradients from every task update shared parameters.
- Evaluate Per Task — Per task, validation tracks quality including low-data tasks.
- Deploy Single Model — Per inference, the same weights serve every task and modality.
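The steps above can be sketched with a deliberately tiny model. In this toy, a single shared slope stands in for the shared body and a per-task bias stands in for each output net; the two tasks, their data, and every name here are made up for illustration and are not taken from the patent.

```python
import random

# Toy task suite (made-up data): both tasks share the slope 2.0
# but have different offsets, and one task has far fewer examples.
TASKS = {
    "high_data": [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-10, 11)],
    "low_data": [(-0.5, -2.0), (0.5, 0.0)],
}

def predict(params, task, x):
    """Shared slope (the 'body') plus a task-specific bias (the 'output net')."""
    return params["shared_w"] * x + params["bias"][task]

def train(params, steps=1000, batch_size=4, lr=0.1, seed=0):
    rng = random.Random(seed)
    names = list(TASKS)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = {name: 0.0 for name in names}
        # Mix the batch: every example may come from a different task.
        for _ in range(batch_size):
            task = rng.choice(names)
            x, y = rng.choice(TASKS[task])
            err = predict(params, task, x) - y
            grad_w += err * x / batch_size    # every task updates the shared slope
            grad_b[task] += err / batch_size  # only its own task updates each bias
        params["shared_w"] -= lr * grad_w
        for name in names:
            params["bias"][name] -= lr * grad_b[name]
    return params

def evaluate(params, task):
    """Per-task mean squared error."""
    data = TASKS[task]
    return sum((predict(params, task, x) - y) ** 2 for x, y in data) / len(data)

params = train({"shared_w": 0.0, "bias": {name: 0.0 for name in TASKS}})
per_task_mse = {name: evaluate(params, name) for name in TASKS}
```

The low-data task has only two examples, yet its slope is learned mostly from the high-data task's gradients, which is the transfer effect the process is designed to produce.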
Quality Control
Joint training across heterogeneous tasks creates interference risks. The patent specifies safeguards.
- Task Balancing — Per training, task sampling ratios prevent dominant tasks from crowding out others.
- Modality Isolation — Per modality, dedicated input and output nets prevent representational collisions.
- MoE Load Balancing — Per expert, auxiliary losses keep capacity from collapsing onto a few experts.
- Transfer Verification — Per task, ablations confirm that joint training helps rather than hurts.
- Per-Task Metrics — Per task, quality is tracked independently across the full task suite.
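Two of these safeguards lend themselves to a short numerical sketch. The first function applies temperature-scaled task sampling, a common balancing heuristic that is assumed here rather than quoted from the patent. The second computes a squared coefficient of variation over per-expert gate totals, one widely used form of MoE load-balancing penalty; the exact loss in the patent may differ. Function names and example counts are hypothetical.

```python
def task_sampling_probs(example_counts, temperature=2.0):
    """Sampling probability per task, proportional to count ** (1 / temperature).

    temperature=1.0 samples in proportion to dataset size; larger values
    flatten the distribution so small tasks are seen more often.
    """
    scaled = {task: n ** (1.0 / temperature) for task, n in example_counts.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

def importance_cv_squared(gates_per_token):
    """Squared coefficient of variation of per-expert gate totals.

    Returns 0.0 when all experts receive equal total gate weight and grows
    as routing collapses onto fewer experts. Uses the population variance.
    """
    n_experts = len(gates_per_token[0])
    importance = [sum(g[e] for g in gates_per_token) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    variance = sum((v - mean) ** 2 for v in importance) / n_experts
    return variance / (mean ** 2)

# Made-up dataset sizes for a large and a small task.
probs = task_sampling_probs({"large_task": 10000, "small_task": 100}, temperature=2.0)

balanced = importance_cv_squared([[0.5, 0.5], [0.5, 0.5]])   # even routing
collapsed = importance_cv_squared([[1.0, 0.0], [1.0, 0.0]])  # one expert takes all
```

In training, a penalty like `importance_cv_squared` would be scaled by a small weight and added to the task loss, so the gate is nudged toward spreading tokens across experts.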
Real-World Application
MultiModel demonstrates that one network can simultaneously handle ImageNet image classification, WSJ speech recognition, English-German and English-French translation in both directions, image captioning on COCO, and parsing on Penn Treebank, eight tasks in total. Low-data tasks like parsing improve when trained jointly with high-data tasks like ImageNet.
- Joint Training Scope: 8 tasks. Image, audio, and text tasks share one body.
- Unified Architecture: 1 model. The same parameters serve every modality.
- Quality Pattern: cross-modal transfer. High-data tasks lift low-data tasks.
Why Unified Models Win Long-Term
A unified model amortizes representation learning across its whole task suite. Each new task compounds the value of the shared body instead of starting from scratch.
Why This Is The Ancestor Of MUM And Gemini
The unified multimodal pattern scales with each model generation, and MUM and Gemini are direct descendants of it. The 2017 MultiModel architecture is the load-bearing prior art for the multimodal search era.
What This Means for SEO
MultiModel establishes that a single network can rank, understand, and generate across text, image, audio, and structured signals. Modern AI search is built on this pattern, so SEO must be planned around a ranker that sees every modality at once.
- Signals Transfer Across Modalities — A unified model means improvements to image understanding raise the bar for text understanding and vice versa. Optimizing one surface implicitly optimizes adjacent surfaces, so neglecting image alt, video transcripts, or schema leaves cross-modal lift on the table.
- One Representation Space For All Content — Text, image, audio, and structured data share a common embedding space inside the model. A page is judged by its full multimodal footprint, not by text alone, so coherence across formats becomes a ranking input rather than a stylistic choice.
- Cross-Task Quality Compounds — Joint training means advances in translation, speech, and vision propagate into ranking quality. The ranker keeps getting smarter from work in adjacent fields, so the quality bar moves even when the search team ships nothing new.
- The MUM And Gemini Lineage — This patent is the architectural ancestor of MUM, Gemini, and the AI Overviews stack. Strategies built for unified multimodal models age better than strategies built for the pre-2017 single-modality ranker.
- Multi-Format Content Compounds — A single piece that serves text, image, and audio queries is evaluated by the same model across all three. Multi-format publishing is not three separate plays but one compounding asset inside the unified ranker.
- Sparse Experts Reward Specialization — The sparsely-gated MoE blocks route specialized inputs to specialized experts without ballooning cost. Deep, narrowly specialized content reaches specialized expert capacity, which explains why niche depth still wins at scale.
- Quality Signals Flow Into AI Surfaces — Because foundation models are trained jointly, SEO quality signals propagate into AI Overviews, SGE, and Gemini-based answer surfaces. There is no separate optimization track for generative search; the same content quality drives both ranking and synthesis.