Optical character recognition framed as statistical machine translation. SEO implication: scanned text (books, PDFs, image-embedded text) is read into the index via SMT-derived OCR, making image-bound content discoverable.
Patent Overview
- Inventor: Ashok Popat, others
- Assignee: Google LLC
- Filed: 2009-07-09
- Granted: Published research; Google patent family active
The Challenge
Classical OCR classifies each glyph in isolation with per-character classifiers, so errors on individual glyphs compound across a line. Statistical machine translation (SMT) has decades of robust modeling for noisy-channel decoding. Reframing OCR as translation from image space to character space lets OCR borrow that modeling power.
- Per-Character Classification Is Brittle — Per character, isolated classification fails on degraded scans, ligatures, and rare glyphs.
- SMT Handles Noisy Channels — Per noisy input, SMT decoding maximizes joint language-model and channel-model probability.
- Image-To-Text Is A Translation Task — Per image patch sequence, decoding to a character sequence mirrors language translation.
- Language Models Anchor Output — Per output sequence, a strong language model rejects implausible character runs.
- Cross-Language OCR Becomes Tractable — Per language, the SMT framework swaps language and channel models without retraining the whole pipeline.
Innovation
How The System Works
The system extracts image features per glyph region, decodes them to character sequences via a translation-style decoder, applies a language model over candidate outputs, and selects the transcription that maximizes the joint channel and language-model probability (a maximum a posteriori, or MAP, decode).
- Segment Image — Per page image, segment into line and glyph-region candidates.
- Extract Image Features — Per glyph region, extract robust image features.
- Channel Model — Per (image feature, character) pair, channel model gives translation probability.
- Language Model — Per candidate character sequence, language model scores fluency.
- Joint Decoding — Per page, decoder finds character sequence maximizing channel times language probability.
- Post-Process — Per output, apply normalization, spell-correction, layout reconstruction.
- Index — Per OCR result, push recognized text into the search index.
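The channel, language-model, and joint-decoding steps above can be sketched as a small Viterbi search. This is a minimal illustration under stated assumptions, not the patent's decoder: the function names, the bigram character model, and all probabilities are hypothetical toy values, and segmentation and feature extraction are assumed to have already produced per-region character likelihoods.

```python
import math


def viterbi_decode(channel_logps, bigram_logp, start_logp):
    """Find the character sequence maximizing channel * language probability.

    channel_logps: one dict per glyph region, {char: log P(features | char)}.
    bigram_logp(prev, cur): log P(cur | prev) from a character language model.
    start_logp(ch): log P(ch) for the first character.
    """
    if not channel_logps:
        return "", 0.0
    # best[ch] = (log score, path) of the best sequence ending in ch
    best = {ch: (start_logp(ch) + lp, [ch]) for ch, lp in channel_logps[0].items()}
    for region in channel_logps[1:]:
        nxt = {}
        for ch, lp in region.items():
            # Best predecessor for ch under the language model
            score, path = max(
                ((s + bigram_logp(prev, ch), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[ch] = (score + lp, path + [ch])
        best = nxt
    score, path = max(best.values(), key=lambda t: t[0])
    return "".join(path), score


# Hypothetical toy models: three glyph regions from a degraded scan of "cat".
BIGRAM = {
    ("c", "a"): 0.4, ("c", "o"): 0.2, ("e", "a"): 0.1, ("e", "o"): 0.05,
    ("a", "t"): 0.3, ("o", "t"): 0.1, ("a", "l"): 0.1, ("o", "l"): 0.1,
}


def bigram_logp(prev, cur):
    return math.log(BIGRAM.get((prev, cur), 0.01))  # 0.01 = toy back-off value


def start_logp(ch):
    return math.log(0.1)  # toy uniform start probability


regions = [
    {"c": math.log(0.6), "e": math.log(0.4)},
    {"a": math.log(0.45), "o": math.log(0.55)},  # visually, "o" looks slightly likelier
    {"t": math.log(0.9), "l": math.log(0.1)},
]

greedy = "".join(max(r, key=r.get) for r in regions)  # channel model alone
decoded, log_score = viterbi_decode(regions, bigram_logp, start_logp)
print(greedy, decoded)
```

On this toy data the channel model alone picks "cot", while the joint decode picks "cat", because the language model outweighs the small visual preference for "o".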
OCR Borrows From Translation
The load-bearing idea is that OCR is structurally a translation problem. SMT's joint decoding plus language modeling transfers cleanly.
Joint Channel + Language Modeling
Per output, the joint score combines visual-channel likelihood and linguistic plausibility.
- Channel Model — Per glyph region, visual features map to character probabilities.
- Language Model — Per character sequence, linguistic priors score fluency.
- Joint Decoding — Per page, the character sequence with the highest joint score is selected (MAP decoding).
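In standard noisy-channel terms, the joint score can be written as below. The symbols are notation chosen here for illustration, not taken from the patent: x stands for the sequence of image features on a page and c for a candidate character sequence.

```latex
\hat{c} \;=\; \arg\max_{c} P(c \mid x)
        \;=\; \arg\max_{c} \underbrace{P(x \mid c)}_{\text{channel model}} \; \underbrace{P(c)}_{\text{language model}}
        \;=\; \arg\max_{c} \bigl[ \log P(x \mid c) + \log P(c) \bigr]
```

The second equality is Bayes' rule with the constant P(x) dropped, which is why the decode is a maximum a posteriori (MAP) estimate rather than a pure likelihood maximization.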
Technical Foundation
The system specifies the image segmenter, feature extractor, channel model, language model, joint decoder, and post-processor.
- Image Segmenter — Per page image, line and glyph segmentation.
- Feature Extractor — Per glyph region, robust visual features.
- Channel Model — Per (image, character), translation probability.
- Language Model — Per character sequence, fluency scoring.
- Joint Decoder — Per page, MAP decoding.
- Post-Processor — Per output, normalization, spell-correction, layout reconstruction.
The Process
Per page image, OCR runs as a translation pipeline.
- Ingest Image — Per scan, image ingested.
- Segment — Per image, line and glyph segmentation.
- Feature Extract — Per region, features extracted.
- Channel Score — Per region, character candidates scored.
- Language Score — Per sequence, language model scored.
- Decode — Per page, joint MAP decode.
- Index — Per OCR text, push to search index.
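The final step, pushing OCR text into a search index, can be illustrated with a toy inverted index. This is a rough sketch only: `ToyIndex`, `run_pipeline`, and the stub recognizer are hypothetical names introduced here, and the segmentation, scoring, and decoding stages are hidden behind a single `recognize` callable rather than implemented.

```python
import re
from collections import defaultdict


class ToyIndex:
    """A minimal in-memory inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        for token in self._tokens(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query token."""
        tokens = self._tokens(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


def run_pipeline(doc_id, page_image, recognize, index):
    """Ingest one scan: recognize its text, then push that text to the index."""
    text = recognize(page_image)  # stands in for segment, score, and decode
    index.add(doc_id, text)
    return text


# Stub recognizer (an assumption): pretends every image says the same thing.
def stub_recognize(page_image):
    return "Annual Report 1998"


index = ToyIndex()
run_pipeline("scan-001", b"\x89PNG...", stub_recognize, index)
print(index.search("annual report"))
```

Once the recognized text is in the index, the scan is retrievable by words that only ever existed as pixels.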
Quality Control
Bad OCR pollutes the search index. The patent family specifies safeguards.
- Confidence Thresholding — Per character, low-confidence outputs flagged for review.
- Language Model Recalibration — Per language, LM refreshed on representative corpora.
- Channel Recalibration — Per scanner / font / era, channel model recalibrated.
- Layout Validation — Per page, layout reconstruction validated.
- Ground-Truth Audits — Per sample, OCR output checked against ground truth.
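Two of the safeguards above, confidence thresholding and ground-truth audits, can be sketched in a few lines. The 0.8 threshold, the function names, and the use of character error rate (edit distance divided by reference length, a common OCR metric) are assumptions made for illustration; the source does not specify them.

```python
def flag_low_confidence(chars, confidences, threshold=0.8):
    """Return (position, char, confidence) for characters below the threshold."""
    return [
        (i, ch, p)
        for i, (ch, p) in enumerate(zip(chars, confidences))
        if p < threshold
    ]


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(hypothesis, reference):
    """Edit distance from OCR output to ground truth, per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


flags = flag_low_confidence("cot", [0.95, 0.55, 0.90])
cer = character_error_rate("modem times", "modern times")
print(flags, round(cer, 3))
```

Here the classic "rn" read as "m" confusion costs two edits against a 12-character reference, and the uncertain middle character of "cot" is flagged for review.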
Real-World Application
Translation-inspired OCR underpins Google Books indexing, Lens text capture, Drive PDF search, and any image-text retrieval surface.
- SMT-derived OCR Methodology — Per page, decoded via channel + language model.
- Per-language Model Coverage — Per language, channel + LM swapped.
- Indexable Output — Per OCR text, pushed into search index.
Why Image-Embedded Text Is Indexable
Per scan, SMT-derived OCR converts image bytes to indexable characters. Image-bound text becomes searchable content.
Why Multi-Language OCR Scales
Per language, the SMT framework reuses the decoding scaffold and swaps models. Coverage scales without architectural rewrites.
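The model swap can be pictured as a registry of per-language model pairs behind one unchanged decode function. Everything in this sketch is hypothetical: `ModelPack`, `decode`, the flat channel model, and the toy probabilities are illustrative stand-ins, not the system's actual interfaces.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ModelPack:
    """The two swappable models for one language."""
    channel_logp: Callable[[str, str], float]  # (observation id, text) -> log-likelihood
    lm_logp: Callable[[str], float]            # text -> log-prior


def decode(observation: str, candidates: List[str], pack: ModelPack) -> str:
    """Language-independent scaffold: pick the candidate with the best joint score."""
    return max(
        candidates,
        key=lambda text: pack.channel_logp(observation, text) + pack.lm_logp(text),
    )


def flat_channel(observation: str, text: str) -> float:
    return 0.0  # toy assumption: the image evidence cannot separate the candidates


# Toy language models with made-up probabilities.
PACKS: Dict[str, ModelPack] = {
    "en": ModelPack(flat_channel, lambda t: math.log({"that": 0.9, "chat": 0.1}[t])),
    "fr": ModelPack(flat_channel, lambda t: math.log({"that": 0.01, "chat": 0.99}[t])),
}

candidates = ["chat", "that"]
english_result = decode("blurred-region-17", candidates, PACKS["en"])
french_result = decode("blurred-region-17", candidates, PACKS["fr"])
print(english_result, french_result)
```

The same ambiguous region decodes to "that" with the English pack and "chat" with the French one, while `decode` itself never changes.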
What This Means for SEO
Image-bound text is not invisible to Google. Translation-inspired OCR transcribes scanned documents, image-embedded text, and infographic content into the index. SEO implication: text in your images counts, and text in others' images about you also counts.
- Image-Embedded Text Gets Indexed — Text inside JPGs, PNGs, scanned PDFs, and infographics is OCR'd into the search index. Critical text inside images becomes discoverable, not invisible.
- Don't Bury Important Text In Images — Image-only text loses HTML-level signals (heading hierarchy, semantic markup, anchor potential). Keep load-bearing text as HTML; use images for the visual content.
- Alt Text Plus OCR Compound — Alt text and image-content OCR are independent signals that both feed the index. Strong alt text plus legible image text together describe the asset more completely.
- Infographics Are Searchable Content — An infographic's labels, statistics, and captions are OCR'd. Well-labeled infographics with legible typography earn discovery beyond their visual appeal.
- Scanned PDFs Are Crawled As Text — Old documents, manuals, and book scans become first-class search results once OCR'd. The Google Books and Drive search experiences both lean on this pipeline.
- Language Models Reject Garbage — The language-model layer rejects implausible OCR outputs. Stylized fonts that confuse OCR may not surface garbled text, but they may not surface useful text either. Legible typography wins.
- Image SEO Is Multi-Channel — Filename, alt text, surrounding caption, OCR'd image content, structured-data references, and ViT-style image embedding all feed the system together. Image SEO is the sum of these channels, not any single one.