Extracts structured information from form-like documents (invoices, receipts, tax forms, applications) using ML models that combine layout features with text content, turning unstructured document scans into queryable data.
Patent Overview
- Inventor: Marc Najork, others
- Assignee: Google LLC
- Filed: 2019-05-21
- Granted: 2022-07-19
- Application Number: US 16/418,690
The Challenge
Forms carry their meaning through layout, not just text. An invoice's total sits in a specific spatial relationship to the rest of the document; a receipt's date occupies a conventional position; a tax form places cells at specific grid positions. Pure text extraction misses the structure, and purely visual OCR misses the semantics.
- Layout Carries Meaning Forms Do Not State — An invoice's total appears below or to the right of line items. A date appears in a corner. Forms communicate through positional convention; text alone discards this signal.
- Per-Form Templates Do Not Scale — Hand-coding templates for every form variant in the world is impossible. Forms vary wildly across vendors, regions, and document types.
- ML Models Need Both Text And Layout — Modern ML can fuse text content with visual layout features. Models that read both produce extraction quality neither modality could reach alone.
- Form Vocabulary Is Bounded But Diverse — Form fields share types (date, amount, person name, address) but appear under different surface labels across vendors. The model must generalize across phrasings.
- Output Must Be Structured — Downstream consumers need structured records: typed fields, validated values, source-location traceability. Free-text output is insufficient.
Innovation
How The System Works
The patent describes a neural model that fuses text-content features with layout features and is trained on labeled forms from many vendor types. The model identifies and extracts typed fields with confidence scores, the extracted values are validated against type constraints, and the output is a structured record that maps form fields to a canonical schema.
- Ingest Document Image And OCR — The document image enters the pipeline. OCR extracts text plus per-token bounding boxes that capture spatial position.
- Build Joint Text-Layout Representation — Per token, combine text embedding with layout features (position, neighboring tokens, page region). The joint representation captures both modalities.
- Identify Candidate Field Spans — Neural model scans the document for candidate field spans: 'this token sequence might be the invoice total'. Candidates carry type predictions and confidence.
- Classify Field Types — Per candidate, classify into canonical field types (invoice number, date, amount, party name, line item, tax). Type classification handles vendor variation in surface labeling.
- Validate Values Against Type — Extracted values validate against type constraints: dates parse to valid dates, amounts to numbers, names to plausible name patterns. Invalid extractions are flagged or suppressed.
- Resolve Ambiguities Across Fields — Multi-field extractions may conflict: two candidate totals, ambiguous date positions. Cross-field constraints (total = sum of line items) resolve ambiguities.
- Output Structured Record — Final output is a typed structured record with source-location traceability. Downstream systems can query, validate, and audit.
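As a rough illustration of the joint representation step above, the sketch below builds one feature vector per OCR token by concatenating a few text cues with normalized bounding-box geometry. The `OcrToken` type, the chosen features, and the function names are assumptions made for this example. The summary above does not specify them, and a production model would use a learned text embedding rather than hand-written cues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrToken:
    """One OCR token with its bounding box in page pixels (hypothetical type)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


CURRENCY_SYMBOLS = ("$", "€", "£")  # assumed set, not exhaustive


def layout_features(tok: OcrToken, page_w: float, page_h: float) -> list[float]:
    """Normalized centre, width and height, so features do not depend on page size."""
    cx = (tok.x0 + tok.x1) / 2 / page_w
    cy = (tok.y0 + tok.y1) / 2 / page_h
    w = (tok.x1 - tok.x0) / page_w
    h = (tok.y1 - tok.y0) / page_h
    return [cx, cy, w, h]


def text_features(tok: OcrToken) -> list[float]:
    """Toy stand-in for a learned text embedding."""
    digits = tok.text.replace(",", "").replace(".", "").lstrip("$€£")
    return [
        1.0 if digits.isdigit() else 0.0,                   # looks numeric
        1.0 if tok.text[:1] in CURRENCY_SYMBOLS else 0.0,   # starts with a currency symbol
        min(len(tok.text), 20) / 20,                        # capped, scaled length
    ]


def joint_features(tok: OcrToken, page_w: float, page_h: float) -> list[float]:
    """Concatenate text and layout features into one per-token vector."""
    return text_features(tok) + layout_features(tok, page_w, page_h)
```

Calling `joint_features(OcrToken("$1,250.00", 800, 1000, 900, 1020), 1000, 2000)` yields a seven-element vector whose last four entries place the token in the lower right of the page. Position like this is the kind of signal a downstream model could use to favor the token as an invoice total.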
Joint Text-Plus-Layout Extraction
The patent's load-bearing idea is to fuse text content with layout features in one ML model rather than processing them separately. Layout is treated as a first-class feature alongside text, capturing the spatial conventions forms use to communicate.
Forms Communicate Through Position
An invoice total is meaningful because of where it sits, not just what it says. Extraction models that read position alongside text capture this convention.
- Joint Modality Embedding — Text plus layout plus visual features combine into one representation. The model sees position and content together.
- Type-Aware Classification — Per candidate, type classification maps to canonical field types. Vendor surface labeling is normalized to schema.
- Cross-Field Validation — Constraints across fields (total = sum of line items, date plausibility) resolve ambiguities and validate extractions.
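The cross-field idea can be sketched as follows: given several candidate totals, prefer one that agrees with the sum of the extracted line items, and fall back to the most confident candidate when none agrees. The `Candidate` type, the `resolve_total` function, and the one-cent tolerance are assumptions for illustration. The constraint used is the simple "total = sum of line items" rule named above, and real invoices would also need tax and discounts handled.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A candidate value for the 'total' field (hypothetical type)."""
    value: Decimal
    confidence: float


def resolve_total(
    candidates: Sequence[Candidate],
    line_items: Sequence[Decimal],
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding tolerance
) -> Tuple[Optional[Candidate], bool]:
    """Return (chosen candidate, whether it satisfies the sum constraint)."""
    if not candidates:
        return None, False
    expected = sum(line_items, Decimal("0"))
    # Candidates whose value matches the line-item sum within the tolerance.
    consistent = [c for c in candidates if abs(c.value - expected) <= tolerance]
    # Prefer consistent candidates; otherwise fall back to all of them.
    pool = consistent or list(candidates)
    best = max(pool, key=lambda c: c.confidence)
    return best, bool(consistent)
```

The second element of the result lets a caller flag the record for review when no candidate satisfies the constraint, instead of silently accepting the fallback.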
Technical Foundation
The patent specifies the OCR layer, the joint embedding model, the candidate span extractor, the type classifier, the validation engine, and the structured-record output.
- OCR Layer — Production OCR extracts text plus per-token bounding boxes. Bounding boxes preserve spatial position needed for layout features.
- Joint Embedding Model — Neural model embeds tokens with text features plus layout features (position, neighbors, region). Transformer architectures with spatial-aware attention work well here.
- Candidate Span Extractor — Span-prediction head identifies candidate field spans. Outputs span coordinates and type predictions with confidence.
- Type Classifier — Per candidate, classifies into canonical field types. Trained on labeled examples spanning many vendors and form types.
- Validation Engine — Per extracted value, type-specific validators check plausibility: dates parse to valid dates, amounts parse as numbers, names look like names. Invalid values trigger downstream handling.
- Structured Record Output — Final output is typed structured data with source-location traceability. Format compatible with downstream record processing.
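A minimal sketch of the validation engine and the structured record, under stated assumptions: the accepted date formats, the US-style amount cleaning (comma as thousands separator), and the `ExtractedField` layout are all hypothetical choices for this example, not details taken from the patent.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple, Union

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats, not exhaustive


def parse_date(raw: str) -> Optional[date]:
    """Return a date if the text parses under a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return a Decimal for text like '$1,250.00', else None."""
    cleaned = raw.strip().lstrip("$€£").replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return value if value.is_finite() else None  # reject NaN and Infinity


VALIDATORS = {"date": parse_date, "amount": parse_amount}


@dataclass(frozen=True)
class ExtractedField:
    """Typed value plus the source location it was read from."""
    field_type: str
    raw_text: str
    value: Union[date, Decimal, str, None]  # None means validation failed
    confidence: float
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page

    @property
    def valid(self) -> bool:
        return self.value is not None


def build_field(field_type, raw_text, confidence, page, bbox) -> ExtractedField:
    """Validate raw text for its type; types without a validator keep the raw text."""
    parser = VALIDATORS.get(field_type)
    value = parser(raw_text) if parser else raw_text
    return ExtractedField(field_type, raw_text, value, confidence, page, bbox)
```

Keeping `raw_text`, `page`, and `bbox` next to the parsed value is one way to provide the source-location traceability described above: an auditor can go from any value back to the region of the page it came from.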
The Process
The pipeline runs as a document-processing service. Documents arrive, extraction completes within seconds, and structured records flow to downstream systems.
- Document Submitted — Document image or PDF arrives at the extraction service. Format-specific preprocessing converts to a standard internal representation.
- OCR And Tokenize — OCR produces tokens with bounding boxes. Tokens become the input to the embedding model.
- Build Joint Embeddings — Per token, build joint text-plus-layout embedding. Embeddings pass to the candidate extractor.
- Extract Candidates — Span-prediction model produces candidate field spans with type predictions and confidence.
- Classify And Validate — Per candidate, type classifier outputs canonical type; validator checks value plausibility. Low-confidence or invalid candidates are dropped or flagged.
- Resolve Cross-Field Constraints — Multi-field consistency (totals, dates, references) resolves ambiguities. Final structured record emerges.
- Deliver To Downstream — The structured record is delivered to downstream systems such as accounting, expense management, and tax processing. Source-location traceability supports audit.
Quality Control
A wrong extraction propagates errors into downstream systems, so the patent specifies safeguards.
- Confidence Thresholding — Low-confidence extractions are flagged for human review or suppressed. The system prefers no extraction to a wrong extraction.
- Type Validators — Per-type validators check plausibility. Invalid extractions are caught at validation, not propagated.
- Cross-Field Constraint Checks — Total-equals-sum-of-line-items and similar cross-field constraints catch inconsistent extractions. Inconsistent records are flagged for review.
- Source Traceability — Every extracted value retains its source position in the original document. Audits can trace errors back to source.
- Continuous Training — Misextractions feed back into training data. The model improves continuously as new form types and patterns appear.
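The thresholding safeguard can be sketched as a small routing function. The two thresholds, the `Route` names, and the choice to send confident but invalid values to review are assumptions for illustration; the summary above only says that low-confidence or invalid extractions are flagged or suppressed.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"      # pass through to downstream systems
    REVIEW = "review"      # queue for a human reviewer
    SUPPRESS = "suppress"  # drop: no extraction beats a wrong one


def route(
    confidence: float,
    is_valid: bool,
    accept_at: float = 0.9,  # assumed threshold
    review_at: float = 0.5,  # assumed threshold
) -> Route:
    """Decide what happens to one extracted value."""
    if confidence < review_at:
        return Route.SUPPRESS
    if is_valid and confidence >= accept_at:
        return Route.ACCEPT
    # Mid-confidence values, and confident values that failed validation.
    return Route.REVIEW
```

In this sketch, values routed to `REVIEW` are also natural candidates for the feedback loop: once a reviewer corrects them, they can be added to the training data.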
Real-World Application
Form-information extraction underpins Google's document AI products, Workspace document parsing, and the structured-data extraction layer for many Google enterprise products. The primitives generalize to any document-to-data pipeline.
- Multi-vendor Form Coverage — One model handles forms across vendors. Vendor surface labeling normalizes to canonical schema.
- Joint-modality Extraction Method — Text plus layout features fuse in the model. Layout is a first-class feature.
- Structured Output Form — Final output is structured typed records with source traceability. Downstream systems can query and audit.
Why Form-Extraction Generalizes To Web Content
Web pages are forms too: positionally-conventional layouts with semantic fields (product cards, recipe sections, article metadata). The same joint-modality techniques work on web content, informing how Google extracts structured data from pages.
Why Layout-Aware Models Beat Text-Only
For any content where position carries meaning (web pages, forms, documents), models that read layout outperform models that read text alone. The patent's primitives illustrate why layout-aware extraction has become a common approach.
What This Means for SEO
This patent extracts structured data from form-like documents by fusing text with layout features in one ML model, and the same joint-modality approach applies to web pages. SEO implication: layout and position carry meaning the system reads, so structured, positionally-conventional content (product cards, recipe sections, article metadata) extracts cleanly into structured data.
- Layout Carries Meaning The System Reads — The model treats position as a first-class feature alongside text, because forms and web pages communicate through spatial convention. Presenting data in conventional positions helps the system extract it correctly.
- Layout-Aware Models Beat Text-Only — For any content where position carries meaning, models reading layout outperform text-only extraction. Visually and structurally clear pages are easier for Google to parse into structured data.
- Web Pages Are Forms Too — Product cards, recipe sections, and article metadata are positionally-conventional layouts with semantic fields. Following common layout conventions for your content type aids automated extraction.
- Typed Fields Normalize Across Surface Labels — The model maps varied surface labels to canonical field types. Whether you label a price 'Total' or 'Amount Due', clear typed presentation helps it resolve to the right field.
- Cross-Field Consistency Is Checked — Validation enforces constraints like total equaling the sum of line items. Internally consistent structured content (matching prices, valid dates) extracts reliably; inconsistent data gets flagged.
- Low-Confidence Extractions Are Dropped — The system prefers no extraction to a wrong one, suppressing low-confidence candidates. Ambiguous or cluttered layouts reduce confidence, so clarity directly improves whether your data is captured.
- Clean Structure Enables Rich Results — Reliable structured extraction is what feeds structured-data and rich-result features. Conventional, machine-readable layout for your content type increases the chance of enhanced presentation in search.