Extracts (entity, attribute, value) triples from unstructured text at web scale, turning prose like 'Tesla was founded in 2003 by Martin Eberhard' into structured facts the knowledge graph can store, query, and verify.
Patent Overview
- Filed: 2019-07-05
- Granted: 2022-03-01
- Application Number: US 16/504,114
The Challenge
Most of the web's factual knowledge sits in prose. The knowledge graph cannot consume prose directly; it needs structured facts. Extracting (entity, attribute, value) triples reliably from natural language at scale is the long-standing challenge this patent addresses.
- Prose Hides Structured Facts — Articles assert facts in natural-language sentences. A human reader extracts the structured triples instantly; an automated system needs language understanding to do the same.
- Surface Patterns Are Too Brittle — Pattern-based extractors ('X was founded in Y') break on syntactic variation. Real text uses many constructions for the same fact, and patterns cannot enumerate them all.
- Attributes Are Open-Ended — Unlike a fixed schema, the set of possible attributes grows with the world. The system needs to handle novel attributes, not just predefined ones.
- Values Have Variable Types — An attribute's value can be a date, a number, a name, a place, or a structured object. The extractor must classify the value type as well as extract its text.
- Confidence Calibration Is Essential — Many extracted triples are uncertain. The system must produce calibrated confidence so downstream consumers can decide which to trust.
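The brittleness of surface patterns is easy to see in a small sketch. The regular expression and the three sentences below are illustrative assumptions, not taken from the patent; all three sentences assert the same founding fact, but only the first matches the pattern.

```python
import re

# Hypothetical surface pattern for "<entity> was founded in <year>".
FOUNDED = re.compile(r"^(?P<entity>[A-Z]\w+) was founded in (?P<year>\d{4})")

def extract_founded(sentence):
    """Return an (entity, attribute, value) triple, or None if the pattern misses."""
    m = FOUNDED.search(sentence)
    if m is None:
        return None
    return (m.group("entity"), "founded", m.group("year"))

sentences = [
    "Tesla was founded in 2003 by Martin Eberhard.",
    "Martin Eberhard founded Tesla in 2003.",
    "Founded in 2003, Tesla builds electric cars.",
]
results = [extract_founded(s) for s in sentences]
for sentence, triple in zip(sentences, results):
    print(sentence, "->", triple)
```

Covering the second and third phrasings would need two more patterns, and each further construction needs another, which is the enumeration problem a learned extractor is meant to avoid.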
Innovation
How The System Works
The patent trains a model that takes sentences plus candidate (entity, attribute) pairs and outputs the value plus a confidence. Training uses authoritative sources where the triple is already known, so the model sees many sentences paired with the facts they assert and learns to recover triples from prose.
- Build Labeled Training Data — Authoritative sources (Wikipedia infoboxes, Wikidata) provide (entity, attribute, value) triples plus sentences that assert them. Each triple becomes a labeled training example.
- Train Joint Extraction Model — A neural model learns to predict the value given the sentence and the (entity, attribute) pair. The joint formulation handles many attributes with one model.
- Run Over The Web Corpus — At indexing time, the model processes sentences from crawled documents. For each candidate (entity, attribute) pair, it predicts the value and confidence.
- Aggregate Across Documents — The same triple is often asserted by many documents. Aggregation across documents produces a consensus value and an aggregated confidence.
- Reconcile Conflicting Assertions — When sources disagree on a value, the system reconciles using source authority, recency, and consensus. The winning value enters the graph.
- Track Provenance — Every triple stores the supporting documents. Provenance enables audit, correction, and confidence updates when sources are revised.
- Update Continuously — New documents bring new assertions. The pipeline runs continuously so the graph stays current with the web's latest factual claims.
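The first step, building labeled data from known triples, can be sketched as a simple alignment. This is a minimal sketch under stated assumptions: the `Triple` and `Example` types, the `align` function, and the substring matching rule are hypothetical, and a production system would match canonical entity IDs and normalized values rather than raw strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    entity: str
    attribute: str
    value: str

@dataclass(frozen=True)
class Example:
    sentence: str
    entity: str
    attribute: str
    value: str  # label the model should learn to recover

def align(triples, sentences):
    """Pair a known triple with each sentence that mentions both its
    entity and its value (naive substring match)."""
    examples = []
    for t in triples:
        for s in sentences:
            if t.entity in s and t.value in s:
                examples.append(Example(s, t.entity, t.attribute, t.value))
    return examples

known = [Triple("Tesla", "founding_year", "2003")]
corpus = [
    "Tesla was founded in 2003 by Martin Eberhard.",
    "Tesla delivered its first Roadster in 2008.",
]
training = align(known, corpus)
print(training)
```

A match on entity and value does not guarantee that the sentence asserts the attribute, so labels built this way are noisy; that noise is one reason calibrated confidence matters downstream.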
Triple Extraction At Web Scale
The patent's load-bearing idea is to treat triple extraction as a learned task with massive supervised training data from authoritative sources. The model generalizes from labeled examples to extract triples from new prose.
Learn From Labeled Examples, Apply To Web
Authoritative sources provide millions of labeled triple-sentence pairs. Training a model on this data and applying it to the open web yields structured facts at a scale no hand-built extractor could match.
- Joint Model — One model handles many attributes by taking the attribute as input. This avoids training and maintaining a separate model for every attribute.
- Cross-Source Aggregation — Many sources asserting the same triple raise confidence. Reconciliation handles disagreement systematically.
- Provenance Tracking — Every triple's supporting documents are stored. Provenance enables audit and correction.
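One way to picture "taking the attribute as input" is to serialize the entity, the attribute, and the sentence into a single input sequence. The `Query` type, the `encode` function, and the `[ENT]`, `[ATTR]`, and `[SEP]` markers below are hypothetical; the patent summary does not specify an input format.

```python
from typing import NamedTuple

class Query(NamedTuple):
    sentence: str
    entity: str
    attribute: str

def encode(q: Query) -> str:
    """Serialize a query into one input sequence. The marker tokens are
    assumptions; a real tokenizer defines its own special tokens."""
    return f"[ENT] {q.entity} [ATTR] {q.attribute} [SEP] {q.sentence}"

sentence = "Tesla was founded in 2003 by Martin Eberhard."
# Same sentence, same model, different attributes: only the input changes.
inputs = [encode(Query(sentence, "Tesla", attr))
          for attr in ("founding_year", "founder")]
for text in inputs:
    print(text)
```

Because the attribute is just part of the input, asking about a new attribute means constructing a new query rather than training a new model.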
Technical Foundation
The patent specifies the training data construction, the model architecture, the inference pipeline, the aggregation layer, and the provenance store.
- Labeled Training Construction — Authoritative-source triples are aligned with sentences that mention them. Distant supervision creates the labeled training corpus at scale.
- Joint Neural Model — A Transformer or LSTM model takes (sentence, entity, attribute) as input and outputs a value plus a confidence. A single model serves all attributes.
- Inference Pipeline — Sentences from crawled documents stream into the inference layer. For each candidate triple, the model produces a prediction.
- Aggregation Layer — Predictions from many documents for the same triple aggregate into a consensus value with combined confidence.
- Reconciliation Logic — When values disagree, the system applies source authority, recency, and consensus rules to pick the canonical answer.
- Provenance Store — Each triple's list of supporting documents is stored alongside its value. Audits and corrections trace back to sources.
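The aggregation layer and provenance store can be sketched together. The combination rule here is a noisy-OR, chosen for illustration; the source does not say how confidences are combined, and the `aggregate` function and its tuple layout are assumptions.

```python
from collections import defaultdict

def aggregate(predictions):
    """predictions: iterable of (entity, attribute, value, confidence, doc_id).
    Returns {(entity, attribute): (value, confidence, [doc_ids])}."""
    # Group evidence per candidate value of each (entity, attribute) pair.
    by_key = defaultdict(lambda: defaultdict(list))
    for entity, attribute, value, conf, doc_id in predictions:
        by_key[(entity, attribute)][value].append((conf, doc_id))

    result = {}
    for key, values in by_key.items():
        scored = []
        for value, evidence in values.items():
            # Noisy-OR: probability that at least one assertion is right,
            # assuming (optimistically) that documents are independent.
            miss = 1.0
            for conf, _ in evidence:
                miss *= 1.0 - conf
            scored.append((1.0 - miss, value, [d for _, d in evidence]))
        best = max(scored, key=lambda s: s[0])
        result[key] = (best[1], best[0], best[2])
    return result

preds = [
    ("Tesla", "founding_year", "2003", 0.5, "doc-a"),
    ("Tesla", "founding_year", "2003", 0.5, "doc-b"),
    ("Tesla", "founding_year", "2004", 0.6, "doc-c"),
]
value, confidence, sources = aggregate(preds)[("Tesla", "founding_year")]
print(value, confidence, sources)
```

Two moderately confident documents that agree outscore one somewhat more confident document that disagrees, and the winning value carries the IDs of the documents that support it. Independence rarely holds on the web, where pages copy one another, so a real system would likely discount near-duplicate sources.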
The Process
The pipeline runs continuously as part of the indexing path. Each crawled document contributes triple assertions; the graph updates as evidence accumulates.
- Document Enters Pipeline — A crawled document streams through the analyzer alongside standard indexing.
- Sentence Segmentation — The document is split into sentences. Each sentence is a candidate triple source.
- Entity Tagging — Entities in each sentence are tagged with canonical IDs. Triples are inferred relative to tagged entities.
- Run Joint Model — For each (entity, attribute) candidate, the model predicts the value and confidence. Predictions stream to the aggregation layer.
- Aggregate And Reconcile — Predictions for the same triple aggregate across documents. Reconciliation resolves conflicts.
- Update Graph — High-confidence triples enter the knowledge graph. Lower-confidence ones stay in the candidate pool until reinforced.
- Maintain Provenance — Each graph triple's supporting documents are recorded. Audits and corrections trace back through the provenance chain.
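The early steps and the final routing decision can be sketched as follows. The entity dictionary, the 0.8 threshold, and the hard-coded predictions are assumptions for illustration; in a real pipeline the predictions would come from the joint model and the tagger would be a full entity linker.

```python
import re

# Hypothetical entity dictionary: surface form -> canonical ID.
ENTITY_IDS = {"Tesla": "ent:tesla", "SpaceX": "ent:spacex"}
THRESHOLD = 0.8  # assumed cutoff for entering the graph

def segment(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tag_entities(sentence):
    """Return canonical IDs for dictionary entities found in the sentence."""
    return [eid for name, eid in ENTITY_IDS.items() if name in sentence]

def route(prediction, graph, candidates):
    """Send a (key, value, confidence) prediction to the graph or the pool."""
    key, value, conf = prediction
    target = graph if conf >= THRESHOLD else candidates
    target[key] = (value, conf)

doc = "Tesla was founded in 2003. SpaceX may date to 2002. Nothing here."
tagged = [(s, tag_entities(s)) for s in segment(doc)]

graph, candidates = {}, {}
# Stand-in model outputs; a real system would call the joint model here.
route((("ent:tesla", "founding_year"), "2003", 0.93), graph, candidates)
route((("ent:spacex", "founding_year"), "2002", 0.41), graph, candidates)
print(tagged)
print(graph, candidates)
```

The low-confidence prediction is kept rather than discarded, so later documents that assert the same value can raise its aggregated confidence past the threshold.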
Quality Control
Triple extraction risks producing wrong facts at scale. The patent specifies safeguards to keep the extracted facts reliable.
- Confidence Threshold — Only high-confidence triples merge into the graph. Low-confidence ones wait for further evidence.
- Source Authority Weighting — Triples backed by high-authority sources outweigh the same triples from low-authority sources. The authority signal is central to reconciliation.
- Conflict Detection — When sources disagree, the system flags the conflict. Editorial review can resolve important disagreements; algorithmic reconciliation handles routine ones.
- Provenance Audit — Random audits sample triples and verify against sources. Audit results refine the confidence calibration.
- Correction Channel — Users and entity owners can correct wrong triples. Corrections feed back into the pipeline to refine the underlying model.
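Authority weighting and conflict detection can be combined in one small reconciliation routine. This is a sketch under assumptions: the `Assertion` type, the authority scores, the one-year half-life for recency, and the additive scoring rule are all hypothetical, since the source names the signals (authority, recency, consensus) but not how they are combined.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Assertion:
    value: str
    authority: float   # assumed source score in [0, 1]
    published: date

def reconcile(assertions, today, half_life_days=365.0):
    """Pick the value with the highest total weight, where each assertion
    contributes authority * recency decay. Returns (winner, is_conflict)."""
    totals = {}
    for a in assertions:
        age = (today - a.published).days
        decay = 0.5 ** (age / half_life_days)
        # Summing over assertions lets agreement (consensus) add up.
        totals[a.value] = totals.get(a.value, 0.0) + a.authority * decay
    winner = max(totals, key=totals.get)
    return winner, len(totals) > 1

today = date(2024, 1, 1)
claims = [
    Assertion("2003", 0.9, date(2023, 1, 1)),
    Assertion("2003", 0.6, date(2022, 1, 1)),
    Assertion("2004", 0.3, date(2023, 12, 1)),
]
winner, conflict = reconcile(claims, today)
print(winner, conflict)
```

The conflict flag is returned alongside the winner so that routine disagreements can be settled algorithmically while important ones are still surfaced for editorial review.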
Real-World Application
Entity-attribute extraction powers most of Google's structured-fact features: Knowledge Panel facts, voice-assistant factoids, AI Overview groundings, and the structured-data feed that newer assistant products consume.
- Millions Of Triples Per Update — Each refresh extracts millions of triples from new and revised documents. The graph grows continuously as the web evolves.
- Joint Model Architecture — One model serves many attributes via the joint formulation, so the system does not need a separate model per attribute.
- Tracked Provenance — Every triple stores its supporting documents. Provenance supports audit, correction, and authority weighting.
Why Definitional Sentences Are Extractable Currency
Short, declarative sentences asserting entity-attribute-value structure are the easiest for the model to extract. Pages that include such sentences early on (founding dates, locations, prices, key attributes) contribute their facts to the graph more reliably.
Why Tables And Structured Data Outperform Prose
Tables that pair attributes with values, and Schema.org markup that asserts the same triples explicitly, bypass the model entirely. The extraction is pre-done. Pages with strong structured-data coverage feed the graph with maximum efficiency.
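To see why markup makes extraction nearly trivial, consider a flat Schema.org JSON-LD object. The `jsonld_to_triples` helper below is a hypothetical sketch that handles only flat objects with scalar values, and the markup values are illustrative; `name`, `foundingDate`, and `url` are real Schema.org properties of `Organization`.

```python
import json

def jsonld_to_triples(raw):
    """Flatten a single flat JSON-LD object into (entity, attribute, value)
    triples. Nested objects and arrays are skipped in this sketch."""
    data = json.loads(raw)
    entity = data.get("name", data.get("@id", ""))
    return [
        (entity, key, value)
        for key, value in data.items()
        if not key.startswith("@") and key != "name"
        and isinstance(value, (str, int, float))
    ]

markup = """
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Tesla",
  "foundingDate": "2003-07-01",
  "url": "https://www.tesla.com"
}
"""
triples = jsonld_to_triples(markup)
print(triples)
```

No language understanding is involved: the attribute names and values are already explicit, so the work reduces to parsing and mapping property names onto the graph's attributes.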
What This Means for SEO
When the system extracts entity-attribute pairs from text, your content becomes data the engine can use directly in answers.
- Subject-Verb-Object Sentences Are Mined — Sentences that cleanly state entity-relationship-value are the easiest to extract. Write definitional sentences in this form, especially in the first paragraph of each section.
- Tables And Lists Are Extraction-Friendly — Tables that pair attributes with values are gold for entity extractors. A specs table on a product page is more extractable than the same data in prose.
- Consistent Attribute Vocabulary Wins — Use the same attribute names across your content set (e.g., always say "founded" rather than mixing "founded", "established", and "since"). Consistent wording makes your statements of that attribute easier to extract and to aggregate across pages.