Trains entity recognition models on complete sentences from authoritative sources, then teaches them to predict the same entity when only fragments of the sentence are given. The goal is a model that identifies entities in messy real-world text nearly as reliably as it does in clean training examples.
Patent Overview
- Inventors: Maxim Gubin, Sangsoo Sung, Krishna Bharat, Kenneth W. Dauber
- Filed: 2014-12-30
- Granted: 2016-02-02
- Application Number: US 14/586,303
The Challenge
Entity recognition is the foundation of every entity-aware search feature, but training a good model is hard. Sparse hand-labeled data limits coverage; surface-pattern rules break on real-world variation. The system needed a way to harvest training examples at scale from clean source text.
- Hand-Labeled Data Does Not Scale — Manually labeling entity mentions in millions of sentences is prohibitively expensive. Models trained on small hand-labeled sets do not generalize to the diversity of real-world text.
- Surface Patterns Break On Variation — Rules like 'capitalized words after "by" are people' fail constantly. Real text is messier than rules can capture, especially across domains and styles.
- Authoritative Sources Have Clean Examples — Wikipedia, structured databases, official sites all link entity mentions to canonical IDs. Each link is a labeled training example, free to harvest if the system knows how to use it.
- Need To Generalize From Whole Sentences To Fragments — Training on complete sentences gives the model strong context. But real-world text often presents entities in fragmentary form: titles, search queries, snippets. The model must transfer to fragments too.
- Models Must Stay Current — New entities appear constantly. The training pipeline must be retrainable as the entity universe grows, so the recognition model keeps pace with the live world.
Innovation
How The System Works
The patent describes a pipeline that harvests sentences from authoritative sources (Wikipedia, a knowledge graph), uses each sentence's known entity links as training labels, and trains a model to predict the entity from the surrounding context. Training then progressively masks more of the context so the model learns to handle short and fragmentary inputs.
- Mine Sentences From Authoritative Sources — Wikipedia and similar sources contain millions of sentences with entity links to canonical IDs. The pipeline scrapes these sentences along with the linked entities to build a labeled training set.
- Use Entity Links As Labels — Each link is a labeled example: 'in this context, this surface phrase refers to this canonical entity'. Aggregated across all source sentences, the labels cover millions of entities and many contexts per entity.
- Train On Full Sentences First — Initial training uses complete sentences with their entity labels. The model learns to map (sentence, span) pairs to canonical entity IDs using rich context.
- Progressively Mask Context — Subsequent training rounds mask portions of the sentence (random words, sentence prefixes, suffixes). The model learns to predict the entity from less context, eventually handling very short inputs.
- Add Negative Examples — The model also sees text spans that are not entities or refer to unrelated entities. This teaches it to abstain when no entity is confidently present.
- Calibrate Confidence — Output probabilities are calibrated so the model's confidence aligns with empirical accuracy. Calibrated confidence supports threshold-based decisions downstream.
- Retrain Periodically — As the entity universe grows and as authoritative sources update, the pipeline retrains on the latest data. The deployed model stays current.
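The link-as-label step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes the source text uses wiki-style `[[Target|surface]]` link markup, and the names `LabeledMention` and `harvest` are hypothetical. The link target stands in for a canonical entity ID.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Matches [[Target]] and [[Target|surface text]] wiki-style links (assumed markup).
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


@dataclass(frozen=True)
class LabeledMention:
    sentence: str   # plain text with link markup removed
    start: int      # character offset of the mention in `sentence`
    end: int        # exclusive end offset
    entity_id: str  # link target, standing in for a canonical entity ID


def harvest(marked_up: str) -> list[LabeledMention]:
    """Turn one linked sentence into plain text plus one record per link."""
    plain_parts: list[str] = []
    spans: list[tuple[int, int, str]] = []
    cursor = 0  # read position in the marked-up input
    length = 0  # length of the plain text produced so far
    for match in LINK_RE.finditer(marked_up):
        before = marked_up[cursor:match.start()]
        plain_parts.append(before)
        length += len(before)
        target = match.group(1).strip()
        # A link without a pipe displays its target as the surface text.
        surface = (match.group(2) or match.group(1)).strip()
        spans.append((length, length + len(surface), target))
        plain_parts.append(surface)
        length += len(surface)
        cursor = match.end()
    plain_parts.append(marked_up[cursor:])
    sentence = "".join(plain_parts)
    return [LabeledMention(sentence, s, e, t) for s, e, t in spans]
```

Calling `harvest("[[Paris]] is the capital of [[France|the French Republic]].")` yields two records that share one plain sentence, each pairing a surface span with the entity its link pointed to. Those (sentence, span, entity) triples are the training examples the later stages consume.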
Authoritative Links Become Labels
The patent's load-bearing idea is to harvest entity links from authoritative content and treat each link as a labeled training example. The labor of human editors becomes the labor of an automated training pipeline, at million-sentence scale.
Curated Data, Automated Scale
Wikipedia editors spent years linking entities. The patent borrows that work, multiplies it by automation, and produces training data far beyond what any in-house labeling team could match.
- Link-As-Label — Every entity link in an authoritative source is one labeled training example. The labels are produced by human editorial care; the harvest is automated.
- Progressive Context Masking — Models trained on full sentences and progressively masked variants generalize to short queries, fragmentary text, and noisy real-world inputs.
- Calibrated Confidence — The model's probability outputs match empirical accuracy. Downstream consumers can threshold confidence to decide whether to use a prediction.
Technical Foundation
The patent specifies the data harvesting pipeline, the training procedure, the model architecture, and the calibration framework.
- Source Harvest Pipeline — Scrapers extract sentences with entity links from Wikipedia and similar sources. Each sentence becomes a training record with the surface span and the canonical entity ID.
- Training Data Aggregation — Records are aggregated into a labeled training corpus. Entity coverage is monitored to ensure broad representation across types and frequency tiers.
- Model Architecture — Neural models (LSTM-based in early versions, transformer-based in later ones) consume the sentence context and output a probability distribution over candidate entity IDs.
- Masking Schedule — Training proceeds through stages: full sentence, partial sentence, single-clause, span-only. Each stage teaches the model to use less context effectively.
- Negative Sampling — For each positive example, several negative examples are constructed: non-entity spans, wrong-entity labels. Negative sampling shapes the abstain behavior.
- Confidence Calibration Layer — A calibration layer maps raw model probabilities to calibrated confidences. Calibration uses held-out evaluation data and updates as the model evolves.
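The masking schedule and negative sampling described above might look roughly like the following. This is a sketch under stated assumptions: the text names the four stages but not how each is cut, so the token window for "partial" and the punctuation-based clause boundaries are illustrative choices, and `mask_context`, `make_negatives`, and `CLAUSE_BREAKS` are hypothetical names.

```python
from __future__ import annotations

import random

# Stage names follow the schedule above; how each stage trims context is assumed.
STAGES = ("full", "partial", "clause", "span_only")
CLAUSE_BREAKS = {",", ";", ":"}  # assumed clause delimiters


def mask_context(tokens, span, stage, window=2):
    """Return (kept_tokens, new_span) for one curriculum stage.

    `span` is (start, end) in token indices, end exclusive.
    """
    start, end = span
    if stage == "full":
        lo, hi = 0, len(tokens)
    elif stage == "partial":
        # Keep only `window` tokens on each side of the mention.
        lo, hi = max(0, start - window), min(len(tokens), end + window)
    elif stage == "clause":
        # Expand outward from the mention until a clause delimiter is hit.
        lo = start
        while lo > 0 and tokens[lo - 1] not in CLAUSE_BREAKS:
            lo -= 1
        hi = end
        while hi < len(tokens) and tokens[hi] not in CLAUSE_BREAKS:
            hi += 1
    elif stage == "span_only":
        lo, hi = start, end
    else:
        raise ValueError(f"unknown stage: {stage}")
    return tokens[lo:hi], (start - lo, end - lo)


def make_negatives(tokens, span, entity_id, entity_pool, rng, k=2):
    """Build negatives as (tokens, span, entity_id_or_None) triples.

    Up to k wrong-entity labels on the true span, plus one span outside the
    mention labelled None (treated here as "not an entity", a simplification).
    """
    wrong = [e for e in entity_pool if e != entity_id]
    negatives = [(tokens, span, e) for e in rng.sample(wrong, min(k, len(wrong)))]
    outside = [i for i in range(len(tokens)) if not (span[0] <= i < span[1])]
    if outside:
        i = rng.choice(outside)
        negatives.append((tokens, (i, i + 1), None))
    return negatives
```

Iterating over `STAGES` for each harvested example gives the model the same mention at four context sizes. A real pipeline would need a better source of non-entity spans than a random outside token, since that token may belong to another entity.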
The Process
The pipeline runs as periodic batch jobs that produce new model versions, plus an online inference layer that consumes the trained models at query time.
- Harvest Latest Source Data — Scrapers refresh sentence and link data from authoritative sources. Change detection identifies new examples since last harvest.
- Build Training Corpus — Sentences and labels are aggregated into the labeled training set. Sampling balances entity types and frequencies.
- Train Base Model — Initial training runs on full sentences. The model learns the basic entity-context mapping.
- Apply Masking Curriculum — Subsequent training stages mask progressively more of the context. Model parameters update so the model handles shorter and shorter inputs.
- Evaluate And Calibrate — Held-out evaluation measures accuracy at multiple input lengths. The calibration layer is fit to align model confidence with empirical accuracy.
- Deploy New Version — The trained model is packaged and deployed to the inference layer. Canary rollout monitors regression before full deployment.
- Schedule Next Retrain — The pipeline cycles. New source data, new model. The deployed model never gets too far behind the live entity universe.
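The evaluate-and-calibrate step can be illustrated with histogram binning, one simple calibration technique. The text does not say which method the calibration layer uses, so this is a stand-in chosen for brevity, and the function names are hypothetical.

```python
from __future__ import annotations


def _bin_index(p, n_bins):
    # Map a probability in [0, 1] to a bin; p == 1.0 falls into the last bin.
    return min(int(p * n_bins), n_bins - 1)


def fit_histogram_binning(confidences, correct, n_bins=10):
    """Learn per-bin empirical accuracy from held-out predictions."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(confidences, correct):
        b = _bin_index(p, n_bins)
        hits[b] += int(ok)
        counts[b] += 1
    # Empty bins fall back to the bin midpoint (an arbitrary, neutral choice).
    return [
        hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(p, table):
    """Replace a raw probability with the accuracy observed in its bin."""
    return table[_bin_index(p, len(table))]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(confidences, correct):
        b = _bin_index(p, n_bins)
        sums[b] += p
        hits[b] += int(ok)
        counts[b] += 1
    total = sum(counts)
    return sum(
        (counts[b] / total) * abs(sums[b] / counts[b] - hits[b] / counts[b])
        for b in range(n_bins)
        if counts[b]
    )
```

A downstream consumer would fit the table on held-out data, pass each raw score through `calibrate`, and act only when the calibrated value clears its own threshold. Refitting the table after every retrain keeps those thresholds meaningful across model versions.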
Quality Control
Training data quality determines model quality. The patent specifies safeguards against bad source data and against model regressions.
- Source Quality Audit — Authoritative sources are audited periodically. Sources with degrading link quality are removed from the harvest set.
- Label Sanity Checks — Harvested labels are sanity-checked: extreme frequency imbalances and obvious mislabels are flagged for review before reaching training.
- Held-Out Regression Tests — Each new model version is evaluated on a held-out benchmark. Regressions on benchmark accuracy block deployment.
- Canary Inference Monitoring — New models route a small fraction of traffic first. Anomaly detection on inference outputs triggers rollback before full deployment.
- Confidence Threshold Tuning — Downstream consumers tune their confidence thresholds based on observed accuracy. Calibration updates flow through to threshold adjustments.
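Two of the safeguards above, the label frequency check and the held-out regression gate, reduce to simple comparisons. The sketch below assumes accuracy is reported per input-length bucket and that a fixed tolerance decides what counts as a regression. The function names, bucket names, and threshold values are all illustrative.

```python
from __future__ import annotations


def flag_dominant_entities(label_counts, max_share=0.5):
    """Return entity IDs whose share of all labels exceeds `max_share`."""
    total = sum(label_counts.values())
    if total == 0:
        return []
    return sorted(
        entity for entity, n in label_counts.items() if n / total > max_share
    )


def regression_gate(baseline, candidate, tolerance=0.005):
    """Compare per-bucket benchmark accuracy of a candidate against the baseline.

    Returns (ok, failures). A bucket fails if the candidate is missing it or
    its accuracy drops by more than `tolerance`.
    """
    failures = {}
    for bucket, base_acc in baseline.items():
        cand_acc = candidate.get(bucket)
        if cand_acc is None or base_acc - cand_acc > tolerance:
            failures[bucket] = (base_acc, cand_acc)
    return (not failures), failures
```

Gating per bucket rather than on a single average matters here because the masking curriculum targets short inputs specifically: a candidate could improve on full sentences while quietly regressing on span-only inputs.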
Real-World Application
Entity recognition models trained via this pipeline underpin Knowledge Panel triggering, structured data extraction, Search Generative Experience grounding, and many other entity-aware features across Google.
- Millions Of Training Examples Per Refresh — Authoritative sources supply millions of labeled examples each refresh cycle. The data volume is orders of magnitude beyond what manual labeling could produce.
- Multi-Stage Masking Curriculum — Training proceeds through progressive masking stages so the model handles inputs from full sentences down to single phrases.
- Calibrated Confidence Output — Calibration ensures model confidence aligns with empirical accuracy, supporting threshold-based decisions in downstream consumers.
Why Wikipedia-Consistent Naming Matters For SEO
Because the entity-recognition pipeline learns from authoritative sources, content whose entity references match Wikipedia's phrasing is more likely to be recognized reliably. This pipeline is the technical reason that Wikipedia-aligned naming can benefit entity-heavy content in search.
Why Schema Markup Helps Recognition
Pages that explicitly mark entities with Schema.org markup hand the recognition pipeline labels in their cleanest form. The pipeline can incorporate marked-up content as additional training signal, making structured data a contributor to model quality.
What This Means for SEO
When entity-recognition models are continuously trained on web text, the way you mention entities teaches the model what those entities mean.
- Consistent Entity Mentions Train The Model — A piece of content that uses the canonical entity name (and consistent disambiguators) helps the model classify it correctly. Inconsistent mentions are signal noise.
- Disambiguators Matter For Common Names — When an entity shares a name with others (e.g., Apple the company vs. apple the fruit), nearby words disambiguate. Pages that consistently pair the entity with strong disambiguators are more likely to be recognized, and so surfaced, for the intended sense.
- Schema Closes The Training Loop — Pages that mark entity mentions with structured data provide a clean training signal. Even if Schema.org markup does not directly affect rankings, it makes your content easier for the model to understand.