Identifies resources (entities, documents, products) referenced within content by analyzing both organic prose and structured-data markup, enabling resource-aware retrieval beyond literal keyword matching.
Patent Overview
- Inventor: Srinivasan Venkatachary
- Assignee: Google LLC
- Filed: 2011-06-30
- Granted: 2014-04-01
- Application Number: US 13/174,180
The Challenge
Web content references resources (entities, products, documents) in many forms: prose mentions, named-entity references, Schema.org markup, microformat data, and internal links. Identifying which resources a page is about requires reading organic content and structured signals together.
- Prose Mentions Need Entity Recognition — Resources are referenced in prose using natural language. Identifying them requires entity recognition that handles names, aliases, references.
- Structured Data Provides Clean Signal — Schema.org markup, microformats, and structured metadata identify resources explicitly. The signal is clean but not always present.
- Combining Both Adds Coverage — Pages with structured markup give explicit identification; pages without it require entity recognition from prose. Combining both maximizes coverage.
- Resource References Vary In Strength — A passing mention is weaker than a primary subject. The system must weight references by their strength and centrality to the page.
- Index Must Be Resource-Indexed — Once resources are identified, the index must support resource-based retrieval. Per-resource posting lists enable resource-aware queries.
Innovation
How The System Works
The system runs entity recognition on prose content, parses structured data markup separately, combines the resource identifications with strength weights, builds per-page resource records, and indexes resources for retrieval queries that target specific resources.
- Run Entity Recognition On Prose — Entity recognizer scans organic content for resource references. Output is candidate resource mentions with surface forms.
- Parse Structured Data Markup — Schema.org, microformat, and other structured markup are parsed. Output is explicit resource identifications with structured attributes.
- Resolve To Canonical Resources — Both prose mentions and structured identifications resolve to canonical resource IDs in the knowledge graph. Disambiguation handles ambiguous names.
- Weight By Reference Strength — Each identification gets a strength weight: primary subject, secondary mention, passing reference. Structured markup typically scores high; prose mentions vary.
- Build Per-Page Resource Record — Per page, accumulate all identified resources with weights into a structured record. The record represents the page's resource footprint.
- Index By Resource — Per resource, build posting lists of pages referencing it. Resource-indexed retrieval supports queries targeting specific resources.
- Serve Resource Queries — When queries target specific resources (entity queries, product queries), retrieval reads the resource index to find pages strongly associated with the resource.
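The steps above can be sketched as a small pipeline. This is a minimal illustration, not the patented implementation: the `ALIASES` table stands in for a knowledge graph, the substring matcher stands in for a real entity recognizer, and the two weight constants are invented for the example (the source gives no numeric values).

```python
from collections import defaultdict

# Hypothetical alias table standing in for a knowledge graph:
# lowercase surface form -> canonical resource ID.
ALIASES = {
    "eiffel tower": "res:eiffel_tower",
    "la tour eiffel": "res:eiffel_tower",
    "paris": "res:paris",
}

# Illustrative strength weights (assumptions, not values from the source).
STRUCTURED_WEIGHT = 1.0
PROSE_WEIGHT = 0.4


def recognize_prose(text):
    """Toy recognizer: return the alias surface forms found in the text."""
    lowered = text.lower()
    return [alias for alias in ALIASES if alias in lowered]


def build_record(prose, structured_names):
    """Combine prose and structured identifications into {resource_id: weight}."""
    record = {}
    for surface in recognize_prose(prose):
        rid = ALIASES[surface]
        record[rid] = max(record.get(rid, 0.0), PROSE_WEIGHT)
    for name in structured_names:
        rid = ALIASES.get(name.lower())
        if rid is not None:
            record[rid] = max(record.get(rid, 0.0), STRUCTURED_WEIGHT)
    return record


def build_index(pages):
    """Invert per-page records into per-resource posting lists, strongest first."""
    index = defaultdict(list)
    for url, (prose, structured_names) in pages.items():
        for rid, weight in build_record(prose, structured_names).items():
            index[rid].append((url, weight))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)


# Each page is (prose text, names declared in structured markup).
pages = {
    "https://example.com/tower": ("La Tour Eiffel opened in 1889.", ["Eiffel Tower"]),
    "https://example.com/trip": ("We flew to Paris and saw the Eiffel Tower.", []),
}
index = build_index(pages)
```

In this toy run, the page that declares the resource in markup heads the posting list for `res:eiffel_tower`, while the travel page that only mentions it in prose follows with the lower weight.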
Prose Plus Structured Identification
The patent's load-bearing combination is entity recognition on prose plus structured-data parsing. Either alone is incomplete; together they cover the full range of how pages reference resources.
Resource Is The Atom
Treat resources (entities, products, documents) as first-class index atoms. Pages become collections of resource references rather than just text blobs. Resource-aware retrieval follows naturally.
- Entity Recognition From Prose — Natural-language mentions of resources are extracted via entity recognition. Surface forms resolve to canonical IDs.
- Structured Data Parsing — Schema.org and microformat markup provide explicit, clean resource identification. Pages with strong markup are easily indexed.
- Weighted Combination — Identifications combine with strength weights. Primary subjects rank above passing mentions in the per-page resource record.
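One way to picture the weighted combination is a noisy-OR merge, where independent pieces of evidence for the same resource reinforce each other. The source does not say which combination rule is used, so the rule, the `Identification` type, and the strength values below are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    resource_id: str
    source: str      # "prose" or "structured"
    strength: float  # assumed to lie in [0, 1]


def combine(identifications):
    """Noisy-OR merge (an assumed rule): combined = 1 - product(1 - strength)."""
    miss = {}
    for ident in identifications:
        previous = miss.get(ident.resource_id, 1.0)
        miss[ident.resource_id] = previous * (1.0 - ident.strength)
    combined = {rid: 1.0 - m for rid, m in miss.items()}
    # Primary subjects (highest combined strength) come first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


idents = [
    Identification("res:acme_widget", "structured", 0.9),
    Identification("res:acme_widget", "prose", 0.5),
    Identification("res:acme_corp", "prose", 0.2),
]
ranked = combine(idents)
```

Here the widget, supported by both markup and prose, ends up near 0.95 and ranks above the company, which has only a weak prose mention.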
Technical Foundation
The patent specifies the entity recognizer, the structured-data parser, the resolution layer, the strength-weighting model, the per-page record, and the resource-indexed retrieval.
- Entity Recognizer — Neural recognizer identifies entity mentions in prose. Handles names, aliases, and disambiguation. Outputs surface forms with confidence.
- Structured Data Parser — Parses Schema.org, microformats, RDFa, JSON-LD. Outputs explicit resource identifications with structured attributes.
- Resource Resolution Layer — Maps surface forms to canonical resource IDs in the knowledge graph. Disambiguation uses surrounding context.
- Strength Weighting — Per identification, computes a strength weight based on position, frequency, structural prominence, and reference type. Primary subjects weight high; passing mentions weight low.
- Per-Page Resource Record — Structured record of all identified resources for the page, with strength weights and provenance (prose vs structured).
- Resource-Indexed Retrieval — Posting lists per resource enable retrieval targeting specific resources. Standard inverted-index techniques applied at resource granularity.
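The strength-weighting component can be sketched as a simple feature-based score. The four signals mirror the ones named above (position, frequency, structural prominence, reference type), but the `Mention` fields and every coefficient are hypothetical; the source does not give a formula.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    in_title: bool        # structural prominence
    in_heading: bool      # structural prominence
    first_offset: float   # position of first occurrence: 0.0 = top, 1.0 = bottom
    count: int            # frequency on the page
    from_markup: bool     # reference type: structured markup vs prose


def strength(m: Mention) -> float:
    """Illustrative linear score in [0, 1]; the coefficients are assumptions."""
    score = 0.0
    score += 0.35 if m.from_markup else 0.0
    score += 0.25 if m.in_title else 0.0
    score += 0.15 if m.in_heading else 0.0
    score += 0.15 * (1.0 - m.first_offset)   # earlier mentions count more
    score += 0.10 * min(m.count, 5) / 5      # frequency saturates at 5 mentions
    return round(score, 3)


primary = Mention(in_title=True, in_heading=True, first_offset=0.0, count=8, from_markup=True)
passing = Mention(in_title=False, in_heading=False, first_offset=0.9, count=1, from_markup=False)
```

Under these made-up coefficients, `strength(primary)` reaches the maximum of the scale, while `strength(passing)` stays close to zero, which is the ordering the weighting model is meant to produce.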
The Process
The pipeline runs as part of indexing. Per crawled page, entity recognition and structured-data parsing both run; output feeds the per-page resource record and the resource index.
- Crawl And Parse Page — Crawler ingests the page. Parser extracts both prose content and structured-data markup.
- Run Entity Recognition — Entity recognizer scans prose for resource mentions. Output is candidate mentions with surface forms and confidence.
- Parse Structured Data — Structured-data parser extracts explicit resource identifications from markup. Output is structured records with attributes.
- Resolve To Canonical IDs — Both prose and structured identifications resolve to canonical resource IDs. Ambiguous cases disambiguate via context.
- Apply Strength Weights — Per identification, strength weight is computed. Output is the weighted resource record.
- Update Per-Page Record — The page's resource record updates with the new identifications. Old identifications retire if no longer present.
- Update Resource Index — Per resource, posting list updates with the new page reference. Resource-indexed retrieval becomes more complete.
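The last two steps, updating the per-page record and keeping the resource index in sync, can be sketched as follows. The dictionary layouts and the `update_page` helper are assumptions made for the example; a production index would use a very different storage format.

```python
def update_page(index, page_records, url, new_record):
    """Replace a page's resource record and keep posting lists in sync.

    index:        {resource_id: {url: weight}}   (per-resource posting lists)
    page_records: {url: {resource_id: weight}}   (per-page resource records)
    """
    old_record = page_records.get(url, {})

    # Retire identifications that are no longer present on the page.
    for rid in old_record.keys() - new_record.keys():
        postings = index.get(rid, {})
        postings.pop(url, None)
        if not postings:
            index.pop(rid, None)

    # Add or refresh identifications found in the latest crawl.
    for rid, weight in new_record.items():
        index.setdefault(rid, {})[url] = weight

    page_records[url] = dict(new_record)


index, page_records = {}, {}
url = "https://example.com/a"

# First crawl: the page references resources x and y.
update_page(index, page_records, url, {"res:x": 0.9, "res:y": 0.3})

# Re-crawl: y has disappeared, z is new, x has a slightly lower weight.
update_page(index, page_records, url, {"res:x": 0.8, "res:z": 0.5})
```

After the re-crawl, `res:y` no longer has a posting list at all, and the page's entry under `res:x` carries the refreshed weight.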
Quality Control
Wrong resource identification produces wrong retrieval. The patent specifies safeguards.
- Entity Recognition Confidence Threshold — Low-confidence prose mentions are excluded. Wrong identifications would pollute the resource index.
- Structured Data Validation — Markup is validated against schema. Malformed or spammy markup is excluded from contribution to the resource record.
- Disambiguation Strictness — Ambiguous mentions require strong contextual signal to resolve. Weak disambiguation defaults to no identification rather than wrong identification.
- Weight Calibration — Strength weights are calibrated against engagement outcomes. Wrong weights would produce ranking issues; calibration aligns weights with empirical importance.
- Spam Filtering — Pages with markup spam (irrelevant resource claims) are demoted or excluded from resource indexing. Spam protection is critical for index quality.
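The confidence threshold and the "no identification rather than wrong identification" rule can be sketched as a single gate. Both constants and the margin-based test are assumptions for illustration; the source names the safeguards but not how they are computed.

```python
MIN_CONFIDENCE = 0.7  # hypothetical recognizer-confidence threshold
MIN_MARGIN = 0.2      # hypothetical: best candidate must beat the runner-up by this much


def resolve(mention_confidence, candidates):
    """Return a canonical resource ID, or None when the evidence is too weak.

    candidates: {resource_id: contextual score} for one surface form.
    """
    if mention_confidence < MIN_CONFIDENCE or not candidates:
        return None  # low-confidence mentions are dropped outright

    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

    if best_score - runner_up < MIN_MARGIN:
        return None  # weak disambiguation: prefer no identification
    return best_id


clear = resolve(0.9, {"res:jaguar_car": 0.8, "res:jaguar_animal": 0.3})
ambiguous = resolve(0.9, {"res:jaguar_car": 0.55, "res:jaguar_animal": 0.5})
unsure = resolve(0.4, {"res:jaguar_car": 0.8, "res:jaguar_animal": 0.3})
```

Only the first call resolves to a resource. The second is rejected because the two readings of "jaguar" score too close together, and the third because the recognizer itself was not confident in the mention.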
Real-World Application
Resource identification underpins how Google indexes entities and products across the web, enabling Knowledge Panel triggering, product search, and entity-aware retrieval across surfaces.
- Dual-source Identification Method — Both prose entity recognition and structured-data parsing contribute. Coverage spans pages with and without markup.
- Strength-weighted Reference Quality — Primary subjects weight high; passing mentions weight low. Per-page records reflect resource centrality.
- Resource-indexed Retrieval Model — Per-resource posting lists enable retrieval queries targeting specific entities, products, or documents.
Why Schema Markup Is A Discoverability Lever
Pages with explicit Schema.org markup contribute clean, high-strength resource identifications to the index. Strong markup coverage compounds discoverability across entity and product queries.
Why Entity-Centered Pages Win Resource Queries
Pages centered on a single entity (with the entity as primary subject) score high on strength weighting and rank well in resource queries. Pages with the entity as a passing mention rank far lower despite mentioning it.
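A resource query against per-resource posting lists might look like the sketch below. The `search_resource` function, the index layout, and the sample weights are hypothetical; they only illustrate why a page with the entity as its primary subject surfaces ahead of one with a passing mention.

```python
def search_resource(index, resource_id, min_strength=0.0, limit=10):
    """Return (url, weight) pairs for a resource, strongest association first."""
    postings = index.get(resource_id, {})
    hits = [(url, weight) for url, weight in postings.items() if weight >= min_strength]
    # Sort by descending weight; break ties by URL for a stable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:limit]


# Hypothetical index: {resource_id: {url: strength weight}}.
index = {
    "res:eiffel_tower": {
        "https://example.com/eiffel-tower-guide": 0.95,  # entity is the primary subject
        "https://example.com/paris-weekend": 0.20,       # passing mention
        "https://example.com/tower-history": 0.80,
    }
}
top = search_resource(index, "res:eiffel_tower", min_strength=0.5)
```

With a minimum strength of 0.5, the weekend-trip page is filtered out even though it mentions the tower, and the dedicated guide ranks first.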
What This Means for SEO
The patent identifies the resources (entities, products, documents) a page is about by combining entity recognition on prose with structured-data parsing, weighting each identification by strength. SEO implication: explicit Schema.org markup plus entity-centered prose together make your page a high-confidence target for resource-aware retrieval.
- Schema Markup Is A Discoverability Lever — Pages with explicit Schema.org markup contribute clean, high-strength resource identifications to the index. Strong markup coverage compounds discoverability across entity and product queries, so structured data is direct visibility work.
- Entity-Centered Pages Win Resource Queries — Pages with an entity as the primary subject score high on strength weighting and rank well in resource queries. Pages where the entity is a passing mention rank far lower. Center each page on its primary resource.
- Prose And Markup Are Read Together — The combination of entity recognition on prose plus structured-data parsing covers the full range of how pages reference resources. Aligning your prose and your markup so both name the same resources reinforces a strong, consistent identification.
- Strength Weighting Rewards Prominence — Resource identifications carry strength weights. A resource named in the title, headings, and throughout the body weighs heavily; one buried in a footnote weighs little. Make your primary resource prominent across the page.
- Resources Are Index Atoms — The system treats resources as first-class index atoms, indexing pages as collections of resource references. Thinking of your pages as being about specific resources, and making those explicit, aligns with how resource-aware retrieval works.
- Markup Disambiguates Ambiguous Mentions — Structured data resolves which specific entity a name refers to. For ambiguous names, markup that pins the exact entity prevents misidentification and makes you the confident retrieval target for that resource.
- Comprehensive Coverage Builds Resource Records — The system builds per-page resource records from combined signals. Covering a resource thoroughly in both prose and markup produces a richer record, strengthening your standing on queries targeting that resource.
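To check that prose and markup name the same resources, a page can be audited with a short script. The sketch below uses only Python's standard library to pull `name` values out of JSON-LD blocks and test whether each appears in the visible text. The helper names and the sample page are invented for the example, and the check is a rough self-audit, not a model of how any search engine evaluates markup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed markup contributes nothing
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_resources(html):
    """Return (type, name) pairs for top-level JSON-LD items declaring both."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        (item["@type"], item["name"])
        for item in parser.items
        if isinstance(item, dict) and "@type" in item and "name" in item
    ]


def markup_prose_alignment(html, prose):
    """Map each marked-up name to whether it also appears in the prose."""
    lowered = prose.lower()
    return {name: name.lower() in lowered for _, name in extract_resources(html)}


html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Acme Widget"}
</script>
<script type="application/ld+json">{not valid json}</script>
</head><body><p>The Acme Widget ships today.</p></body></html>
"""
resources = extract_resources(html)
alignment = markup_prose_alignment(html, "The Acme Widget ships today.")
```

The malformed second block is skipped rather than raising an error. The sketch ignores nested items and `@graph` containers, so a real audit would need to walk the JSON-LD structure more thoroughly.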