Identifies document language via the link context surrounding inbound anchors. Multilingual retrieval primitive — the language a page is in can be inferred from the language patterns of pages linking to it.
Patent Overview
- Inventor: Trystan G. Upstill, others
- Assignee: Google LLC
- Filed: 2012
- Granted: 2015-08-04
The Challenge
Language identification typically relies on a document's own content. Content-only identification can fail on short documents, mixed-language documents, and technical documents whose vocabulary is ambiguous across languages. Link context, meaning the language of the pages that link to a target, provides a complementary signal.
- Content-Only Identification Fails On Edge Cases — Per document, short content, mixed-language content, or technical content can defeat content-only language ID.
- Link Context Provides External Signal — Per inbound link, source-page language indicates which language community references the target.
- Multilingual Documents Are Common — Per document, multilingual pages still need a language assessment, which content alone may not settle.
- Aggregation Across Links Denoises — Per document, aggregating source-language across many inbound links denoises signal.
- Content And Link Signals Combine — Per document, both signals combine for richer language identification.
Innovation
How The System Works
The system identifies the source-page language for each inbound link, aggregates those languages per target document, combines the result with the content-derived language signal, and produces a composite language identification.
- Identify Source-Page Language — Per source page of inbound link, identify source language.
- Capture Link Context — Per inbound link, language of source page captured.
- Aggregate Across Inbound Links — Per target document, source-language distribution aggregated.
- Compute Content-Derived Language — Per target document, content-language identification runs.
- Combine Signals — Per document, link-context plus content-derived signals combine.
- Produce Language Identification — Per document, composite language identification output.
- Feed Into Multilingual Retrieval — Language identification feeds multilingual retrieval and ranking.
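The capture and aggregation steps above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `aggregate_link_languages` and the choice of a normalized frequency distribution are assumptions, and the source-page language labels are taken as already identified.

```python
from collections import Counter


def aggregate_link_languages(source_languages):
    """Turn per-link source-page language labels into a normalized distribution.

    Hypothetical helper: the summary does not say how aggregation is computed,
    so a plain relative frequency is assumed here.
    """
    counts = Counter(source_languages)
    total = sum(counts.values())
    if total == 0:
        # No inbound links: there is no link-context signal at all.
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Four inbound links; three come from German pages, one from an English page.
inbound = ["de", "de", "en", "de"]
link_dist = aggregate_link_languages(inbound)
print(link_dist)  # {'de': 0.75, 'en': 0.25}
```

A single stray English link barely moves the distribution, which is the denoising effect that aggregation is meant to provide.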
Link Context Complements Content
The patent's load-bearing idea is that the link-context language signal complements the content-derived signal. Aggregating source languages across inbound links produces robust identification where content alone fails.
Combined Signal Beats Single Source
Per document, the content and link-context signals together produce a richer identification than either signal alone.
- Source-Page Language — Per inbound link, source language captured.
- Aggregate Across Links — Per target document, source-language distribution aggregated.
- Content + Link Combination — Both signals combine for composite identification.
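One way to picture the combination step is a weighted blend of two language distributions. The sketch below assumes linear interpolation with a hypothetical `link_weight` of 0.3; the summary above does not state the actual combination rule, so both the rule and the weight are illustrative only.

```python
def combine_language_signals(content_dist, link_dist, link_weight=0.3):
    """Blend content-derived and link-derived language distributions.

    Assumed rule: linear interpolation. If one signal is missing, the other
    is returned unchanged.
    """
    if not link_dist:
        return dict(content_dist)
    if not content_dist:
        return dict(link_dist)
    langs = set(content_dist) | set(link_dist)
    return {
        lang: (1 - link_weight) * content_dist.get(lang, 0.0)
        + link_weight * link_dist.get(lang, 0.0)
        for lang in langs
    }


def best_language(dist):
    """Return the highest-scoring language, or None for an empty distribution."""
    return max(dist, key=dist.get) if dist else None


# A short page whose content is ambiguous between English and German.
content = {"en": 0.5, "de": 0.5}
# Its inbound links come mostly from German pages.
links = {"de": 0.75, "en": 0.25}

combined = combine_language_signals(content, links)
print(best_language(combined))  # de
```

Here content alone is a coin flip, and the link-context distribution breaks the tie toward German.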
Technical Foundation
The patent specifies six components: a source-language identifier, a link-context capturer, an aggregator, a content-language identifier, a combiner, and a retrieval integrator.
- Source-Language Identifier — Per source page, language identified.
- Link-Context Capturer — Per inbound link, source language captured.
- Aggregator — Per target document, source-language distribution aggregated.
- Content-Language Identifier — Per document, content language identified.
- Combiner — Per document, content + link signals combined.
- Retrieval Integrator — Language ID feeds multilingual retrieval.
The Process
Per document, language identification runs at indexing time.
- Identify Source Languages — Per source page, language identified.
- Capture Link Context — Per inbound link, source language captured.
- Aggregate Per Target — Per target document, source-language aggregated.
- Content-Language Identify — Per target, content language identified.
- Combine — Composite language identification computed.
- Cache — Per document, language ID cached.
- Apply — Multilingual retrieval consumes ID.
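The indexing-time flow, with the result cached for retrieval to read later, might look like the sketch below. `LanguageIndex` and `identify_language` are hypothetical names, and the majority vote is a toy stand-in for whatever composite rule the system actually uses.

```python
from collections import Counter


def identify_language(content_guess, source_languages):
    """Toy composite rule: one vote per inbound link plus one for the content guess."""
    votes = Counter(source_languages)
    if content_guess is not None:
        votes[content_guess] += 1
    return votes.most_common(1)[0][0] if votes else None


class LanguageIndex:
    """Computes a language ID once at indexing time and caches it per document."""

    def __init__(self):
        self._cache = {}

    def index_document(self, doc_id, content_guess, source_languages):
        lang = identify_language(content_guess, source_languages)
        self._cache[doc_id] = lang
        return lang

    def language_of(self, doc_id):
        # Retrieval reads the cached ID instead of recomputing it per query.
        return self._cache.get(doc_id)


index = LanguageIndex()
index.index_document("doc-1", "en", ["fr", "fr", "fr"])
print(index.language_of("doc-1"))  # fr
```

Doing the work at indexing time keeps the per-query cost to a dictionary lookup.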
Quality Control
Wrong language identification corrupts multilingual retrieval. The patent specifies safeguards.
- Source-Language Accuracy — Per source page, language ID validated.
- Aggregate Threshold — Per document, minimum inbound-link count for link-context contribution.
- Combined-Signal Validation — Per document, combined identification validated against held-out data.
- Multilingual-Document Handling — Per document, multilingual content recognized separately.
- Continuous Recalibration — Models refresh against fresh data.
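Two of these safeguards, the aggregate threshold and multilingual-document handling, are easy to illustrate. The cutoffs below (`min_links=5`, `max_weight=0.3`, `secondary_share=0.3`) are invented for the example and do not come from the patent.

```python
def link_weight_for(num_inbound_links, min_links=5, max_weight=0.3):
    """Gate the link-context signal: it contributes nothing below min_links."""
    return max_weight if num_inbound_links >= min_links else 0.0


def is_multilingual(dist, secondary_share=0.3):
    """Flag a document whose second-strongest language holds a large share."""
    shares = sorted(dist.values(), reverse=True)
    return len(shares) > 1 and shares[1] >= secondary_share


print(link_weight_for(2))    # 0.0, too few links to trust
print(link_weight_for(12))   # 0.3
print(is_multilingual({"en": 0.55, "fr": 0.45}))  # True
print(is_multilingual({"en": 0.95, "fr": 0.05}))  # False
```

The gate keeps a handful of unrepresentative links from overriding a confident content-based result.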
Real-World Application
Link-context language identification underpins multilingual retrieval at web scale. The pattern of source-language aggregation complementing content-language signal informs modern multilingual search.
- Multi-source Signal Combination — Content and link signals combine.
- Per-document Granularity — Each document gets composite language ID.
- Aggregate Robustness Pattern — Aggregating source-language across inbound links denoises signal.
Why Cross-Lingual Link Patterns Matter
Pages linked from other-language sources signal cross-language relevance. Per document, link context reveals which language communities engage the content.
Why Clear Per-Page Language Signals Win
Per document, clear single-language content (or clearly delimited multi-language sections) produces strong identification. Mixed-language content without clear separation degrades the signal.
What This Means for SEO
A document's language is inferred partly from the language of pages linking to it, aggregated across inbound links, complementing content-based detection. SEO implication: clear single-language content and links from the right language communities reinforce correct language identification.
- Keep Each Page In One Clear Language — Content-only detection fails on mixed-language pages, and link context then has to compensate. Single-language pages, or cleanly delimited language sections, produce strong, unambiguous identification. Avoid blending languages on one page.
- Inbound Links Signal Your Language Community — The source-page language of your inbound links indicates which language community engages you. Earning links from sites in your target language reinforces that you serve that audience and language.
- Cross-Lingual Links Signal Cross-Language Relevance — Links from other-language sources mark cross-language relevance. If you serve multiple language audiences, links from each language community help the system understand your multilingual reach.
- Aggregation Rewards Consistency — Language signal is aggregated across many inbound links to denoise. A consistent inbound-link language profile produces a clean signal; a noisy, contradictory one weakens it. Build links from the communities you actually serve.
- Help Content Detection On Edge Cases — Short or technical pages can defeat content-only language ID. For thin or jargon-heavy pages, explicit language signals and same-language inbound links matter more, since content alone gives the system little to work with.
- Content And Link Signals Should Agree — The two signals combine for richer identification. When content language and inbound-link language point the same way, the result is most reliable; a mismatch creates ambiguity. Keep them aligned.
- Localize To Earn Local Links — Genuinely localized content earns links from local-language sources, which in turn reinforces correct language identification. Localization is both a content and a link-acquisition strategy for multilingual discoverability.
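For a site owner, the consistency of an inbound-link language profile can be checked roughly with a dominant-language share and an entropy figure. This is an auditing sketch under stated assumptions: `link_language_profile` is a hypothetical helper, the language labels for linking pages are assumed to come from your own backlink data, and nothing here reflects how a search engine scores the signal.

```python
import math
from collections import Counter


def link_language_profile(source_languages):
    """Summarize how consistent the languages of linking pages are.

    Returns the dominant language, its share of links, and the Shannon
    entropy in bits (0.0 means every link comes from a single language).
    """
    counts = Counter(source_languages)
    total = sum(counts.values())
    if total == 0:
        return None
    shares = {lang: n / total for lang, n in counts.items()}
    dominant = max(shares, key=shares.get)
    entropy = -sum(p * math.log2(p) for p in shares.values())
    return {
        "dominant": dominant,
        "dominant_share": shares[dominant],
        "entropy_bits": entropy,
    }


consistent = link_language_profile(["es", "es", "es", "es"])
mixed = link_language_profile(["es", "en"])
print(consistent["dominant"], consistent["dominant_share"])  # es 1.0
print(mixed["entropy_bits"])  # 1.0
```

A high dominant share with low entropy corresponds to the clean, consistent profile described above; an even split across languages is the noisy case.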