The patent uses anchor text as a parallel corpus for cross-language information retrieval. It is foundational for multilingual search: anchor text in one language pointing at content in another language creates implicit translation pairs.
Patent Overview
- Inventor: Monika H. Henzinger, others
- Assignee: Google Inc.
- Filed: 2003
- Granted: 2006-12-05
The Challenge
Cross-language information retrieval needs translation pairs. Manually curated parallel corpora are limited. The web naturally produces translation pairs: anchor text in one language linking to documents in another. Mining these pairs builds a massive parallel corpus for cross-language IR.
- Manual Parallel Corpora Are Limited — Curated translation pairs scale poorly. Web-derived pairs scale naturally.
- Anchor Text Crosses Languages — Per link, anchor text in source-page language; target document in target language. Implicit translation pair.
- Mining Must Generalize — Per link, mining must work across all language pairs.
- Quality Validation Required — Web anchor text is noisy. Validation against linguistic models matters.
- Cross-Language IR Benefits Directly — Per cross-language query, derived parallel corpus enables translation and matching.
Innovation
How The System Works
The system identifies anchor-text-to-target-document pairs across languages, validates pairs against linguistic models, builds parallel corpora per language pair, trains cross-language IR models, and applies models for cross-language search.
- Crawl Link Graph — Crawler discovers anchor-text-to-target-document pairs.
- Detect Language Pair — Per pair, source-text and target-document languages detected.
- Filter For Cross-Language Pairs — Pairs where source and target languages differ retained.
- Validate Translation Quality — Per pair, validate against linguistic models.
- Build Parallel Corpus — Per language pair, validated pairs form corpus.
- Train Cross-Language Models — Per corpus, train translation and IR models.
- Apply In Cross-Language Search — Per cross-language query, models drive translation and ranking.
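The detection, filtering, and corpus-building steps above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the `LinkPair` type, the `build_corpora` function, and the stopword-based `detect_language` are all hypothetical names, and the tiny stopword lists only stand in for a real language detector. The validation step is left out here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkPair:
    anchor_text: str  # text of the link on the source page
    target_text: str  # title or extracted text of the linked document


# Toy stopword lists standing in for a real language detector (assumption).
STOPWORDS = {
    "en": {"the", "of", "and", "in", "for"},
    "de": {"der", "die", "das", "und", "für"},
    "fr": {"le", "la", "les", "et", "pour"},
}


def detect_language(text: str) -> Optional[str]:
    """Return the language whose stopwords overlap the text most, or None."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_corpora(pairs):
    """Group cross-language pairs by (source language, target language)."""
    corpora = defaultdict(list)
    for pair in pairs:
        src = detect_language(pair.anchor_text)
        tgt = detect_language(pair.target_text)
        # Drop pairs with an undetected language or the same language on both sides.
        if src is None or tgt is None or src == tgt:
            continue
        corpora[(src, tgt)].append(pair)
    return dict(corpora)


pairs = [
    LinkPair("the history of the city", "Die Geschichte der Stadt"),
    LinkPair("the weather in the north", "Weather for the week"),
    LinkPair("pour les enfants", "Spiele für die Kinder"),
]
corpora = build_corpora(pairs)
print(sorted(corpora))
```

Keying the result by language pair mirrors the per-language-pair corpora described above; the same-language pair in the sample data is discarded by the filter.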
Web Produces Natural Translation Pairs
The patent's load-bearing idea is that the web's own anchor-link structure produces translation pairs at massive scale. Mining them yields parallel corpora that exceed manually curated ones in size by orders of magnitude.
Implicit Translation Pairs
Per cross-language link, anchor text in source language and target document in target language form an implicit translation pair. The pattern is structural.
- Anchor-Text Mining — Per link, anchor text mined as translation primitive.
- Language-Pair Detection — Per pair, language combination detected.
- Validated Parallel Corpus — Per language pair, validated pairs build corpus.
Technical Foundation
The patent specifies the link miner, language detector, cross-language filter, validator, corpus builder, model trainer, and application layer.
- Link Miner — Per crawl, discovers anchor-text-target pairs.
- Language Detector — Per pair, detects source and target languages.
- Cross-Language Filter — Retains pairs with different languages.
- Validator — Per pair, validates against linguistic models.
- Corpus Builder — Per language pair, builds parallel corpus.
- Model Trainer — Trains translation and cross-language IR models.
The Process
Mining runs at crawl time; model training runs offline; application runs per query.
- Crawl And Mine — Anchor-text-target pairs discovered.
- Detect Languages — Per pair, languages detected.
- Filter Cross-Language — Cross-language pairs retained.
- Validate — Pairs validated against linguistic models.
- Build Corpus — Parallel corpus built per language pair.
- Train Models — Cross-language IR models trained.
- Apply — Models drive cross-language search.
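To make the split between offline training and per-query application concrete, here is a rough sketch. The summary does not say what kind of model is trained, so this example assumes the simplest possible one: a token co-occurrence table. The function names `train_translation_table` and `translate_query` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def train_translation_table(pairs):
    """Offline step: count how often each anchor token co-occurs with each target token."""
    cooccur = defaultdict(Counter)
    for anchor_text, target_text in pairs:
        src_tokens = set(anchor_text.lower().split())
        tgt_tokens = set(target_text.lower().split())
        for token in src_tokens:
            cooccur[token].update(tgt_tokens)
    return cooccur


def translate_query(query, table, top_k=1):
    """Per-query step: map each query token to its most frequent co-occurring target tokens."""
    translated = []
    for token in query.lower().split():
        candidates = table.get(token)
        if not candidates:
            translated.append(token)  # keep unknown tokens unchanged
            continue
        translated.extend(word for word, _ in candidates.most_common(top_k))
    return translated


# Hypothetical validated English-to-German pairs (anchor text, target text).
validated_pairs = [
    ("train timetable", "fahrplan zug"),
    ("train station", "bahnhof zug"),
    ("station hotel", "bahnhof hotel"),
]
table = train_translation_table(validated_pairs)
print(translate_query("train station", table))
```

Raw co-occurrence counts are a crude stand-in for a trained translation model, but they show why more validated pairs help: each additional pair sharpens the counts that separate a true translation from an incidental neighbour.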
Quality Control
Web anchor text is noisy. The patent specifies safeguards.
- Linguistic Validation — Per pair, linguistic validation filters noise.
- Frequency Thresholds — Per pair, minimum frequency required.
- Manipulation Detection — Spam-anchor patterns filtered.
- Per-Language-Pair Calibration — Per language pair, models calibrated separately.
- Continuous Refresh — Corpus and models refresh against fresh data.
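Two of the safeguards above, frequency thresholds and filtering of uninformative anchors, can be illustrated with a short sketch. The `GENERIC_ANCHORS` list, the `filter_pairs` function, the `min_count` default, and the example URL are assumptions made for illustration; the summary gives no concrete thresholds or patterns.

```python
from collections import Counter

# Hypothetical list of generic anchors that carry no translation signal.
GENERIC_ANCHORS = {"click here", "here", "read more", "link", "home"}


def filter_pairs(pairs, min_count=2):
    """Keep (anchor, target_url) pairs that are non-generic and seen at least min_count times."""
    # Normalize case and surrounding whitespace so trivial variants count together.
    normalized = [(anchor.strip().lower(), url) for anchor, url in pairs]
    counts = Counter(
        pair for pair in normalized if pair[0] and pair[0] not in GENERIC_ANCHORS
    )
    return {pair: n for pair, n in counts.items() if n >= min_count}


observed = [
    ("Fahrplan", "https://example.com/timetable"),
    ("fahrplan ", "https://example.com/timetable"),
    ("click here", "https://example.com/timetable"),
    ("click here", "https://example.com/timetable"),
    ("Zugfahrplan", "https://example.com/timetable"),
]
print(filter_pairs(observed))
```

This sketch counts individual links. Counting distinct linking sites instead would probably resist manipulation better, since a single site can repeat the same anchor many times.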
Real-World Application
Anchor-text-derived parallel corpora underpin modern multilingual search and translation systems. The pattern of web-mined translation pairs is foundational across cross-language IR.
- Web-mined Corpus Source — Per crawl, translation pairs mined from anchor links.
- Per-language-pair Coverage — Each language pair gets its own corpus and models.
- Validated Quality Gate — Linguistic validation filters web noise.
Why Cross-Lingual Anchor Linking Helps Discovery
Pages that earn cross-language anchor links contribute to translation pair generation. Earning anchors from sites in target languages improves the cross-language discoverability infrastructure for your content.
Why Translation Quality Compounds From Source Quality
Content that serves as a translation-pair source needs to be high quality, because anchor-derived translations inherit the quality of the source content. Investing in source-language content quality therefore compounds across translation systems.
What This Means for SEO
Anchor text in one language pointing at content in another is mined as implicit translation pairs to build parallel corpora for cross-language retrieval. SEO implication: earning cross-language anchor links and keeping source content high quality improves your cross-language discoverability.
- Cross-Language Anchor Links Aid Discovery — Anchor text linking across languages becomes translation-pair data. Earning anchors from sites in your target languages improves the cross-language discoverability infrastructure for your content. Pursue links from the language communities you want to reach.
- Source Content Quality Propagates — Anchor-derived translations inherit source-content quality. Pages used as translation-pair sources should be high quality, because poor source content yields poor translation pairs. Invest in source-language content quality to compound across translation systems.
- Descriptive Anchors Make Better Pairs — The mining relies on anchor text describing the target. Earning descriptive, meaningful anchor text (rather than generic 'click here') produces cleaner translation pairs that better represent your content across languages.
- Multilingual Linking Builds Reach — Mining generalizes across all language pairs. A multilingual link profile contributes pairs across many languages, broadening the cross-language contexts in which your content can surface. Cultivate links from diverse language sources.
- Noisy Anchors Add Little — Web anchor text is noisy and validated against linguistic models. Spammy or irrelevant cross-language anchors do not pass validation. Earn genuine, relevant cross-language links rather than manufactured ones.
- Local-Language Authority Attracts Right Links — To earn anchors from a language community, you need to be relevant and authoritative to it. Building genuine value for a target-language audience attracts the cross-language links that feed the parallel corpus.
- Quality Translation Surfacing Is The Payoff — Derived corpora enable cross-language query matching and translation. High-quality, well-linked content participates in cross-language retrieval, surfacing for queries in languages you did not directly target. This widens your effective reach.