Builds an index of quoted statements attributed to recognized entities and makes those quotes searchable by speaker, topic, time, or context, so the engine can return 'what did X say about Y' as a direct quote rather than as a document scan.
Patent Overview
- Filed: 2014-04-21
- Granted: 2019-02-05
- Application Number: US 14/257,866
The Challenge
Quotes attributed to public figures, experts, and organizations are widely repeated across the web, but document search treats them as ordinary text. Users wanting 'what did X say about Y' get a document list, not the quote. The system needed a dedicated quote-indexing layer.
- Quotes Are A Distinct Retrieval Object — A quote is a self-contained statement with a speaker and an utterance. Treating it as a typed object enables retrieval that document search cannot match.
- Attribution Is Often Implicit — Quotes appear with attribution patterns like 'said John Smith', '"...," Smith remarked', or 'According to Smith,'. Extracting the speaker reliably requires pattern detection plus entity resolution.
- The Same Quote Repeats Across The Web — Influential quotes propagate. Deduplication and canonicalization are needed so the system tracks the quote, not its many copies.
- Topic And Time Filter Meaningful Queries — Users want quotes by topic, period, or context. The index must support these filter dimensions, not just speaker lookup.
- Quote Authenticity Needs Verification — Misattribution is rampant. The system must weight quotes by source authority so that high-authority attributions outweigh low-authority ones.
Innovation
How The System Works
The pipeline detects quoted statements in crawled documents, identifies the speaker via attribution-pattern extraction and entity resolution, canonicalizes quotes across duplicates, indexes them by speaker plus topic plus time, and supports retrieval by any combination of those facets.
- Detect Quoted Spans — A quote detector identifies spans wrapped in quotation marks or signaled by reporting verbs (said, claimed, noted). Spans become candidate quotes.
- Extract Attribution — An attribution extractor identifies the speaker from surrounding text. Patterns like 'said X', 'X said', 'according to X' anchor the speaker identification.
- Resolve Speaker To Entity — The speaker string is resolved to a canonical entity ID via the entity recognizer. The quote is then tied to a known person, organization, or other entity.
- Classify Topic And Context — The quote text is analyzed for topical categories. Surrounding article context provides additional signal: the article's topic, date, and angle inform the quote's context.
- Canonicalize Across Duplicates — Quotes that appear multiple times across documents are merged. The canonical record tracks all source documents and computes an authority score.
- Index By Facets — Canonical quotes are indexed by speaker entity, topic, time range, and source authority. Multi-faceted retrieval supports rich queries.
- Surface At Query Time — When a user asks 'what did X say about Y', the index returns matching quotes ranked by authority, recency, and topical fit. Source documents are linked for verification.
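The detection and attribution steps above can be sketched with a few regular expressions. This is a minimal illustration, not the patent's method: the patterns, the `CandidateQuote` type, the 60-character context window, and the function names are all assumptions made for the example. It handles only straight double quotes and simple capitalized names.

```python
import re
from typing import List, NamedTuple, Optional

# Illustrative patterns only; real attribution needs far broader coverage.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
VERB = r"(?:said|claimed|noted|remarked)\b"

QUOTED = re.compile(r'"([^"]+)"')

# Attribution placed before the quote: 'According to X, "..."' or 'X said, "..."'
BEFORE = [
    re.compile(r"[Aa]ccording to (?P<name>" + NAME + r"),?\s*$"),
    re.compile(r"(?P<name>" + NAME + r")\s+" + VERB + r"[,:]?\s*$"),
]
# Attribution placed after the quote: '"...," said X' or '"...," X remarked'
AFTER = [
    re.compile(r"\s*" + VERB + r"\s+(?P<name>" + NAME + r")"),
    re.compile(r"\s*(?P<name>" + NAME + r")\s+" + VERB),
]


class CandidateQuote(NamedTuple):
    text: str
    speaker: Optional[str]  # None when no attribution pattern matched


def find_speaker(before: str, after: str) -> Optional[str]:
    """Try patterns adjacent to the quote, preceding context first."""
    for pattern in BEFORE:
        hit = pattern.search(before)
        if hit:
            return hit.group("name")
    for pattern in AFTER:
        hit = pattern.match(after)
        if hit:
            return hit.group("name")
    return None


def extract_quotes(text: str, window: int = 60) -> List[CandidateQuote]:
    """Find quoted spans and attach a speaker string where one is detectable."""
    out = []
    for m in QUOTED.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        statement = m.group(1).rstrip(" ,.")
        out.append(CandidateQuote(statement, find_speaker(before, after)))
    return out
```

The speaker returned here is still a raw string. Resolving it to a canonical entity ID is a separate step, and candidates whose speaker is `None` would be dropped before indexing.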
Quote As First-Class Object
The patent's load-bearing idea is to extract quotes from text and treat each one as a structured object with speaker, statement, topic, time, and source. That structure enables retrieval that document search cannot match.
Speaker Plus Statement Plus Context
A quote is not just text. It carries a speaker, a statement, a context, and a provenance. Capturing all four dimensions makes quotes queryable in ways their source documents are not.
- Attribution Extraction — Pattern-based extractors identify the speaker from surrounding text. Robust extraction across many writing styles is what makes the pipeline work at scale.
- Canonicalization — The same quote repeating across many documents collapses to one canonical record with provenance from all sources.
- Faceted Indexing — Speaker, topic, time, and source authority are all indexed. Users can filter and combine facets in queries.
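One way to picture a quote as a typed object with combinable facets is a small in-memory record plus index. The field names, the `FacetIndex` class, and the set-intersection lookup below are assumptions for illustration. The source says only that the real index sits on distributed inverted-index infrastructure.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class QuoteRecord:
    quote_id: str
    speaker_id: str            # canonical entity ID (hypothetical format)
    statement: str
    topics: FrozenSet[str]
    said_on: date
    sources: Tuple[str, ...]   # provenance: URLs of source documents
    authority: float           # aggregate source authority in [0, 1]


class FacetIndex:
    """Toy multi-facet index: one posting set per speaker and per topic."""

    def __init__(self) -> None:
        self._records = {}
        self._by_speaker = defaultdict(set)
        self._by_topic = defaultdict(set)

    def add(self, rec: QuoteRecord) -> None:
        self._records[rec.quote_id] = rec
        self._by_speaker[rec.speaker_id].add(rec.quote_id)
        for topic in rec.topics:
            self._by_topic[topic].add(rec.quote_id)

    def search(self, speaker_id: Optional[str] = None,
               topic: Optional[str] = None,
               since: Optional[date] = None,
               min_authority: float = 0.0) -> List[QuoteRecord]:
        # Intersect posting sets for the facets the caller supplied.
        ids = set(self._records)
        if speaker_id is not None:
            ids &= self._by_speaker.get(speaker_id, set())
        if topic is not None:
            ids &= self._by_topic.get(topic, set())
        hits = [self._records[i] for i in ids]
        hits = [r for r in hits
                if r.authority >= min_authority
                and (since is None or r.said_on >= since)]
        # Newest first; a real ranker would also weigh authority and topical fit.
        return sorted(hits, key=lambda r: r.said_on, reverse=True)
```

With this shape, a query like 'what did X say about Y' becomes `index.search(speaker_id=X, topic=Y)`, and each hit carries its own provenance in `sources`.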
Technical Foundation
The patent specifies the quote detector, the attribution extractor, the canonicalization layer, the multi-facet index, and the query interface.
- Quote Detector — Detects quoted spans using quotation marks, reporting verbs, and discourse patterns. Output is candidate quote spans with confidence scores.
- Attribution Extractor — Identifies speaker spans from surrounding context. Uses linguistic patterns and entity-recognition models trained on quote-attributed text.
- Entity Resolver — Maps speaker strings to canonical entity IDs. Disambiguation handles common names; entity context narrows ambiguity.
- Canonicalization Layer — Near-duplicate detection merges variant phrasings of the same quote. Canonical records track all source documents.
- Faceted Index — Multi-facet index supports filtering by speaker, topic, time range, and source authority. Built on top of distributed inverted-index infrastructure.
- Authority Scoring — Quotes from high-authority sources (primary reporting, official channels) outrank quotes from low-authority republishers. Source authority is a per-document attribute.
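The source does not say which near-duplicate technique the canonicalization layer uses. As one plausible sketch, word-shingle Jaccard similarity can decide whether two phrasings are the same quote. The shingle size of 3 and the 0.6 threshold are arbitrary illustrative choices.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a case- and punctuation-normalized quote."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(q1: str, q2: str, threshold: float = 0.6) -> bool:
    """Treat two quote strings as variants of one quote above the threshold."""
    return jaccard(shingles(q1), shingles(q2)) >= threshold
```

Pairwise comparison like this does not scale to a web-sized store. A production system would typically put a hashing scheme such as MinHash in front of it to shortlist candidates.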
The Process
The pipeline runs as part of the indexing path. Each crawled document is analyzed for quotes; the canonical quote store updates continuously.
- Document Enters Indexing — A crawled document is analyzed alongside standard indexing. The quote pipeline runs as a parallel analyzer.
- Detect Candidate Quotes — The detector finds all quoted spans and reporting-verb statements in the document. Each becomes a candidate.
- Extract Attribution Per Candidate — For each candidate, the attribution extractor identifies the speaker. Candidates without resolvable attribution are dropped.
- Resolve Speaker — Speaker strings resolve to entity IDs. Unresolved speakers are logged for entity-pipeline review.
- Canonicalize — The new quote is compared to the canonical store. Near-duplicates merge into existing records; novel quotes create new records.
- Update Indexes — Faceted indexes update to reflect the new canonical record. Authority scores recompute as new sources cite the quote.
- Serve At Query Time — Quote-search queries hit the faceted index. Results are scored, filtered, and rendered with source attribution.
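The canonicalize and update steps can be sketched as a small store that either merges an incoming quote into an existing record or creates a new one. Everything here is assumed for illustration: the class names, exact matching on normalized text in place of real near-duplicate detection, and the noisy-OR rule for combining per-document authority. The source states only that authority scores recompute as new sources cite the quote.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial variants share one key."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


@dataclass
class CanonicalQuote:
    speaker_id: str
    text: str
    sources: Dict[str, float] = field(default_factory=dict)  # URL -> authority in [0, 1]

    @property
    def authority(self) -> float:
        # Noisy-OR (an assumed rule): each source adds evidence,
        # and weak republishers add little.
        miss = 1.0
        for a in self.sources.values():
            miss *= 1.0 - a
        return 1.0 - miss


class CanonicalStore:
    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], CanonicalQuote] = {}

    def ingest(self, speaker_id: str, text: str,
               source_url: str, source_authority: float):
        """Merge into an existing record or create one; return (record, is_new)."""
        key = (speaker_id, normalize(text))
        rec = self._records.get(key)
        is_new = rec is None
        if is_new:
            rec = CanonicalQuote(speaker_id, text)
            self._records[key] = rec
        rec.sources[source_url] = source_authority
        return rec, is_new

    def __len__(self) -> int:
        return len(self._records)
```

Because `authority` is computed from the current `sources` map, it reflects every new citation without a separate recompute pass.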
Quality Control
Quote search risks surfacing misattributions, out-of-context quotes, or fabricated statements. The patent specifies safeguards.
- Source Authority Weighting — Quotes from high-authority sources dominate. Low-authority republishers contribute less to canonical scoring even when they repeat the quote.
- Attribution Confidence Threshold — Quotes with low attribution confidence are excluded from the index. Better to omit than to surface a misattributed quote.
- Cross-Source Consensus — When multiple high-authority sources agree on a quote's speaker, confidence rises. Disagreement triggers manual review for important quotes.
- Out-Of-Context Detection — Quotes excerpted in ways that change their meaning are flagged. The system prefers in-context renderings with surrounding sentences.
- Manual Correction Channel — Speakers can correct misattributed quotes via verified-profile flows. Corrections feed back into the canonical store.
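The confidence threshold and cross-source consensus safeguards can be combined in a single decision function. This is a hedged sketch: the authority-weighted vote, the 0.7 confidence floor, the 0.2 margin, and the status labels are all invented for illustration. It also routes every close call to review, whereas the source reserves manual review for important quotes.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def resolve_attribution(claims: Iterable[Tuple[str, float]],
                        min_confidence: float = 0.7,
                        min_margin: float = 0.2) -> Tuple[Optional[str], float, str]:
    """Decide the speaker of one canonical quote from (speaker_id, source_authority) claims.

    Returns (speaker_id, confidence, status), where status is one of
    'indexed', 'needs_review', or 'excluded'.
    """
    weight = defaultdict(float)
    for speaker_id, authority in claims:
        weight[speaker_id] += authority  # high-authority sources count for more

    total = sum(weight.values())
    if total == 0:
        return None, 0.0, "excluded"

    ranked = sorted(weight.items(), key=lambda kv: kv[1], reverse=True)
    best, best_weight = ranked[0]
    confidence = best_weight / total
    runner_up = ranked[1][1] / total if len(ranked) > 1 else 0.0

    if confidence - runner_up < min_margin:
        return best, confidence, "needs_review"   # sources disagree closely
    if confidence < min_confidence:
        return best, confidence, "excluded"       # better to omit than misattribute
    return best, confidence, "indexed"
```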
Real-World Application
Quote search underpins features like 'about author' panels with quoted excerpts, voice-assistant responses to 'what did X say about Y', and SERP cards surfacing quoted statements alongside news articles.
- Multi-facet Indexing Dimensions — Speaker, topic, time, source authority. Each is indexed and queryable in combination.
- Canonicalized Duplicate Handling — Quotes repeated across many sources collapse to one canonical record with provenance tracking.
- Source-weighted Ranking Approach — Quotes from high-authority sources outrank quotes from low-authority ones, mitigating the propagation of misattributions.
Why Quotable Sentences Earn Visibility
A sharp, self-contained statement attributed to a recognized expert gets pulled into the quote index. Interview-style content with clear attribution surfaces in quote-search results long after the original article fades.
Why Attribution Markup Matters
Using blockquote with cite, structured data for Quotation type, and clear inline attribution makes quote extraction cleaner. Pages with explicit attribution markup contribute their quotes to the index more reliably than pages with implicit attribution.
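To see why explicit markup makes extraction cleaner, here is a small sketch using Python's standard `html.parser`. It assumes one particular page convention: a `blockquote` whose `cite` attribute holds the source URL, followed by a `figcaption` naming the speaker. That convention and the `QuoteMarkupParser` name are assumptions for the example, not anything the patent prescribes.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class QuoteMarkupParser(HTMLParser):
    """Collect blockquote text, its cite URL, and a following figcaption."""

    def __init__(self) -> None:
        super().__init__()
        self.quotes: List[Dict[str, Optional[str]]] = []
        self._target: Optional[str] = None  # field currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.quotes.append({
                "text": "",
                "source_url": dict(attrs).get("cite"),
                "speaker": "",
            })
            self._target = "text"
        elif tag == "figcaption" and self.quotes:
            # Simplification: attach the caption to the most recent blockquote.
            self._target = "speaker"

    def handle_endtag(self, tag):
        if tag in ("blockquote", "figcaption"):
            self._target = None

    def handle_data(self, data):
        if self._target and self.quotes:
            self.quotes[-1][self._target] += data


def extract_marked_quotes(html: str) -> List[Dict[str, Optional[str]]]:
    parser = QuoteMarkupParser()
    parser.feed(html)
    parser.close()
    return [
        {k: (v.strip() if isinstance(v, str) else v) for k, v in q.items()}
        for q in parser.quotes
    ]
```

With markup like this, the statement, speaker, and source fall out of the document structure, so no attribution patterns have to be guessed from running prose.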
What This Means for SEO
Entity-quote search retrieves quotes attributed to a person or organization, so quotable content with clear attribution earns a discovery surface.
- Quotable Sentences Earn Discovery — A sharp, self-contained sentence attributed to a recognized entity gets retrieved for entity-quote queries. Plant such sentences in every interview-style piece.
- Attribution Markup Helps Extraction — When you mark quoted text with the speaker (using semantic HTML or schema), extraction is cleaner. <blockquote cite> is more than visual styling; it is a signal.
- Entity Authority Boosts Quote Visibility — Quotes from already-authoritative entities get surfaced more. Linking quotes to canonical entity profiles (Wikipedia, Wikidata) reinforces the attribution.