Predicts the quality score of a new site that has not yet accumulated enough behavioral data by mapping its content phrases against a phrase model derived from baseline-scored sites, transferring quality estimates from established sites to comparable new ones.
Patent Overview
- Inventor: Navneet Panda
- Assignee: Google LLC
- Filed: 2014-03-12
- Granted: 2017-09-19
- Application Number: US 14/206,776
The Challenge
New Sites Have No Behavioral Data Yet
The site quality score is computed from query and click data. A new site has not yet accumulated that data, so its score is undefined. Ranking without a score forces a bad choice: assign a generic neutral score, which over-promotes potentially low-quality new entrants, or assign no score, which under-ranks legitimate new sites. The system needs a way to predict quality for sites that lack ground-truth data.
- Cold Start Is A Real Problem — Every new site faces a period where behavioral signals are too sparse to support a quality score. During this period the site has no quality signal at all unless the system can predict one.
- Content Phrases Reveal Site Character — The phrases a site uses in its content are observable from day one and correlate with the site's quality character. A site whose phrase usage matches that of established high-quality sites is likely to be high quality itself.
- Phrase Model From Baseline Sites — By analyzing the phrase distribution of sites with established baseline quality scores, the system can build a model that maps phrase usage patterns to quality predictions.
- Prediction Bridges The Data Gap — Until behavioral data accumulates, the prediction stands in as the site's quality estimate. Once enough data accumulates, the behavioral score takes over.
Innovation
Phrase Model Maps New Sites To Quality Predictions
The system obtains baseline quality scores for multiple previously scored sites. It generates a phrase model that maps phrase-specific relative frequency measures to phrase-specific baseline quality scores. For a new site without behavioral data, it computes the new site's relative phrase frequencies and applies the phrase model to predict the site's quality. The predicted quality stands in for the missing behavioral score during the site's cold-start period.
- Collect Baseline Sites — Gather a set of sites that have already accumulated enough behavioral data to have stable quality scores. These become the training corpus for the phrase model.
- Compute Per-Phrase Frequencies — For each baseline site, compute relative frequencies of phrases in its content. The frequencies form the input features.
- Build Phrase Model — Construct a model that maps phrase-frequency profiles to quality scores. The model can be a linear regression, a probabilistic model, or a learned classifier.
- Apply To New Site — For a new site, compute its relative phrase frequencies and run them through the phrase model. The model outputs a predicted quality score.
- Use Prediction In Ranking — Until the new site accumulates behavioral data, ranking uses the predicted quality score.
- Switch To Behavioral When Available — Once the site has accumulated enough query and selection data to compute its behavioral score directly, the behavioral score replaces the prediction.
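The train-then-predict steps above can be sketched in a few lines. This is a minimal illustration, not the patented method: it assumes each site is already reduced to a `{phrase: relative_frequency}` profile, groups sites into frequency buckets per phrase, and stores the mean baseline score of each bucket. The function names, the `width` bucketing parameter, and the plain averaging at prediction time are all hypothetical choices made for the example.

```python
from collections import defaultdict


def bucket(freq, width=0.1):
    """Map a relative frequency to a bucket index (hypothetical bucketing)."""
    return int(freq / width)


def build_phrase_model(baseline_sites, width=0.1):
    """Build {(phrase, bucket): mean baseline score of sites in that bucket}.

    baseline_sites is a list of (profile, baseline_score) pairs, where
    profile is a {phrase: relative_frequency} dict for one scored site.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [score total, site count]
    for profile, score in baseline_sites:
        for phrase, freq in profile.items():
            entry = sums[(phrase, bucket(freq, width))]
            entry[0] += score
            entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def predict_quality(model, profile, width=0.1):
    """Average the phrase-specific scores the model holds for this profile.

    Returns None when none of the new site's phrases appear in the model.
    """
    scores = []
    for phrase, freq in profile.items():
        key = (phrase, bucket(freq, width))
        if key in model:
            scores.append(model[key])
    return sum(scores) / len(scores) if scores else None


# Illustrative data only: made-up phrases and scores.
baseline = [
    ({"in depth guide": 0.25, "click here": 0.05}, 0.9),
    ({"in depth guide": 0.22}, 0.7),
    ({"click here": 0.25}, 0.2),
]
model = build_phrase_model(baseline)
new_site = {"in depth guide": 0.21, "unseen phrase": 0.5}
predicted = predict_quality(model, new_site)
```

Phrases the model has never seen are ignored here, so a site with no known phrases gets no prediction at all. A production system would need a policy for that case, which the source does not describe.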
Phrase Patterns As Quality Predictor
The patent recognizes that quality has a phrase signature. The vocabulary and phrasing patterns of high-quality sites differ from low-quality sites in detectable ways, and those differences are observable in content alone.
Content Phrases Are A Predictor, Not A Ground Truth
Behavioral data remains the ground truth quality signal. The phrase model is a bridge that fills the gap for new sites until ground truth accumulates.
- Phrase Frequency Profile — Per-site distribution of phrase relative frequencies. Captures the site's vocabulary character.
- Phrase Model — Mapping from phrase-frequency profiles to predicted quality scores. Trained on baseline sites.
- Prediction As Bridge — Predicted quality stands in until behavioral data accumulates. Transitions to behavioral signal once enough data is available.
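The hand-off from prediction to behavioral score can be pictured as a simple selection rule. The sketch below is an assumption-laden illustration: the source does not say how "enough data" is measured, so `interactions` and the `min_interactions` threshold are hypothetical stand-ins.

```python
def effective_quality(behavioral_score, predicted_score, interactions,
                      min_interactions=1000):
    """Pick the score that ranking should use for a site.

    The behavioral score wins once the site has at least min_interactions
    recorded queries and clicks (a hypothetical threshold); until then the
    phrase-model prediction stands in.
    """
    if behavioral_score is not None and interactions >= min_interactions:
        return behavioral_score
    return predicted_score


established = effective_quality(0.6, 0.8, interactions=5000)
brand_new = effective_quality(None, 0.8, interactions=10)
still_sparse = effective_quality(0.6, 0.8, interactions=10)
```

A hard threshold is the simplest possible rule. Blending the two scores as data accumulates would be another option, but nothing in the source indicates which approach is used.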
Technical Foundation
Phrase Model Construction
Two stages: train the model on baseline sites; apply it to new sites.
- Baseline Site Set — Sites with established behavioral quality scores. The training corpus for the model.
- Phrase Frequency Measure — Per-phrase relative frequency in the site's content. Captures vocabulary character.
- Phrase-Quality Mapping — Model output: predicted quality score from a phrase-frequency profile.
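One way to picture the phrase frequency measure is as an n-gram distribution over a site's text. The sketch below assumes a "phrase" is a word n-gram and that relative frequency means the phrase count divided by the total number of n-grams on the site. Neither definition comes from the source, and the function name and tokenizer are hypothetical.

```python
import re
from collections import Counter


def phrase_frequency_profile(pages, n=2):
    """Return {phrase: relative frequency} across all pages of one site.

    Assumption: phrases are word n-grams and the relative frequency of a
    phrase is its count divided by the total n-gram count for the site.
    """
    counts = Counter()
    for text in pages:
        # Lowercase and keep simple word tokens (a deliberately crude tokenizer).
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phrase: count / total for phrase, count in counts.items()}


profile = phrase_frequency_profile(["A b a b"])
```

For the single page `"A b a b"` the bigrams are `a b`, `b a`, `a b`, so `a b` has relative frequency 2/3 and `b a` has 1/3.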
Key Insight: Quality sites tend to share vocabulary patterns, and spammy or thin sites tend to share different ones. The premise is that this signal is strong enough for content analysis alone to predict quality for new sites that have no behavioral data. The phrase model encodes this learned correlation between content and quality.
What This Means for SEO
Predicted site quality affects new sites disproportionately. Understanding the phrase-model mechanism informs how new sites should approach content from day one.
- New Sites Are Judged By Content Patterns — Before your site has accumulated behavioral data, its content phrase patterns determine its quality prediction. Vocabulary and topical depth from launch day influence how the site ranks during the cold-start period.
- Mimicking Quality-Site Vocabulary Helps — If your content phrase patterns resemble those of established high-quality sites in your niche, the phrase model predicts a higher quality score for you. Read the leaders in your category; write content that uses the same vocabulary depth.
- Thin Content Is A Detectable Pattern — Thin sites with repetitive or shallow phrasing produce phrase profiles that the model recognizes. Depth and topical breadth in content production produce phrase profiles that look like quality sites.
- Cold-Start Penalty Has An Exit — Once your site accumulates real behavioral data, the behavioral score takes over from the prediction. The predicted quality is a starting position, not a permanent label.
- Audience-Defined Vocabulary Compounds With Brand Search — Content using your audience's actual vocabulary tends to attract that audience's searches. The phrase-model prediction lifts you while you build the brand-search demand that feeds the behavioral score later.