An approach to semantic document matching that leverages user activity patterns rather than traditional content analysis, enabling powerful search capabilities in privacy-sensitive environments.
Patent Overview
- Granted
- August 2023
The Challenge
The problem this patent addresses comes from the signals earlier search systems depend on, which private document stores do not provide. Several specific gaps motivated the new approach.
- Traditional Search Limitations — Internet-based search engines typically rely on two key data sources: user interaction data (click-through rates, click logs) and semantic content analysis of openly accessible documents. However, these approaches fail in environments like cloud storage and private document...
- The Opportunity — While cloud-based file storage platforms and private document systems lack click-through data and full content access, they possess something valuable: detailed and robust document activity logs. These activity logs capture every interaction, when documents are opened...
- Obtain Documents and Activity Logs — The system collects pairs of documents along with their associated activity logs, which record access events including timestamps and access types (opening, editing, sharing, etc.).
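To make the idea of an activity log concrete, here is a minimal sketch of one possible in-memory representation. The class names, the `user_id` field, and the `events_between` helper are assumptions made for illustration; the patent summary only says that logs record access events with timestamps and access types.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class AccessType(Enum):
    # Access types named in the summary; a real system may track more.
    OPEN = "open"
    EDIT = "edit"
    SHARE = "share"


@dataclass(frozen=True)
class AccessEvent:
    user_id: str              # hypothetical field: who accessed the document
    access_type: AccessType   # what kind of access it was
    timestamp: datetime       # when it happened


@dataclass
class ActivityLog:
    document_id: str
    events: list[AccessEvent] = field(default_factory=list)

    def record(self, user_id: str, access_type: AccessType, timestamp: datetime) -> None:
        """Append one access event to this document's log."""
        self.events.append(AccessEvent(user_id, access_type, timestamp))

    def events_between(self, start: datetime, end: datetime) -> list[AccessEvent]:
        """Return events with start <= timestamp < end, oldest first."""
        return sorted(
            (e for e in self.events if start <= e.timestamp < end),
            key=lambda e: e.timestamp,
        )
```

A helper like `events_between` is one way to cut a log into the activity segments that later steps work on.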
Innovation
How The System Works
The patent introduces a multi-step mechanism that turns document activity logs into training labels and, from there, into a semantic matching model. Each step builds on the previous one.
- A Novel Training Signal: Document Activity Logs — This disclosure introduces a method for training machine-learned semantic matching models using document activity logs as the primary training signal. Rather than relying on content analysis or click data, the system learns document relationships from user...
- Understanding Document Activity Logs — Document activity logs form the foundation of this training approach. Each log maintains a comprehensive record of interactions with a specific document, capturing the who, what, and when of every access event.
- Scalar Relation Labels — Beyond binary labels (related/not related), the system supports scalar values that indicate the degree of relatedness. This captures the spectrum of document relationships, from loosely associated to tightly coupled documents.
- Content Embeddings Through N-Gram Selection — A critical innovation ensures user privacy while still leveraging document content: the system generates embeddings using only high-frequency character subsets (n-grams), making individual document content indecipherable.
- Enhanced Privacy Through Selective Content Processing — The system's approach to content embeddings provides inherent privacy and security advantages, particularly important for confidential document repositories.
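The n-gram selection idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the trigram size, the `top_k` cutoff, and the normalized count vector are all hypothetical choices. The point it shows is that n-grams outside the high-frequency vocabulary are dropped, so the resulting vector cannot be used to reconstruct a document's text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(corpus, n=3, top_k=1000):
    """Keep only the top_k most frequent character n-grams across the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(char_ngrams(text.lower(), n))
    return [gram for gram, _ in counts.most_common(top_k)]


def embed(text, vocabulary, n=3):
    """Represent text as normalized counts over the vocabulary only.

    N-grams that are not in the vocabulary are ignored entirely.
    """
    counts = Counter(char_ngrams(text.lower(), n))
    total = sum(counts[g] for g in vocabulary) or 1  # avoid division by zero
    return [counts[g] / total for g in vocabulary]
```

With a small `top_k`, rare and potentially identifying strings (names, account numbers) never enter the vocabulary, which is the privacy property the summary describes.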
Relation Label Formula
A single load-bearing idea anchors the entire patent. Understanding it makes the rest of the design follow naturally.
- Relation Label Formula — For each document pair, the relation label Y_{d,d'} is defined in terms of coaccesses(d, d'), the number of co-accesses between documents d and d' within the activity segment.
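As a rough sketch of how a label could be derived from coaccesses(d, d'), the code below counts pairs of access events that fall within a time window and maps the count to a label. The exact formula is not reproduced in this summary, so everything here is an assumption: the window-based definition of a co-access, the "at least one co-access" rule for the binary label, and the max-normalized scalar label are hypothetical choices.

```python
from itertools import combinations


def count_coaccesses(access_times, window_seconds):
    """Count co-accesses per document pair within one activity segment.

    access_times maps a document id to a list of access timestamps (seconds).
    Assumption: a co-access is any pair of events, one per document,
    whose timestamps differ by at most window_seconds.
    """
    counts = {}
    for d, d_prime in combinations(sorted(access_times), 2):
        counts[(d, d_prime)] = sum(
            1
            for t in access_times[d]
            for t_prime in access_times[d_prime]
            if abs(t - t_prime) <= window_seconds
        )
    return counts


def binary_label(coaccesses):
    """Hypothetical binary label: related if there is at least one co-access."""
    return 1 if coaccesses > 0 else 0


def scalar_label(coaccesses, max_coaccesses):
    """Hypothetical scalar label in [0, 1], normalized by the segment maximum."""
    return coaccesses / max_coaccesses if max_coaccesses else 0.0
```

For example, with accesses `{"d1": [0, 600], "d2": [30, 620], "d3": [5000]}` and a 60-second window, d1 and d2 are co-accessed twice while d3 is never co-accessed with either.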
Technical Foundation
The implementation rests on a specific set of components and data structures. These are the parts the patent claims and the engineering that ties them together.
- Server Computing System — Hosts the cloud-based file storage platform, maintains document repositories and activity logs, and implements the trained semantic matching model for search operations. Can operate as a single server or distributed system with multiple server computing...
- Semantic Similarity Computation — The trained model computes semantic similarity values between documents within and across clusters, revealing relationships that weren't captured by initial clustering criteria.
- Beyond Text: Images, Video, and Metadata — While the detailed examples focus on textual content, the system architecture supports multiple content modalities, enabling comprehensive document matching across diverse file types.
- Technical Impact — By leveraging the rich behavioral data inherent in document activity logs, this approach enables sophisticated semantic matching in environments where traditional methods fail, opening new possibilities for intelligent document organization and retrieval...
The Process
In production, the system executes a sequence of stages from query reception to result delivery. Each stage applies one transformation to the data.
- Query Submission — A user enters a search query through a user device (smartphone, desktop, terminal). The query can be text, an image (for reverse image search), or other content types.
- User Presentation — Ranked results are presented to the user through the search interface, enabling efficient document discovery.
- Training Data Collection and Model Deployment — Practical implementation involves careful consideration of data collection strategies, training infrastructure, and deployment architectures.
Quality Control
The system includes checks that defend against edge cases, manipulation, and degraded signal. Without these, the core mechanism would be exploitable.
- Train Semantic Matching Model — Document pairs are input to a machine-learned model that generates semantic similarity values. A loss function evaluates the difference between predicted similarity and the relation label, enabling model training.
- Deploy for Search Operations — The trained model can then rank search results by computing semantic similarity between search queries and candidate documents, without requiring access to full document content.
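The training step above (predict a similarity, compare it with the relation label, update the model) can be illustrated with a deliberately tiny model. This is a toy sketch, not the patent's architecture: the per-dimension weights, the sigmoid output, the cross-entropy loss, and the plain gradient-descent loop are all assumptions chosen to keep the example self-contained.

```python
import math


def similarity(w, b, x, x_prime):
    """Predicted similarity in (0, 1) from a weighted element-wise product."""
    z = b + sum(wi * xi * xpi for wi, xi, xpi in zip(w, x, x_prime))
    return 1.0 / (1.0 + math.exp(-z))


def loss(pred, label):
    """Cross-entropy between predicted similarity and the relation label."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(pred + eps) + (1 - label) * math.log(1 - pred + eps))


def train(pairs, dim, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient descent.

    pairs is a list of (embedding_d, embedding_d_prime, relation_label).
    """
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, x_prime, y in pairs:
            # For sigmoid + cross-entropy, d(loss)/dz = prediction - label.
            err = similarity(w, b, x, x_prime) - y
            for i in range(dim):
                w[i] -= lr * err * x[i] * x_prime[i]
            b -= lr * err
    return w, b
```

Because the loss accepts any label in [0, 1], the same loop works with either binary or scalar relation labels.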
Real-World Application
The patent shapes how the search engine behaves in production. These are the visible outcomes for users and content publishers.
- Example Application — Consider a collected document set D = {d1, d2, d3, d4}. The extracted co-access labels might be: The complete training dataset T is then collected from N activity segments: T = ∪_{i=1}^N {(d, d'...
- Result Ranking — Documents are ranked according to their semantic similarity values, with the most semantically similar documents appearing first in the results.
- Generate Relation Labels — Based on temporal proximity of access events, the system determines whether document pairs are related. If two documents are accessed within a threshold time window (e.g., k minutes), they receive...
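The result-ranking step in the list above amounts to scoring each candidate against the query and sorting by score. In the sketch below, cosine similarity stands in for the trained model's similarity function, which is an assumption; the function names and the dictionary of candidate embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, candidates):
    """Return (doc_id, score) pairs sorted from most to least similar.

    candidates maps a document id to its embedding vector.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Only embeddings are needed at query time, which matches the summary's point that ranking does not require access to full document content.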
What This Means for SEO
When the system uses real user activity to train its semantic matcher, the documents that get engaged with become the training data for what counts as a good match.
- Engagement Is Training Data — Pages users actually open, scroll, and dwell on define what the model learns about query-document fit. Bounce-rich pages teach the model the opposite of what you want it to learn.
- Activity Patterns Define Topical Anchors — The activity logs reveal which documents anchor a topic in users' minds. Becoming an anchor document means a long, sustained engagement curve, not a viral spike.
- Implicit Feedback Beats Implicit Optimization — Activity logs are hard to game at scale, so a model trained on them has less reason to lean on on-page signals. The content has to actually serve the user, not just appear to.