Builds multilevel hierarchical document taxonomies using Fisher discrimination values to pick the most informative features at each level of the hierarchy, producing classification trees more discriminating than flat single-level approaches.
Patent Overview
- Inventor: Prabhakar Raghavan
- Assignee: IBM Corporation
- Filed: 1998-06-24
- Granted: 2001-05-15
- Application Number: US 09/103,328
The Challenge
Flat Classification Cannot Scale To Big Taxonomies
Classifying documents into a deep topic hierarchy (Yahoo Directory, DMOZ, Google's category trees) takes more than flat one-versus-rest classifiers. A flat approach applies the same features to every decision, which is both inefficient and ineffective: the features that distinguish 'science vs sports' differ from those that distinguish 'physics vs chemistry'. What is needed is a hierarchical classifier that selects different features at each level of the tree, optimized for the distinctions that level actually has to make.
- One Feature Set Cannot Discriminate At Every Level — Top-level distinctions (broad topics) use different cues than mid-level distinctions (subtopics) or leaf-level distinctions (specific entities). Using the same feature set everywhere wastes discriminating power.
- Need Per-Node Feature Selection — Each internal node of the taxonomy makes its own routing decision (which child to send the document to). Each routing decision benefits from features chosen specifically for that decision.
- Fisher Discrimination Is The Right Selector — A Fisher discrimination value measures how well a single feature separates the class distributions it is scored against. That makes it well suited to picking the features that best distinguish the children of a node.
- Hierarchy Has To Be Maintainable — The taxonomy evolves: new topics appear, others merge, documents move. The classifier has to support maintenance operations without rebuilding the whole tree.
- Trade-Off Between Tree Depth And Per-Node Power — Deeper trees mean more nodes, each with fewer training documents and therefore a weaker signal. The Fisher selection has to identify features strong enough to discriminate even at deep nodes where data is sparse.
Innovation
Fisher Values Pick Features Per Node
At each internal node of the taxonomy, the system computes Fisher discrimination values for candidate features (tokens, terms, phrases, dates) on the training documents that belong to each child. Features with the highest Fisher values are selected as the routing features for that node. A new document is routed by evaluating those features and following the strongest child match. Different nodes use different feature sets, each tuned to its own discrimination task.
- Start With A Taxonomy — Begin with a topic hierarchy that has nodes at multiple levels and training documents assigned to nodes. The hierarchy can be hand-built initially.
- For Each Internal Node, Identify Training Sets — At each internal node, group the training documents by the child they belong to. These per-child training sets are what the node's classifier will discriminate between.
- Compute Fisher Discrimination Per Feature — For each candidate feature (term, phrase, date), compute its Fisher discrimination value: the ratio of between-class variance to within-class variance across the children's training sets.
- Select Top Discriminating Features — Pick the features with the highest Fisher values as the routing features for this node. The selected feature set is unique to this node and to the children it routes between.
- Build Routing Classifier — For each node, build a classifier that uses the selected features to decide which child a new document belongs to. Standard classifiers (naive Bayes, logistic regression) work on the selected feature subset.
- Route New Documents — A new document is routed by entering at the root, evaluating the root's selected features, picking the best child, and recursing. The path through the tree assigns the document to a leaf topic.
- Maintain As Hierarchy Changes — When the taxonomy changes (new topics, merged topics, moved documents), recompute Fisher values only for affected nodes. The hierarchical structure makes maintenance cheap.
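The steps above can be sketched end to end on a toy taxonomy. This is a minimal illustration under stated assumptions, not the patent's implementation: the `TREE` and `LEAF_DOCS` data, the function names, the epsilon guard, and the nearest-centroid router (standing in for naive Bayes or logistic regression) are all hypothetical choices made for brevity.

```python
from collections import Counter
from statistics import fmean

# Hypothetical toy taxonomy: internal node -> children. Any other name is a leaf.
TREE = {"root": ["science", "sports"], "science": ["physics", "chemistry"]}

# Hypothetical training documents (token lists) attached to leaf topics.
LEAF_DOCS = {
    "physics": [["quark", "particle", "energy", "experiment"],
                ["quark", "particle", "field", "experiment"]],
    "chemistry": [["molecule", "acid", "reaction", "experiment"],
                  ["molecule", "acid", "bond", "experiment"]],
    "sports": [["team", "goal", "match", "season"],
               ["team", "goal", "score", "match"]],
}


def fisher_value(values_by_class):
    """Between-class variance over within-class variance for one feature."""
    all_values = [v for vals in values_by_class.values() for v in vals]
    grand_mean = fmean(all_values)
    between = within = 0.0
    for vals in values_by_class.values():
        class_mean = fmean(vals)
        between += len(vals) * (class_mean - grand_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in vals)
    # Both sums would share the same 1/N factor, so it cancels in the ratio.
    # Assumption: a small epsilon guards against zero within-class variance.
    return between / (within + 1e-12)


def docs_under(node):
    """All training documents in the subtree rooted at node."""
    if node not in TREE:
        return LEAF_DOCS[node]
    return [doc for child in TREE[node] for doc in docs_under(child)]


def train_node(node, k):
    """Select the top-k Fisher features for one node and fit per-child centroids."""
    docs_by_child = {c: [Counter(d) for d in docs_under(c)] for c in TREE[node]}
    vocab = sorted({t for docs in docs_by_child.values() for d in docs for t in d})
    scores = {
        t: fisher_value({c: [d[t] for d in docs] for c, docs in docs_by_child.items()})
        for t in vocab
    }
    # Stable sort over the sorted vocabulary breaks ties alphabetically.
    features = sorted(vocab, key=lambda t: -scores[t])[:k]
    centroids = {
        c: [fmean([d[t] for d in docs]) for t in features]
        for c, docs in docs_by_child.items()
    }
    return features, centroids


def route(tokens, models, node="root"):
    """Walk from the root to a leaf, using each node's own feature set."""
    counts = Counter(tokens)
    path = [node]
    while node in models:
        features, centroids = models[node]
        vec = [counts[t] for t in features]
        node = min(
            centroids,
            key=lambda c: sum((x - m) ** 2 for x, m in zip(vec, centroids[c])),
        )
        path.append(node)
    return path


models = {node: train_node(node, k=3) for node in TREE}
print(route(["quark", "particle", "collider", "experiment"], models))
# ['root', 'science', 'physics']
```

In this toy data, "experiment" appears in every science document and in no sports document, so it scores highly at the root. At the science node it appears equally in physics and chemistry documents, scores zero, and is not selected there, which is the per-node effect the steps describe.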
Per-Node Feature Selection With Fisher
Where flat classifiers use one feature set for all decisions, this hierarchical classifier picks fresh features at each node based on Fisher discrimination. Each node becomes a specialized classifier tuned to its own routing decision.
Different Features For Different Distinctions
The features that distinguish broad topics differ from the features that distinguish subtopics. A hierarchical classifier should exploit this by picking features per node rather than globally.
- Fisher Value — Ratio of between-class variance to within-class variance. High value means the feature separates the classes cleanly.
- Per-Node Selection — Each internal node picks its own top-Fisher features. The selection is local to the routing decision the node makes.
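Hierarchical classification gains accuracy because the most discriminating features change from one level of the taxonomy to the next, and per-node selection lets each routing decision use the features that matter for it.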
Technical Foundation
Fisher Discrimination Mathematics
Fisher's discriminant criterion is well studied in statistics, and scoring individual features by it is a common feature-selection technique. The patent's contribution is applying it per node in a hierarchy.
- Between-Class Variance — For a feature and a set of classes, how different the per-class means are. Larger values mean the feature distinguishes the classes well.
- Within-Class Variance — For a feature, how much it varies within each class. Smaller values mean the feature is consistent within a class.
- Fisher Value — Between-class variance divided by within-class variance. High values flag features that distinguish classes cleanly with low within-class noise.
- Per-Node Feature Set — The subset of features with the highest Fisher values for the children of a particular node. Used by that node's classifier and not shared with other nodes.
Quality Metrics
- Fisher Discrimination — Higher values mean the feature is a better discriminator among the children of a given node. The selected feature set is the top-k by this measure.
F(f, C) = var_between(f, C) / var_within(f, C)
where f is a candidate feature and C is the set of child classes at the node being trained.
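As a worked example, the ratio can be computed for a single feature from its per-class values. This is a sketch under assumptions: it uses one common weighted form of between-class and within-class variance, which may differ in detail from the patent's exact formula, and the function name, sample counts, and epsilon guard are illustrative.

```python
from statistics import fmean


def fisher_value(values_by_class):
    """Between-class variance divided by within-class variance for one feature.

    values_by_class maps a class label to the feature's values (for example,
    term counts) in that class's training documents.
    """
    all_values = [v for vals in values_by_class.values() for v in vals]
    n = len(all_values)
    grand_mean = fmean(all_values)

    between = 0.0
    within = 0.0
    for vals in values_by_class.values():
        class_mean = fmean(vals)
        # How far this class's mean sits from the overall mean, weighted by size.
        between += len(vals) * (class_mean - grand_mean) ** 2
        # How much the feature scatters around its own class mean.
        within += sum((v - class_mean) ** 2 for v in vals)
    between /= n
    within /= n

    # Assumption: a small epsilon avoids division by zero when the feature
    # is constant within every class.
    return between / (within + 1e-12)


# Hypothetical term counts in two sibling classes.
strong = fisher_value({"physics": [4, 6], "chemistry": [0, 2]})  # about 4.0
weak = fisher_value({"physics": [2, 4], "chemistry": [1, 5]})    # 0.0
print(strong, weak)
```

The first feature has class means of 5 and 1 with little scatter around them, so it scores about 4.0. The second has identical class means of 3, so its between-class variance and its score are zero.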
Key Insight: The hierarchical-per-node approach is what makes large-scale topic taxonomies feasible. Flat classification at the leaf level would require enormous feature vectors and training data. Hierarchical classification decomposes the problem so each node only solves the local discrimination task with locally-optimal features. The Fisher selection is a clean way to make 'locally optimal' precise and computable.
The Process
Building And Maintaining The Classifier
The classifier is built offline against a labeled training set and maintained as the taxonomy evolves.
- Training Documents Labeled By Taxonomy — Begin with a corpus of documents labeled by their position in the taxonomy. Labels can come from editorial curation or crawled directory data.
- Per-Node Training — For each internal node, gather its training documents grouped by child. These are the inputs to feature selection.
- Fisher Computation — Compute Fisher discrimination per candidate feature. Sort features by Fisher value. Select the top-k as routing features.
- Classifier Construction — Build a classifier per node using the selected features. Standard text classifiers (naive Bayes, logistic regression) work well.
- Routing Test — Validate the classifier on held-out documents. Adjust the number of selected features and the classifier choice based on validation accuracy.
- Production Deployment — Deploy the hierarchy of per-node classifiers. New documents are routed by walking the tree from root to leaf using each node's classifier.
- Incremental Maintenance — When the taxonomy changes (additions, deletions, merges), recompute Fisher values and classifiers only for affected nodes. Most of the tree is preserved across updates.
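The incremental maintenance step can be sketched as follows. This rests on an assumption the source does not spell out: a change to one topic affects only the internal nodes above it, because only their per-child training sets include that topic's documents. The `PARENT` map and the function name are hypothetical.

```python
# Hypothetical taxonomy expressed as child -> parent links.
PARENT = {
    "science": "root",
    "sports": "root",
    "physics": "science",
    "chemistry": "science",
    "football": "sports",
    "tennis": "sports",
}


def affected_nodes(changed_topic, parent):
    """Internal nodes whose child training sets include the changed topic.

    Assumption: these are exactly the ancestors of the changed topic, so
    sibling subtrees keep their existing features and classifiers.
    """
    affected = []
    node = parent.get(changed_topic)
    while node is not None:
        affected.append(node)
        node = parent.get(node)
    return affected


print(affected_nodes("physics", PARENT))
# ['science', 'root']
```

Under this assumption, new physics documents trigger retraining at "science" and "root" only, while the classifier at "sports" is left untouched.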
What This Means for SEO
Hierarchical topic classification is one of the quiet inputs to how search engines understand what your page is about. Knowing the per-node feature selection mechanism shapes how to write for both broad and narrow topical positioning.
- Top-Level Topical Markers Matter — The features that distinguish your page at the broadest level (industry, domain) need to be unmistakable. Generic, ambiguous content fails the top-level discriminator and gets routed into wrong branches of the taxonomy.
- Subtopic Specificity Comes Next — Once routed into the right top-level branch, the features that distinguish your subtopic from sibling subtopics matter. Use the vocabulary specific to your sub-niche, not just the broader industry terms.
- Different Levels Use Different Words — Top-level features are broad category words; deeper features are precise jargon. Pages should include both: broad terms for routing into the right branch, precise terms for distinguishing within it.
- Avoid Cross-Topical Drift — Pages that mention vocabulary from multiple unrelated top-level branches confuse the routing at the root and risk being misclassified. Stay within a topical lane unless you intentionally target a cross-cutting topic.
- Niche Vocabulary Beats Generic Vocabulary For Deep Branches — If your target is a deep node in a topic tree, the high-Fisher features at that level are specific to the niche. Using only generic top-level vocabulary fails the per-node discrimination at deep levels.
- Authoritative Sources Use The Right Vocabulary At The Right Level — Genuinely authoritative pages naturally use both broad and specific vocabulary appropriately. Their natural language patterns produce high Fisher values at the levels they target.