I (Bob) am starting to feel like a participant in an early 20th century epistolary academic exchange (e.g. that between Mr. Russell and Mr. Strawson).

In a comment to John Langford’s response to my blog entry recapitulating his comments after his talk, Kuzman Ganchev points out that he and Mark Dredze did the empirical legwork in their 2008 paper:

- Ganchev, Kuzman and Mark Dredze. 2008. Small Statistical Models by Random Feature Mixing. In
*Proceedings of the ACL-2008 Workshop on Mobile Language*.

To summarize, we’re considering the effect on classification accuracy of using hash codes (modulo some fixed *n*) of feature representations as dimensional identifiers rather than requiring a unique identifier for each of *m* underlying features. The reason this is interesting is that it requires much less memory and far less computational effort to compute features this way. The reason it might be problematic for accuracy is that there’s an increasing likeliood of collisions as the number of parameters *n* decreases.

For instance, character *n*-gram hash codes may be computed at a cost of only a couple assignments and arithmetic operations per character by the Karp-Rabin algorithm, which only implicitly represent the *n*-grams themselves. For Latin1 (ISO-8859-1) encoded text, Karp-Rabin can even be computed online with binary input streams (`java.io.InputStream`

) without the expense of decoding Unicode characters. With *n*-gram-based features, input streams are usually effective with other character encodings. More complex features may be handled by simple modifications of the Karp-Rabin algorithm.

Ganchev and Dredze showed that for many NLP problems (e.g. spam filtering, 20 newsgroups, Reuters topics, appliance reviews), there is very little loss from drastic reductions in *n* relative to *m*. This is very good news indeed.

Getting back to John Langford’s last post, I would like to answer my own question about how having multiple hash codes per feature helps maintain discriminative power in the face of collisions. It may seem counterintuive, as having more features (2 * *m*) for the same number of parameters (*n*) seems like there will simply be more collisions.

Let’s consider a very simple case where there is a feature *f* which is split into two features *f0* and *f1* without collision. With maximum likelihood, if the original weight for *f* is *β*, then the weights for *f0* and

*f1* and *f2* will be *β/2* (or any linear interpolation). With Laplace priors (L1 regularization), the behavior is the same because the penalty is unchanged abs(*β*) = abs(*β*/2) + abs(*β*/2). But, with Gaussian priors (L2 regularization), the penalty is no longer equivalent, because with *β != 0*, *β*^{2} > (*β*/2)^{2} + (*β*/2)^{2}.

Setting aside regularization, let’s work through an example with two topics with the following generative model (adapted from Griffiths’ and Steyvers’ LDA paper):

Word(feature) | |||
---|---|---|---|

Topic | river(0) | bank(1) | loan(2) |

0 | 1/2 | 1/2 | 0 |

1 | 0 | 1/2 | 1/2 |

So topic 0 is about geography and topic 1 about finance. In topic 0, there is a 50% chance of generating the word "river", a 50% chance of generating the word "bank", and no chance of generating the word "loan". It is easy to identify a set of regression parameters that has perfect classification behavior: *β* = (1,0,-1) [the maximum likelihood solution will not actually be identified; with priors, the scale or variance parameter of the prior determines the scale of the coefficients].

Now what happens when we blow out each feature to two features and allow collisions? The generative model is the same, but each feature is replicated. If there is a collision between the first code for "river" and the first code for "loan", the resulting coefficients look like:

Words(feature) | |||||
---|---|---|---|---|---|

Topic | riverA, loanA(0) | riverB(1) | bankA(2) | bankB(3) | loanB(4) |

Coefficients β |
0 | 1 | 0 | 0 | -1 |

The resulting coefficients again produce a perfect classifier. The collision at code 0 is simply a non-discriminative feature and the split versions pick up the slack.

If the collision is between a discriminative and non-discriminative feature code, there is still a perfect set of coefficients:

Words(feature) | |||||
---|---|---|---|---|---|

Topic | riverA(0) | riverB, bankA(1) | bankB(2) | loanA(3) | loanB(4) |

Coefficients β |
1 | 0 | 0 | -1/2 | -1/2 |

Of course, real problems aren’t quite as neat as the examples, and as we pointed out, regularization is non-linear except for maximum likelihood and Laplace priors.

In the real world, we typically find "new" features at runtime (e.g. a character 8-gram that was never seen during training). There is a very long tail for most linguistic processes. Luckily, there is also a lot of redundancy in most classification problems.

This problem doesn’t occur in the hermetically sealed world of a fixed training and test set with feature pruning (as in the ECML KDD 2006 Spam Detection Challenge, which distributed data as bags of words with words occurring only if they occurred 4 or more times in the training data).

May 2, 2008 at 10:26 pm |

For what it’s worth, we tried adding multiple differently hashed copies of the features in some preliminary experiments. We didn’t see any improvement over single copies (I think there was a small consistent decrease in performance). I suspect that this is because we were using a relatively small number of slots (at most twice as many as the number of features). Certainly, if m=n, then adding sufficiently many copies will eventually make classification impossible, since all the features will be mixed with each other. Maybe if there are many more slots than features, then adding multiple copies of features would help, since then most of the second copies will land in empty bins.

May 4, 2008 at 4:53 am |

Another thing worth trying: from the nature of NLP tasks, a small core of the predictive features are very common, while most of the features are rare. Maybe allocating different ranges for common and non-common features (making sure common features does not collide interntally and externally) could be beneficial to classification accuracy. This has the possible (negligible?) cost of saving a table of ~300 entries for the common features.

May 6, 2008 at 11:19 am |

You’d need to inspect each feature to see if it’s common. This reintroduces something like a table lookup for every feature to see if it’s a common feature.

In addition to being common, features need to be selected for discriminative power (either by looking at fitted coefficients in a table-based model or by using some kind of filtering approach like information gain or chi-squared tests).

May 6, 2008 at 12:51 pm |

My intuition is that the small set of core predictive features is quite stable, and that a very good approximation of it can be obtained by training on a relatively small sample of the data (in the setting described by Kuzman, where the interest is mostly in test time, these features can be obtained from all the data). Once this feature set is available, a very efficient recognizer can be compiled. So the overhead seem pretty small.

Ofcourse, the actual benefit from this needs to be verified empirically.