Monthly Archives: April 2015

The Single Seed Hypothesis

This article shows that it is often possible to find the vast majority of the relevant documents in a collection by starting with a single relevant seed document and using continuous active learning (CAL). This has important implications for making review efficient, making predictive coding practical for smaller document sets, and putting eyes on relevant documents as early as possible, perhaps leading to settlement before too much is spent on document review. It also means that meticulously constructing seed sets and arguing about them with opposing counsel is probably a waste of time if CAL is used.

In one of the sessions at the ACEDS 2014 conference Bill Speros advocated using judgmental sampling (e.g., keyword search) to find relevant documents as training examples for predictive coding rather than using random sampling, which will not find many relevant documents if prevalence is low. I thought to myself that, while I agree with the premise, he should really warn about the possibility of bias and the fact that any probability estimates or relevance scores generated by the classification algorithm could be very distorted. To illustrate the problem of bias I decided that I would do an experiment when I got home. I would start with a single relevant training document and see how many of the relevant documents the system could find. I expected it to find only a subset of the relevant documents that were similar to the single seed document, showing that a set of seed documents that doesn’t have good coverage of the various relevant concepts for the case (i.e., a seed set that is biased) could miss pockets of relevant documents. I would have found what I expected except that I started with continuous active learning (CAL) instead of simple passive learning (SPL), meaning that when I pulled a batch of documents predicted to be most likely to be relevant and reviewed them I allowed the system to learn from the tags that I applied so it could make better predictions when I pulled the next batch (that approach happened to be more convenient with our software at the time). What I found was that as I pulled more and more batches of documents that were predicted to be relevant, allowing the system to update its predictions as I went, it continued to find relevant documents until I had well over 90% recall. It never “got stuck on an island” where it couldn’t get to the remaining relevant documents because they were too different from the documents it had already seen. It has taken me a year to get around to writing about this because I wanted to do more testing.

Weak Seed

Could the single seed document I picked have been unusually potent? After achieving 90% recall I randomly selected one of the relevant documents the system didn’t find and started over using that document as the single seed document. If the system had so much trouble finding that document, it must surely be rather different from the other relevant documents and would serve as a very weak seed. I was able to hit 90% recall with CAL even with the single weak seed. The figure below compares the single random seed (left) to the single weak seed (right) for passive learning (top row) and continuous active learning (bottom row). Each bar represents a batch of documents, with the number of documents that were actually relevant represented in orange. Click the figure for a larger view.

The weak seed is seen to be quite ineffective with passive learning (upper right graph). It finds a modest number of relevant documents during the first three batches (no learning between batches). After three batches there are no relevant documents left that are remotely similar to the seed document, so the system only finds relevant documents by tripping over them at random (prevalence is 2.6%). The single random seed on the left did much better with passive learning than the weak seed, but it still ran out of gas while leaving many relevant documents undiscovered. Both seeds worked well with CAL. The first bar in the CAL chart for the weak seed shows it finding only a modest number of relevant documents (the same as SPL), but when it analyzes those documents and learns from them it is able to do much better with the second batch. The initial disadvantage of the weak seed is quickly erased.

Wrong Seed

If a weak seed works, what about a seed that is totally wrong? I picked a random document that was not remotely close to being relevant and tagged it as relevant. To make things a little more interesting, I also tagged the random seed document from the test above as non-relevant. So, my seed set consisted of two documents telling the system that non-relevant documents are relevant and relevant documents are non-relevant. When I ask the system to give me batches of documents that are predicted to be relevant it will surely give me a bunch of non-relevant documents, which I would then tag correctly. Will it be able find its way to the relevant documents in spite of starting out going in the completely wrong direction? Here is the result:

As expected, the first batch was purely non-relevant documents, but it was able to learn something from them–it learned what types of documents to avoid. In the second batch it managed to stumble across a single relevant (and correctly tagged!) document. The single seed hypothesis says that that single relevant document in the second batch should be enough to find virtually all of the relevant documents, and it was. It hit 97% recall in the graph above (before I decided to stop it). Also, the software warned me when the calculation was finished that the two seed documents appeared to be tagged incorrectly–my attempt at sabotage was detected!

Before proceeding with more experiments, I want to mention a point about efficiency. When the system has seen only a single relevant training document it doesn’t know which words in the document made it relevant, so the first batch of predictions may not be very good–it may pick documents because they contain certain words seen in the seed document when those words were not particularly important. As a result, it is more efficient to do a few small batches to allow it to sort out which words really matter before moving on to larger batches. Optimal batch size depends on the task, the document set, and the classification algorithm, but small is generally better than large.

Disjoint Relevance

Maybe the relevant documents for the categorization task performed above were too homogeneous. Could it find everything if my definition of relevance included documents that were extremely different from each other? To test that I used a bunch of news articles and I defined a document to be relevant if it was about golf or biology. There were no articles that were about both golf and biology. If I seeded it with a single biology article, could it find the golf articles and vice versa? This figure shows the results:

It achieved 88% recall after reviewing 5.5% of the document population in both cases (prevalence was 3.6%). The top graph was seeded with a single random biology article, whereas the bottom one was seeded with a single random golf article. Golf articles are much easier for an algorithm to identify (once it has seen one in training) than biology articles. The bottom graph should serve as a warning to not give up too soon if it looks like the system is no longer finding relevant documents.

How did it go from finding articles about biology to articles about golf? The first golf article was found because it contained words like Watson, stole, Stanford, British, etc. None of the words would be considered very strong indicators of relevance for biology. When the low-hanging fruit has already been picked, the algorithm is going to start probing documents containing words that have been seen in relevant documents but whose importance is less certain. If that leads to a new relevant document, the system is exposed to new words that may be good indicators of relevance (e.g., golf-related words in this case), leading to additional relevant documents in later batches. If the documents it probes turn out to be non-relevant, it learns that those words weren’t good indicators of relevance and it heads in a different direction. You can think of it as being like the Six Degrees of Kevin Bacon game–the algorithm can get from the single seed document to virtually any relevant document by hopping from one relevant document to another via the words they have in common, discovering new words that allow connecting to different documents as it goes.

Performance

If it is possible to find most of the relevant documents from a single seed, is it an efficient approach? The figure below addresses that question. The arrows indicate various recall levels.

The first graph above shows SPL with the optimal amount of random training to reach 75% recall with the least total document review. The first eight batches are the random training documents–you can see that those batches contain very few relevant documents. After the eight training batches, documents that are predicted to be relevant are pulled in batches. For SPL, the system is not allowed to learn beyond the initial random training documents. The second graph shows CAL with the same set of random training documents. You can see that it reached 75% recall more quickly, and it reached 88% recall with the amount of document review that SPL took to reach 75% recall. The final graph shows CAL with a single seed. You can see in the figure above that two small batches of documents predicted to be relevant were reviewed before moving to full-sized batches.

The figure shows that the single seed CAL result usually hit a recall level two or three batches later than the more heavily trained CAL result, but it also had nearly eight batches less of “training” data (I’m putting quotes around training here because all reviewed documents are really training documents with CAL–the system learns from all of them), so the improvement from the random training data (2 or 3 batches less of “review”) wasn’t sufficient to cover the cost (8 more batches of “training”). The relative benefit of training with randomly selected documents may vary depending on the situation (e.g., reducing the “review” phase at the expense of more “training” for a larger document collection may be more worthwhile), but at least in the example above random sampling for training isn’t worthwhile beyond finding the first relevant seed document, which could probably be found more efficiently with keyword search. Judgmental sampling may be worthwhile if it is good at finding a diverse set of relevant documents while avoiding non-relevant ones.

The table below shows the proportion of the document set that must be reviewed, including training documents, to reach 75% recall for several different categorization tasks with varying prevalence and difficulty. In each case SPL was trained with a set of random documents with size optimized to achieve 75% recall with minimal review. The result called simply “CAL” uses the same random training set as the SPL result but allows learning to continue when batches of relevant documents are pulled. It would be unusual to use a large amount of random training documents with CAL, rather than using judgmental sampling, but I wanted to be able to show how much CAL improves on SPL with the same seed set and then show the additional benefit of reducing the seed set down to a single relevant document.

Task	Prevalence	SPL	CAL	CAL rand SS	CAL weak SS
1	6.9%	10.9%	8.3%	7.7%	7.6%
2	4.1%	4.3%	3.7%	3.5%	3.5%
3	2.9%	16.6%	12.5%	8.6%	8.8%
4	1.1%	10.9%	6.7%	3.2%	3.2%
5	0.68%	1.8%	1.8%	0.8%	0.9%
6	0.52%	29.6%	8.4%	5.2%	7.1%
7	0.32%	25.7%	17.1%	2.6%	2.6%

In every case CAL beat SPL, and a single seed (whether random or weak) was always better than using CAL with the full training set that was used for SPL. Of course, it is possible that CAL with a smaller random seed set or a judgmental sample would be better than CAL with a single seed.

Short Documents

Since the algorithm hops from one relevant document to another by probing words that the documents have in common, will it get stuck if the documents are short because there are fewer words to explore? To test that I truncated each document at 350 characters, being careful not to cut any words in half. With an average of 95% of the document text removed, there will surely be some documents where all of the text that made them relevant is gone, so they’ll be tagged as relevant but there is virtually nothing in the text to justify considering them to be relevant, which will make performance metrics look bad. This table gives the percentage of the document population that must be reviewed, including any training, to reach 75% recall compared to SPL (trained with optimal number of random docs to reach 75% recall):

Task	Prevalence	SPL	CAL rand SS	CAL weak SS
Full Docs	2.6%	5.0%	3.0%	3.1%
Truncated Docs	2.6%	19.4%	7.1%	6.9%

The table shows that CAL with a single seed requires much less document review than SPL regardless of whether the documents are long or short, and even a weak seed works with the short documents.

How Seed Sets Influence Which Documents are Found

I’ve shown that you can find most of the relevant documents with CAL using any single seed, but does the seed impact which relevant documents you find? The answer is that it has very little impact. Whatever seed you start with, the algorithm is going to want to move toward the relevant documents that are easiest for it to identify. It may take a few batches before it gets exposed to the features that lead it to the easy documents (e.g., words that are strong indicators of relevance but are also fairly common, so it is easy to encounter them and they lead to a lot of relevant documents), but once it encounters the relevant documents that are easy to identify they will quickly overwhelm the few oddball relevant documents that may have come from a weak seed. The predictions that generate later batches are heavily influenced by the relevant documents that are easy for the algorithm to find, and each additional batch of documents and associated learning erases more and more of impact from the starting seed. To illustrate this point, I ran calculations for a difficult categorization task (relevant documents were scattered across many small concept clusters) and achieved exactly 75% recall using various approaches. I then compared the results to see how many documents the different approaches had in common. Each approach found 848 relevant documents. Here is the overlap between the results:

Row	Comparison	Num Relevant Docs in Common
1	Algorithm 1: CAL Rand SS v. CAL Weak SS	822
2	Algorithm 2: CAL Rand SS v. CAL Weak SS	821
3	Algorithm 1 CAL Rand SS v. Algorithm 2 CAL Rand SS	724
4	Algorithm 1: SPL Rand₁ v. SPL Rand₂	724
5	Algorithm 1: SPL Rand₁ v. CAL Rand SS	725
6	Algorithm 1: SPL Rand₁ v. SPL Biased Seed	708
7	Algorithm 1: CAL Rand SS v. CAL Biased Seed	824

The maximum possible number of documents that two calculations can have in common is 848 (i.e., all of them), and the absolute minimum is 565 because if one approach gives 75% recall a second approach can find at most the 25% of the full set of relevant documents the first didn’t find and then it must resort to finding documents that the first approach found. If two approaches were completely independent (think of one approach as picking relevant documents randomly), you would expect them to have 636 documents in common. So, it is reasonable to expect the numbers in the right column of the table to lie between 636 and 848, and they probably shouldn’t get very close to 636 since no two predictive coding approaches should be completely independent because they will detect many of the same patterns that make a document relevant.

Row 1 of the table shows that CAL with the same algorithm and two different single seeds, one random and one weak, give nearly the same set of relevant documents, with 822 of the 848 relevant documents found being the same. Row 2 shows that the result from Row 1 also holds for a different classification algorithm. Row 3 shows that if we use two different classification algorithms but use the same seed, the agreement between the results is much lower at just 724 documents. In other words, Rows 1-3 combined show that the specific relevant documents found with CAL depends much more on the classification algorithm used than on the seed document(s).

Row 4 shows that SPL with two different training sets of 4,000 random documents generates results with modest agreement, and Row 5 shows that the agreement is comparable between SPL trained with 4,000 random documents and CAL with a single random seed, so the CAL result is not particularly abnormal compared to SPL.

Row 6 compares SPL with 4,000 random training documents to SPL with a training set from a very biased search query that yielded 272 documents with 51 of them being relevant (prevalence for the full document population is 1.1%). The biased training set with SPL gives a result that is more different from SPL with random training than anything else tested. In other words, with SPL any bias in the training set is reflected in the final result. Row 7 shows that when that same biased training set is fed to CAL it has virtually no impact on which relevant documents are returned–the result is almost the same as the single seed with CAL.

Other Classification Algorithms

Will other classification algorithms work with a single relevant seed document? I tried a different (not terribly good) classification algorithm and it did work with a single seed, although it took somewhat more document review to reach the same level of recall. Keep in mind that all predictive coding software is different (there are many layers of algorithms and many different algorithms at each layer), so your mileage may vary. You should consult with your vendor when considering a different workflow, and always test the results for each case to ensure there are no surprises. The algorithm’s hyperparameters (e.g., amount of regularization) may need to be adjusted for optimal performance with single seed CAL.

Can It Fail?

A study by Grossman and Cormack shows a case where the single seed hypothesis failed for Topic 203 in Table 6 on page 159. They used a random sample containing two relevant documents and applied CAL, but the system had to go through 88% of the document population to find 75% of the relevant documents. It didn’t merely do a bad job of finding relevant documents–it actively avoided them (worse than reviewing documents randomly)! Gordon Cormack was kind enough to reply to my emails about this odd occurrence. I’m not sure this one is fully understood (machine learning can get a little complicated under the hood), but I think it’s fair to say that there was a strange confluence between some odd data and the way the algorithm interpreted it that allowed the algorithm to get stuck and not learn.

Here are some things that I could see (pure conjecture without any testing) potentially causing problems. If the document population contains multiple languages, I would not expect a single seed to be enough. One relevant seed document per language would be required because I don’t think documents in different languages would have enough words in common for the algorithm to wander across a language boundary. A classification algorithm that is too stiff (e.g., too much regularization) may fail to find significant pockets of documents–you really want the system to easily get pushed in different directions when a new relevant document is discovered. If there is a particular type of document in the population that contains the same common chunk of text in every document while relevance is determined by some other part of the document, you may need a relevant seed document from that document type or the system may conclude that the common chunk of text that occurs in all documents of that type is such a strong non-relevance indicator that they may not be probed enough to learn that some of them are relevant.

Conclusions

My tests were performed on fairly clean documents with no near-dupes. Actual e-discovery data can be uglier, and it can be hard to determine how a specific algorithm will react to it. There is no guarantee that a single relevant seed document will be enough, but the experiments I’ve described should at least suggest that with CAL the seed set can be quite minimal, which allows relevant documents to be reviewed earlier in the process. Avoiding large training sets also means that predictive coding with CAL can be worthwhile for smaller document collections. Finally, with CAL, unlike SPL, the specific relevant documents that are found depend almost entirely on the algorithm used, not the seed set, so there is little point in arguing about seed set quality if CAL is used.

Can You Really Compete in TREC Retroactively?

1 Reply

I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football). I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.

One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years. Still, how could you do the comparison if you wanted to? The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails. Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.

Stratified sampling means that the documents are separated into mutually-exclusive buckets called “strata.” To the degree that stratification manages to put similar things into the same stratum, it can produce better statistical estimates (smaller uncertainty for a given amount of document review). The TREC Legal Track for 2009 created a stratum containing documents that all participants agreed were relevant. It also created four strata containing documents that all but one participant predicted were relevant (there were four participants, so one stratum for each dissenting participant). There were six strata where two participants agreed on relevance, and four strata where only one of the four participants predicted the documents were relevant. Finally, there was one stratum containing documents that all participants predicted were non-relevant, which was called the “All-N” stratum. So, for each stratum a particular participant either predicted that all of the documents were relevant or they predicted that all of the documents were non-relevant. You can view details about the strata in table 21 on page 39 here. Here is an example of what a stratification might look like for just two participants (the number of documents shown and percentage that are relevant may differ from the actual data):

A random subset of documents from each stratum was chosen and reviewed so that the percentage of the documents in the stratum that were relevant could be estimated. Multiplying that percentage by the number of documents in the stratum gives an estimate for the number of relevant documents in the stratum. Combining the results for the various strata allows precision and recall estimates to be computed for each participant. How could this be done for a team that didn’t participate? Before presenting some ideas, it will be useful to have some notation:

N[i] = number of documents in stratum i
n[i] = num docs in i that were assessed by TREC
n⁺[i] = num docs in i that TREC assessed as relevant
V[i] = num docs in i that vendor predicted were relevant
v[i] = num docs in i that vendor predicted were relevant and were assessed by TREC
v⁺[i] = num docs in i that vendor predicted were relevant and assessed as relevant by TREC

To make some of the discussion below more concrete, I’ll provide formulas for computing the number of true positives (TP), false positives (FP), and false negatives (FN). The recall and precision can then be computed from:

R = TP / (TP + FN)
P = TP / (TP + FP)

Here are some ideas I came up with:

1) They could have checked to see which strata the documents they predicted to be relevant fell into and applied the percentages TREC computed to their data. The problem is that since they probably didn’t identify all of the documents in a stratum as being relevant the percentage of documents that were estimated to be relevant for the stratum by TREC wouldn’t really be applicable. If their system worked really well, they may have only predicted that the truly relevant documents from the stratum were relevant. If their system worked badly, their system may have predicted that only the truly non-relevant documents from the stratum were relevant. This approach could give estimates that are systematically too low or too high. Here are the relevant formulas (summing over strata, i):

TP = Sum{ V[i] * n⁺[i] / n[i] }
FP = Sum{ V[i] * (1 – n⁺[i]/n[i]) }
FN = Sum{ (N[i] – V[i]) * n⁺[i] / n[i] }

2) Instead of using the percentages computed by TREC, they could have computed their own percentages by looking at only the documents in the stratum that they predicted were relevant and were reviewed by TREC to give a relevance determination. This would eliminate the possible bias from approach (1), but it also means that the percentages would be computed from a smaller sample, so the uncertainty in the percentage that are relevant would be bigger. The vendor didn’t provide confidence intervals for their results. Here is how the computation would go:

TP = Sum{ V[i] * v⁺[i] / v[i] }
FP = Sum{ V[i] * (1 – v⁺[i]/v[i]) }
FN = Sum{ (N[i] – V[i]) * (n⁺[i] – v⁺[i]) / (n[i] – v[i]) }

It’s possible that for some strata there would be no overlap between the documents TREC assessed and the documents the vendor predicted to be relevant since TREC typically assessed only about 4% of each stratum for Topic 207 (except the All-N stratum, where they assessed only 0.46%). This approach wouldn’t work for those strata since v[i] would be 0. For strata where v[i] is 0, one might use approach (1) and hope it isn’t too wrong.

3) A more sophisticated tweak on (2) would be to use the ratio n⁺[i]/n[i] from (1) to generate a Bayesian prior probability distribution for the proportion of documents predicted by the vendor to be relevant that actually are relevant, and then use v⁺[i] and v[i] to compute a posterior distribution for that proportion and use the mean of that distribution instead of v⁺[i]/v[i] in the computation. The idea is to have a smooth interpolation between using n⁺[i]/n[i] and using v⁺[i]/v[i] as the proportion of documents estimated to be relevant, where the interpolation would be closer to v⁺[i]/v[i] if v[i] is large (i.e., if there is enough data for v⁺[i]/v[i] to be reasonably accurate). The result would be sensitive to choices made in creating the Bayesian prior (i.e., how much variance to give the probability distribution), however.

4) They could have ignored all of the documents that weren’t reviewed in TREC (over 500,000 of them) and just performed their predictions and analysis on the 3,709 documents that had relevance assessments (training documents should come from the set TREC didn’t assess and should be reviewed by the vendor to simulate actual training at TREC being done by the participants). It would be very important to weight the results to compensate for the fact that those 3,709 documents didn’t all have the same probability of being selected for review. TREC oversampled the documents that were predicted to be relevant compared to the remainder (i.e., the number of documents sampled from a stratum was not simply proportional to the number of documents in the stratum), which allowed their stratification scheme to do a good job of comparing the participating teams to each other at the expense of having large uncertainty for some quantities like the total number of relevant documents. The prevalence of relevant documents in the full population was 1.5%, but 9.0% of the documents having relevance assessments were relevant. Without weighting the results to compensate for the uneven sampling, you would be throwing away over half a million non-relevant documents without giving the system being tested the opportunity to incorrectly predict that some of them are relevant, which would lead to an inflated precision estimate. The expression “shooting fish in a barrel” comes to mind. Weighting would be accomplished by dividing by the probability of the document having been chosen (after this article was published I learned that this is called the Horvitz-Thompson estimator, and it is what the TREC evaluation toolkit uses), which is just n[i]/N[i], so the computation would be:

TP = Sum{ (N[i]/n[i]) * v⁺[i] }
FP = Sum{ (N[i]/n[i]) * (v[i] – v⁺[i]) }
FN = Sum{ (N[i]/n[i]) * (n⁺[i] – v⁺[i]) }

Note that if N[i]/n[i] is equal to V[i]/v[i], which is expected to be approximately true since the subset of a stratum chosen for assessment by TREC is random, the result would be equal to that from (2). If N[i]/n[i] is not equal to V[i]/v[i] for a stratum, we would have the disturbing result that the estimate for TP+FP for that stratum would not equal the number of documents the vendor predicted to be relevant for that stratum, V[i].

5) The vendor could have ignored the TREC relevance determinations, simply doing their own. That would be highly biased in the vendor’s favor because there would be a level of consistency between relevance determinations for the training data and testing data that did not exist for TREC participants. At TREC the participants made their own relevance determinations to train their systems and a separate set of Topic Authorities made the final relevance judgments that determined the performance numbers. To the degree that participants came to different conclusions about relevance compared to the Topic Authorities, their performance numbers would suffer. A more subtle problem with this approach is that the vendor’s interpretation of the relevance criteria would inevitably be somewhat different from that of TREC assessors (studies have shown poor agreement between different review teams), which could make the classification task either easier or harder for a computer. As an extreme example, if the vendor took all documents containing the word “football” to be relevant and all other documents to be non-relevant, it would be very easy for a predictive coding system to identify that pattern and achieve good performance numbers.

Approaches (1)-(4) would all give the same results for the original TREC participants because for each stratum they would either have V[i]=0 (so v[i]=0 and v⁺[i]=0) or they would have V[i]=N[i] (so v[i]=n[i] and v⁺[i]=n⁺[i]). The approaches differ in how they account for the vendor predicting that only a subset of a stratum is relevant. None of the approaches described are great. Is there a better approach that I missed? TREC designed their strata to make the best possible comparisons between the participants. It’s hard to imagine how an analysis could be as accurate for a system that was not taken into account in the stratification process. If a vendor is tempted to make such comparisons, they should at least disclose their methodology and provide confidence intervals on their results so prospective clients can determine whether the performance numbers are actually meaningful.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.