This article shows that it is often possible to find the vast majority of the relevant documents in a collection by starting with a single relevant seed document and using continuous active learning (CAL). This has important implications for making review efficient, making predictive coding practical for smaller document sets, and putting eyes on relevant documents as early as possible, perhaps leading to settlement before too much is spent on document review. It also means that meticulously constructing seed sets and arguing about them with opposing counsel is probably a waste of time if CAL is used.
In one of the sessions at the ACEDS 2014 conference Bill Speros advocated using judgmental sampling (e.g., keyword search) to find relevant documents as training examples for predictive coding rather than using random sampling, which will not find many relevant documents if prevalence is low. I thought to myself that, while I agree with the premise, he should really warn about the possibility of bias and the fact that any probability estimates or relevance scores generated by the classification algorithm could be very distorted. To illustrate the problem of bias I decided that I would do an experiment when I got home. I would start with a single relevant training document and see how many of the relevant documents the system could find. I expected it to find only a subset of the relevant documents that were similar to the single seed document, showing that a set of seed documents that doesn’t have good coverage of the various relevant concepts for the case (i.e., a seed set that is biased) could miss pockets of relevant documents. I would have found what I expected except that I started with continuous active learning (CAL) instead of simple passive learning (SPL), meaning that when I pulled a batch of documents predicted to be most likely to be relevant and reviewed them I allowed the system to learn from the tags that I applied so it could make better predictions when I pulled the next batch (that approach happened to be more convenient with our software at the time). What I found was that as I pulled more and more batches of documents that were predicted to be relevant, allowing the system to update its predictions as I went, it continued to find relevant documents until I had well over 90% recall. It never “got stuck on an island” where it couldn’t get to the remaining relevant documents because they were too different from the documents it had already seen. It has taken me a year to get around to writing about this because I wanted to do more testing.
Could the single seed document I picked have been unusually potent? After achieving 90% recall I randomly selected one of the relevant documents the system didn’t find and started over using that document as the single seed document. If the system had so much trouble finding that document, it must surely be rather different from the other relevant documents and would serve as a very weak seed. I was able to hit 90% recall with CAL even with the single weak seed. The figure below compares the single random seed (left) to the single weak seed (right) for passive learning (top row) and continuous active learning (bottom row). Each bar represents a batch of documents, with the number of documents that were actually relevant represented in orange. Click the figure for a larger view.
The weak seed is seen to be quite ineffective with passive learning (upper right graph). It finds a modest number of relevant documents during the first three batches (no learning between batches). After three batches there are no relevant documents left that are remotely similar to the seed document, so the system only finds relevant documents by tripping over them at random (prevalence is 2.6%). The single random seed on the left did much better with passive learning than the weak seed, but it still ran out of gas while leaving many relevant documents undiscovered. Both seeds worked well with CAL. The first bar in the CAL chart for the weak seed shows it finding only a modest number of relevant documents (the same as SPL), but when it analyzes those documents and learns from them it is able to do much better with the second batch. The initial disadvantage of the weak seed is quickly erased.
If a weak seed works, what about a seed that is totally wrong? I picked a random document that was not remotely close to being relevant and tagged it as relevant. To make things a little more interesting, I also tagged the random seed document from the test above as non-relevant. So, my seed set consisted of two documents telling the system that non-relevant documents are relevant and relevant documents are non-relevant. When I ask the system to give me batches of documents that are predicted to be relevant it will surely give me a bunch of non-relevant documents, which I would then tag correctly. Will it be able find its way to the relevant documents in spite of starting out going in the completely wrong direction? Here is the result:
As expected, the first batch was purely non-relevant documents, but it was able to learn something from them–it learned what types of documents to avoid. In the second batch it managed to stumble across a single relevant (and correctly tagged!) document. The single seed hypothesis says that that single relevant document in the second batch should be enough to find virtually all of the relevant documents, and it was. It hit 97% recall in the graph above (before I decided to stop it). Also, the software warned me when the calculation was finished that the two seed documents appeared to be tagged incorrectly–my attempt at sabotage was detected!
Before proceeding with more experiments, I want to mention a point about efficiency. When the system has seen only a single relevant training document it doesn’t know which words in the document made it relevant, so the first batch of predictions may not be very good–it may pick documents because they contain certain words seen in the seed document when those words were not particularly important. As a result, it is more efficient to do a few small batches to allow it to sort out which words really matter before moving on to larger batches. Optimal batch size depends on the task, the document set, and the classification algorithm, but small is generally better than large.
Maybe the relevant documents for the categorization task performed above were too homogeneous. Could it find everything if my definition of relevance included documents that were extremely different from each other? To test that I used a bunch of news articles and I defined a document to be relevant if it was about golf or biology. There were no articles that were about both golf and biology. If I seeded it with a single biology article, could it find the golf articles and vice versa? This figure shows the results:
It achieved 88% recall after reviewing 5.5% of the document population in both cases (prevalence was 3.6%). The top graph was seeded with a single random biology article, whereas the bottom one was seeded with a single random golf article. Golf articles are much easier for an algorithm to identify (once it has seen one in training) than biology articles. The bottom graph should serve as a warning to not give up too soon if it looks like the system is no longer finding relevant documents.
How did it go from finding articles about biology to articles about golf? The first golf article was found because it contained words like Watson, stole, Stanford, British, etc. None of the words would be considered very strong indicators of relevance for biology. When the low-hanging fruit has already been picked, the algorithm is going to start probing documents containing words that have been seen in relevant documents but whose importance is less certain. If that leads to a new relevant document, the system is exposed to new words that may be good indicators of relevance (e.g., golf-related words in this case), leading to additional relevant documents in later batches. If the documents it probes turn out to be non-relevant, it learns that those words weren’t good indicators of relevance and it heads in a different direction. You can think of it as being like the Six Degrees of Kevin Bacon game–the algorithm can get from the single seed document to virtually any relevant document by hopping from one relevant document to another via the words they have in common, discovering new words that allow connecting to different documents as it goes.
If it is possible to find most of the relevant documents from a single seed, is it an efficient approach? The figure below addresses that question. The arrows indicate various recall levels.
The first graph above shows SPL with the optimal amount of random training to reach 75% recall with the least total document review. The first eight batches are the random training documents–you can see that those batches contain very few relevant documents. After the eight training batches, documents that are predicted to be relevant are pulled in batches. For SPL, the system is not allowed to learn beyond the initial random training documents. The second graph shows CAL with the same set of random training documents. You can see that it reached 75% recall more quickly, and it reached 88% recall with the amount of document review that SPL took to reach 75% recall. The final graph shows CAL with a single seed. You can see in the figure above that two small batches of documents predicted to be relevant were reviewed before moving to full-sized batches.
The figure shows that the single seed CAL result usually hit a recall level two or three batches later than the more heavily trained CAL result, but it also had nearly eight batches less of “training” data (I’m putting quotes around training here because all reviewed documents are really training documents with CAL–the system learns from all of them), so the improvement from the random training data (2 or 3 batches less of “review”) wasn’t sufficient to cover the cost (8 more batches of “training”). The relative benefit of training with randomly selected documents may vary depending on the situation (e.g., reducing the “review” phase at the expense of more “training” for a larger document collection may be more worthwhile), but at least in the example above random sampling for training isn’t worthwhile beyond finding the first relevant seed document, which could probably be found more efficiently with keyword search. Judgmental sampling may be worthwhile if it is good at finding a diverse set of relevant documents while avoiding non-relevant ones.
The table below shows the proportion of the document set that must be reviewed, including training documents, to reach 75% recall for several different categorization tasks with varying prevalence and difficulty. In each case SPL was trained with a set of random documents with size optimized to achieve 75% recall with minimal review. The result called simply “CAL” uses the same random training set as the SPL result but allows learning to continue when batches of relevant documents are pulled. It would be unusual to use a large amount of random training documents with CAL, rather than using judgmental sampling, but I wanted to be able to show how much CAL improves on SPL with the same seed set and then show the additional benefit of reducing the seed set down to a single relevant document.
||CAL rand SS
||CAL weak SS
In every case CAL beat SPL, and a single seed (whether random or weak) was always better than using CAL with the full training set that was used for SPL. Of course, it is possible that CAL with a smaller random seed set or a judgmental sample would be better than CAL with a single seed.
Since the algorithm hops from one relevant document to another by probing words that the documents have in common, will it get stuck if the documents are short because there are fewer words to explore? To test that I truncated each document at 350 characters, being careful not to cut any words in half. With an average of 95% of the document text removed, there will surely be some documents where all of the text that made them relevant is gone, so they’ll be tagged as relevant but there is virtually nothing in the text to justify considering them to be relevant, which will make performance metrics look bad. This table gives the percentage of the document population that must be reviewed, including any training, to reach 75% recall compared to SPL (trained with optimal number of random docs to reach 75% recall):
||CAL rand SS
||CAL weak SS
The table shows that CAL with a single seed requires much less document review than SPL regardless of whether the documents are long or short, and even a weak seed works with the short documents.
How Seed Sets Influence Which Documents are Found
I’ve shown that you can find most of the relevant documents with CAL using any single seed, but does the seed impact which relevant documents you find? The answer is that it has very little impact. Whatever seed you start with, the algorithm is going to want to move toward the relevant documents that are easiest for it to identify. It may take a few batches before it gets exposed to the features that lead it to the easy documents (e.g., words that are strong indicators of relevance but are also fairly common, so it is easy to encounter them and they lead to a lot of relevant documents), but once it encounters the relevant documents that are easy to identify they will quickly overwhelm the few oddball relevant documents that may have come from a weak seed. The predictions that generate later batches are heavily influenced by the relevant documents that are easy for the algorithm to find, and each additional batch of documents and associated learning erases more and more of impact from the starting seed. To illustrate this point, I ran calculations for a difficult categorization task (relevant documents were scattered across many small concept clusters) and achieved exactly 75% recall using various approaches. I then compared the results to see how many documents the different approaches had in common. Each approach found 848 relevant documents. Here is the overlap between the results:
||Num Relevant Docs in Common
||Algorithm 1: CAL Rand SS v. CAL Weak SS
||Algorithm 2: CAL Rand SS v. CAL Weak SS
||Algorithm 1 CAL Rand SS v. Algorithm 2 CAL Rand SS
||Algorithm 1: SPL Rand1 v. SPL Rand2
||Algorithm 1: SPL Rand1 v. CAL Rand SS
||Algorithm 1: SPL Rand1 v. SPL Biased Seed
||Algorithm 1: CAL Rand SS v. CAL Biased Seed
The maximum possible number of documents that two calculations can have in common is 848 (i.e., all of them), and the absolute minimum is 565 because if one approach gives 75% recall a second approach can find at most the 25% of the full set of relevant documents the first didn’t find and then it must resort to finding documents that the first approach found. If two approaches were completely independent (think of one approach as picking relevant documents randomly), you would expect them to have 636 documents in common. So, it is reasonable to expect the numbers in the right column of the table to lie between 636 and 848, and they probably shouldn’t get very close to 636 since no two predictive coding approaches should be completely independent because they will detect many of the same patterns that make a document relevant.
Row 1 of the table shows that CAL with the same algorithm and two different single seeds, one random and one weak, give nearly the same set of relevant documents, with 822 of the 848 relevant documents found being the same. Row 2 shows that the result from Row 1 also holds for a different classification algorithm. Row 3 shows that if we use two different classification algorithms but use the same seed, the agreement between the results is much lower at just 724 documents. In other words, Rows 1-3 combined show that the specific relevant documents found with CAL depends much more on the classification algorithm used than on the seed document(s).
Row 4 shows that SPL with two different training sets of 4,000 random documents generates results with modest agreement, and Row 5 shows that the agreement is comparable between SPL trained with 4,000 random documents and CAL with a single random seed, so the CAL result is not particularly abnormal compared to SPL.
Row 6 compares SPL with 4,000 random training documents to SPL with a training set from a very biased search query that yielded 272 documents with 51 of them being relevant (prevalence for the full document population is 1.1%). The biased training set with SPL gives a result that is more different from SPL with random training than anything else tested. In other words, with SPL any bias in the training set is reflected in the final result. Row 7 shows that when that same biased training set is fed to CAL it has virtually no impact on which relevant documents are returned–the result is almost the same as the single seed with CAL.
Other Classification Algorithms
Will other classification algorithms work with a single relevant seed document? I tried a different (not terribly good) classification algorithm and it did work with a single seed, although it took somewhat more document review to reach the same level of recall. Keep in mind that all predictive coding software is different (there are many layers of algorithms and many different algorithms at each layer), so your mileage may vary. You should consult with your vendor when considering a different workflow, and always test the results for each case to ensure there are no surprises. The algorithm’s hyperparameters (e.g., amount of regularization) may need to be adjusted for optimal performance with single seed CAL.
Can It Fail?
A study by Grossman and Cormack shows a case where the single seed hypothesis failed for Topic 203 in Table 6 on page 159. They used a random sample containing two relevant documents and applied CAL, but the system had to go through 88% of the document population to find 75% of the relevant documents. It didn’t merely do a bad job of finding relevant documents–it actively avoided them (worse than reviewing documents randomly)! Gordon Cormack was kind enough to reply to my emails about this odd occurrence. I’m not sure this one is fully understood (machine learning can get a little complicated under the hood), but I think it’s fair to say that there was a strange confluence between some odd data and the way the algorithm interpreted it that allowed the algorithm to get stuck and not learn.
Here are some things that I could see (pure conjecture without any testing) potentially causing problems. If the document population contains multiple languages, I would not expect a single seed to be enough. One relevant seed document per language would be required because I don’t think documents in different languages would have enough words in common for the algorithm to wander across a language boundary. A classification algorithm that is too stiff (e.g., too much regularization) may fail to find significant pockets of documents–you really want the system to easily get pushed in different directions when a new relevant document is discovered. If there is a particular type of document in the population that contains the same common chunk of text in every document while relevance is determined by some other part of the document, you may need a relevant seed document from that document type or the system may conclude that the common chunk of text that occurs in all documents of that type is such a strong non-relevance indicator that they may not be probed enough to learn that some of them are relevant.
My tests were performed on fairly clean documents with no near-dupes. Actual e-discovery data can be uglier, and it can be hard to determine how a specific algorithm will react to it. There is no guarantee that a single relevant seed document will be enough, but the experiments I’ve described should at least suggest that with CAL the seed set can be quite minimal, which allows relevant documents to be reviewed earlier in the process. Avoiding large training sets also means that predictive coding with CAL can be worthwhile for smaller document collections. Finally, with CAL, unlike SPL, the specific relevant documents that are found depend almost entirely on the algorithm used, not the seed set, so there is little point in arguing about seed set quality if CAL is used.