Predictive Coding Confusion

This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.

Confidence intervals vs. training set size:  You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property.  The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population.  If you sample 400 documents, the 95% confidence interval is +/- 5%, meaning that 95% of the time the range from -5% to +5% around your estimate will contain the actual value for the full population.  For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will enclose the actual prevalence for the full document set.  To cut the size of the confidence interval in half you need four times as many documents, so a sample of 1,600 documents gives a 95% confidence interval of +/- 2.5%.  The sample size needed to achieve a confidence interval of a certain size does not depend on the number of documents in the full population (unless the sample is a substantial proportion of the entire document population, which would be strange), so sample sizes like 400 or 1,600 documents can be committed to memory and applied to any document set.
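The +/- 5% and +/- 2.5% figures come from the worst-case normal approximation for a proportion (prevalence of 50%); a quick sketch in Python showing why quadrupling the sample halves the interval:

```python
import math

def margin_of_error(n, z=1.96):
    """Worst-case 95% margin of error for a proportion estimated from a
    simple random sample of n documents (prevalence = 50%, normal
    approximation): z * sqrt(p * (1 - p) / n) with p = 0.5."""
    return z * math.sqrt(0.25 / n)

print(margin_of_error(400))    # ~0.049, i.e. roughly +/- 5%
print(margin_of_error(1600))   # ~0.0245, i.e. roughly +/- 2.5%
```

Note that n never multiplies against the population size, which is why the same sample sizes work for any document set.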

Sample sizes related to confidence intervals have nothing to do with sample sizes needed to train a predictive coding algorithm.  Confidence intervals are for estimating the number of relevant documents, but training is about teaching the system to identify which documents are relevant.  A pollster could survey 1,600 voters and estimate the number that would vote for a particular candidate to within +/- 2.5%, but that would not enable him/her to predict who some arbitrary person would vote for — that’s just a completely different problem from estimating the number of votes.  For a predictive coding system to make good predictions it needs enough training documents for it to identify the patterns that indicate relevance.  The number of documents required for training depends on the algorithm used and the difficulty of the categorization task.  To illustrate that point, consider the two categorization tasks mentioned in my previous article.  Both involve the same set of 100,000 documents and have nearly the same prevalence of relevant documents (0.986% and 1.131%), but the difficulty is very different.  I measured the optimal number of random training documents to achieve 75% recall while reviewing the smallest possible total number of documents (training + review) and found:

          Training Docs    Review Docs    Total Docs Reviewed
Task 1    300              800            1,100
Task 2    4,500            6,500          11,000

If the number of training documents is below the optimal level, the predictions won’t be very good (low precision) and you’ll have to review an excessive number of non-relevant documents to find 75% of the relevant documents.  If the number of training documents is above the optimal level, the benefit from higher precision achieved because of the extra training won’t be sufficient to offset the cost of reviewing the additional training documents.  You can see from the table that there is a factor of 15 difference in the optimal number of training documents for the two tasks.   Unlike the number of documents needed to achieve a certain confidence interval, there is no simple answer when it comes to the number of documents needed for training.

Sampling counts documents not importance: Sampling allows you to estimate the number of relevant documents that were missed by the predictive coding algorithm.  That doesn’t tell you anything about the importance of the documents that were missed.  As discussed in my article on relevance score, predictive coding algorithms put the documents that they are most confident are relevant at the top of the list.  The documents at the top are not necessarily the most important documents for the case.  To the extent that a “smoking gun” is very different from any of the documents in the training set, the algorithm may have little confidence that it is relevant, so it may get a modest or even low relevance score.  Claims that nothing critical is lost when documents below some relevance score cutoff are culled, merely because the number of relevant documents discarded is small, are simply unfounded.

Beware small percentages of large numbers: Suppose that a predictive coding system identifies 30,000 documents out of a population of a million as likely to be relevant, and the vendor claims 95% precision was achieved, meaning that 95% of the documents predicted to be relevant actually are relevant.  The vendor also claims, based on sampling, that only 1% of the predictions were false negatives, meaning that the system predicted that the documents were non-relevant when they were actually relevant.  How many relevant documents were found, and how many were missed?  The number found is:

95% * 30,000 = 28,500

The number missed is (I’m over-counting a little by not subtracting out documents used for training [i.e. no prediction was made for them], which were presumably a small fraction of the full population):

1% * 1,000,000 = 10,000

The recall, the percentage of relevant documents that were actually found by the predictive coding system, is (using point estimates in a non-linear equation like this is somewhat wrong, but in this case the error is less than 1%):

28,500 / (28,500 + 10,000) = 74%
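The arithmetic above is easy to script; a short sketch using the hypothetical vendor numbers from this example:

```python
predicted_relevant = 30_000
population = 1_000_000
precision = 0.95             # vendor-claimed precision
false_negative_rate = 0.01   # vendor-claimed fraction of all docs that
                             # are relevant but predicted non-relevant

found = precision * predicted_relevant     # 28,500
missed = false_negative_rate * population  # 10,000 -- applies to the
                                           # whole population, not just
                                           # the 30,000 predicted docs
recall = found / (found + missed)          # ~0.74

print(f"found={found:.0f} missed={missed:.0f} recall={recall:.1%}")
```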

That recall might be acceptable, but it isn’t great.  The seemingly small 1% has a big impact because it applies to the entire document population, not just the relatively small number of documents that were predicted to be relevant.  That 1% was estimated using sampling, so there is uncertainty in the value (there may also be uncertainty in the 95% precision value, but I’m going to ignore that for simplicity).  If 1,600 documents were sampled and 16 were found to be false negatives, the 95% confidence interval would seem to go from -1.5% to +3.5%.  How can the percentage be negative?  It can’t.  The +/- 2.5% interval is actually designed to accommodate the worst-case scenario, which occurs when 50% of the documents have the property being measured.  When the percentage of documents having the property is very far from 50%, as is often the case in e-discovery, the 95% confidence interval is smaller and is not centered on the estimated value.  Equations for the confidence interval in such situations involve approximations that won’t always be appropriate for the examples in this article, so I recommend using an exact confidence interval calculator instead of equations that can give wrong results when misapplied.  With 95% confidence the interval is 0.57% to 1.62%, so the number of relevant documents missed is between 5,700 and 16,200 with 95% confidence.  That means the recall is between 64% and 83% with 95% confidence.
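The exact interval quoted here is the Clopper-Pearson interval, which any exact binomial calculator will produce; a self-contained sketch in plain Python (bisection on the binomial tail, which needs only the first few terms of the CDF because the observed count is small):

```python
def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), built from the recursive pmf
    relation; only k+1 terms are needed, so this is fast for small k."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    pmf = (1.0 - p) ** n
    total = pmf
    for i in range(k):
        pmf *= (n - i) / (i + 1) * p / (1.0 - p)
        total += pmf
    return min(total, 1.0)

def clopper_pearson(k, n, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for k successes out
    of n trials, found by bisection; returns (lower, upper)."""
    alpha = 1.0 - confidence

    def solve(target, tail_k):
        lo, hi = 0.0, 1.0
        for _ in range(200):  # binom_cdf is decreasing in p
            mid = (lo + hi) / 2.0
            if binom_cdf(tail_k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = 0.0 if k == 0 else solve(1.0 - alpha / 2.0, k - 1)
    upper = 1.0 if k == n else solve(alpha / 2.0, k)
    return lower, upper

lo, hi = clopper_pearson(16, 1600)  # 16 false negatives in 1,600 sampled
print(f"false negative rate: {lo:.2%} to {hi:.2%}")  # ~0.57% to ~1.62%

found = 28_500
print(f"recall: {found / (found + hi * 1_000_000):.0%} "
      f"to {found / (found + lo * 1_000_000):.0%}")  # ~64% to ~83%
```

The same function reproduces the 0.27% to 2.54% interval for 4 false negatives in a 400-document sample.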

What if the vendor based the 1% false negative number on a sample of 400 documents instead of 1,600?  The confidence interval would be 0.27% to 2.54%, so the recall would be between 53% and 91% with 95% confidence (if you prefer the one-tail confidence interval, the upper bound is 2.27% giving a minimum recall of 56%).  If all we have to go on is a 1% false negative number based on a 400 document sample, we cannot assume that we’ve found much more than half of the relevant documents!  Not only does the 1% false negative number have a big impact, but it has a big error bar compared to the value itself (1% might really be 2.54%), so the worst case scenario is pretty ugly.
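Plugging the endpoints of that 400-document interval into the same recall formula (found / (found + missed)), with the 28,500 found documents from the earlier example:

```python
found = 28_500
population = 1_000_000

# endpoints of the exact 95% CI for the false negative rate,
# for 4 false negatives observed in a 400-document sample
for fn_rate in (0.0027, 0.0254):
    missed = fn_rate * population
    print(f"fn rate {fn_rate:.2%} -> recall {found / (found + missed):.0%}")
# prints ~91% for the low endpoint and ~53% for the high endpoint
```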

Beware recall values lacking confidence intervals:  Suppose the vendor claims that 98% recall was achieved in the example above.  Is that plausible in light of the 1% false negative number?  Before diving into the math, it should be said that 98% recall is very high.  The closer you get to 100% recall, the harder it is to find additional relevant documents without having to wade through a lot of non-relevant documents because all the relevant documents that are easy to identify have already been found.  Achieving 95% precision at 98% recall, as claimed, would be close to perfection.  So, how big is the error bar for that 98% recall?

Without knowing the details of how the vendor calculated the recall, let’s try to come up with something plausible.  If the vendor set aside a control set (a random sample of documents that were reviewed but not used for training) of 1,600 documents to monitor the system’s ability to make good predictions as training progressed, and 50 of those documents were relevant and the system correctly predicted that 49 were relevant (so it missed just one), the recall estimate would be 49/50 = 98%.  Turning to the confidence interval calculator and keeping in mind that our sample size is 50, not 1,600, because we’re estimating the proportion of the sampled relevant documents (there are only 50) that were correctly predicted to be relevant, we find with 95% confidence the range for the recall is from 89.3% to 99.9%.  So, the claimed 98% recall might only be 89%.  The recall estimate from the control set barely overlaps with the 53% to 91% recall range implied by a false negative number of 1% based on 400 sample documents, and definitely isn’t consistent with a 1% false negative number if that number was measured using 1,600 sample documents.

Looking at it from a different angle, if 1% of predictions are false negatives and the control set contains 1,600 documents, you would expect the control set to contain about sixteen relevant documents that the system predicted were non-relevant, but a claim of 98% recall implies that there was only one such document, not sixteen.  It’s hard to see the 1% false negative number as being consistent with 98% recall.  We’ve been using a 95% confidence level, which means that 5% of the time the confidence interval we compute from the data will fail to capture the real value, so when recall estimates come from different samples (the control set and the sample used to measure false negatives), inconsistent results could mean that one value is simply wrong.  Which one, though?
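With only 50 relevant documents in the control set, the exact 95% interval for 49 of 50 can be checked directly: the upper bound solves P(X <= 49 | p) = 1 - p^50 = 0.025 in closed form, and the lower bound solves P(X >= 49 | p) = 50*p^49*(1-p) + p^50 = 0.025 by a short bisection.  A sketch:

```python
def upper_bound():
    # P(X <= 49 | p) = 1 - p**50 = 0.025  =>  p = 0.975 ** (1/50)
    return 0.975 ** (1 / 50)

def lower_bound():
    # solve P(X >= 49 | p) = 50*p**49*(1-p) + p**50 = 0.025;
    # the tail probability is increasing in p, so bisect on [0, 1]
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        tail = 50 * mid**49 * (1 - mid) + mid**50
        if tail < 0.025:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"recall 49/50: 95% CI {lower_bound():.1%} to {upper_bound():.1%}")
```

This reproduces the roughly 89.3% to 99.9% range quoted above.
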
The bottom line is that there are several ways to estimate recall, and all available numbers should be tested for consistency.  Without a confidence interval, nobody will know how meaningful the recall estimate really is.