There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.

Here is the equation:

eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant

Rearranging a little:

eRecall = 1 – RelevantDocsMissed / TotalRelevant

= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant

It requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR uses only random sampling for training, you may already have a very good estimate of the prevalence of relevant documents in the full population simply by measuring it on your (potentially large) training set; multiplying the prevalence by the total number of documents gives TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents and measure the percentage of them that are relevant, i.e. FractionMissed (commonly known as the **false omission rate** or **elusion**). How many samples do you need?
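To make the arithmetic concrete, here is a minimal sketch of the eRecall calculation with hypothetical numbers (a one-million-document collection at 1% prevalence, 970,000 documents culled, and an elusion of 0.25% measured on the culled set — none of these figures come from a real matter):

```python
def e_recall(total_relevant, fraction_missed, total_culled):
    """eRecall = 1 - (estimated relevant docs missed) / (estimated total relevant)."""
    relevant_missed = fraction_missed * total_culled
    return 1 - relevant_missed / total_relevant

# Hypothetical example: 1,000,000 documents at 1% prevalence,
# 970,000 culled, elusion (FractionMissed) measured at 0.25%.
total_relevant = 0.01 * 1_000_000      # 10,000
print(e_recall(total_relevant, 0.0025, 970_000))  # ≈ 0.7575
```

Note that both inputs are themselves sample-based estimates, which is exactly where the trouble discussed below comes from.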

To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities. Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents. For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents. With this approximation, we have:

eRecall = 1 – FractionMissed / Prevalence

It is the very small prevalence value in the denominator that is the killer: it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.

Let’s try some specific numbers. Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%. Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result. If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):

FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)

Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence. Not a very tight error bar!

If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:

FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)

Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.
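Both interval calculations above can be reproduced without a dedicated calculator. The sketch below implements an exact (Clopper-Pearson) binomial confidence interval by bisection, using only the Python standard library, and then propagates it through the approximate eRecall formula; the counts (10 of 4,000 and 100 of 40,000) are the ones from the examples:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed via the log-pmf for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    def bisect(pred, lo, hi):
        # pred is False below the boundary probability and True above it.
        for _ in range(60):
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) >= alpha / 2, 0.0, k / n)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) <= alpha / 2, k / n, 1.0)
    return lower, upper

prevalence = 0.01
for k, n in [(10, 4_000), (100, 40_000)]:
    lo, hi = clopper_pearson(k, n)
    # eRecall ~ 1 - FractionMissed / Prevalence, so the interval flips:
    print(f"n={n}: FractionMissed {lo:.2%} to {hi:.2%}, "
          f"eRecall {1 - hi / prevalence:.0%} to {1 - lo / prevalence:.0%}")
```

Running this reproduces the roughly 54%–88% range at 4,000 samples and the roughly 70%–80% range at 40,000 samples.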

For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall). Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.
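The Direct Recall interval can be checked the same way: treat the 300 correctly-predicted documents out of the 400 sampled relevant ones as binomial successes and compute an exact interval. The Clopper-Pearson helper below is a stdlib-only sketch (a bisection on the binomial CDF), not the particular calculator used in the article:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed via the log-pmf for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    def bisect(pred, lo, hi):
        for _ in range(60):
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) >= alpha / 2, 0.0, k / n)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) <= alpha / 2, k / n, 1.0)
    return lower, upper

lo, hi = clopper_pearson(300, 400)
# The article's exact calculator reported 70.5% to 79.2%.
print(f"Direct Recall: {lo:.1%} to {hi:.1%}")
```

Here the interval applies to recall directly, with no small prevalence in a denominator to amplify it.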

So, the number of samples required for eRecall is about the same as the Direct Recall method if you require a comparable amount of certainty in the result. There’s no free lunch to be found here.

John Tredennick: A nice article on a subject I plan to write about as well. It seems that proving the number of relevant documents in the discard pile is a lot trickier than many people think. Or that you have to do a lot more sampling than most would want to do, or think makes sense. Thanks.

William Webber: Bill,

A great summary of the weaknesses in a proposed estimation method that is getting a surprising amount of publicity. Another thing to point out is that, in statistical terms, the original estimate of prevalence is no longer valid once you’ve actually started looking at documents in the collection. To take an extreme (but by no means impossible) case, you might have found more relevant documents in your review than the original prevalence estimate stated; then, clearly, the prevalence estimate is not meaningful. You’ve also got the problem that the prevalence sample and the null set sample have overlapping populations (the latter is a subset of the former); how do you reconcile them if they give different estimates of null set prevalence? I sincerely hope that no-one is using this ratio estimate method in practice and claiming that it is giving statistically valid estimates (that is, unless they’ve employed a very smart statistician to figure out the complex interactions that it involves).