There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review (TAR). This article shows why eRecall requires sampling and reviewing just as many documents as the Direct Recall method if you want to achieve the same level of certainty in the result.
Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant
Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant
It requires estimating two quantities via sampling: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR uses only random sampling for training, you may already have a very good estimate of the prevalence of relevant documents in the full population simply by measuring it on your (potentially large) training set; multiplying the prevalence by the total number of documents gives TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents and measure the percentage of them that are relevant, i.e. FractionMissed (commonly known as the false omission rate, or elusion). How many samples are needed?
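To make the arithmetic concrete, here is a minimal sketch in Python; the function name and the example numbers are mine (hypothetical), not from any particular TAR tool:

```python
def e_recall(total_docs, prevalence, docs_culled, fraction_missed):
    """Point estimate of eRecall from the two sampled quantities.

    prevalence      -- estimated fraction of relevant docs in the full population
    fraction_missed -- estimated fraction of the culled docs that are relevant
                       (the false omission rate, or elusion)
    """
    total_relevant = prevalence * total_docs
    relevant_missed = fraction_missed * docs_culled
    return 1.0 - relevant_missed / total_relevant

# Hypothetical example: 1,000,000 docs, 1% prevalence, 97% culled, 0.25% elusion
print(e_recall(1_000_000, 0.01, 970_000, 0.0025))  # a bit under 76% recall
```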
To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities. Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents. For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents. With this approximation, we have:
eRecall = 1 – FractionMissed / Prevalence
It is the very small prevalence value in the denominator that is the killer: it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.
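To see the amplification numerically, here is a back-of-the-envelope sketch with made-up numbers (the simplified equation above, evaluated at the two ends of a hypothetical uncertainty range):

```python
prevalence = 0.01  # 1% of the collection is relevant

# Hypothetical uncertainty: FractionMissed known only to within 0.15% to 0.35%
fm_low, fm_high = 0.0015, 0.0035

recall_high = 1 - fm_low / prevalence
recall_low = 1 - fm_high / prevalence

# A 0.2-point spread in FractionMissed becomes a 20-point spread in eRecall
print(recall_low, recall_high)  # roughly 0.65 and 0.85
```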
Let’s try some specific numbers. Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%. Measuring FractionMissed should then give a result of about 0.25% if we take a big enough sample to be reasonably accurate. Suppose we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%). Using an exact confidence interval calculator (to avoid the bad results that approximate methods give at extreme values, as I advocated in a previous article), the 95% confidence interval for FractionMissed would be:
FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)
Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence. Not a very tight error bar!
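The interval above can be reproduced with a short script. This is a sketch of the standard Clopper-Pearson (exact binomial) interval, computed by bisection; it is my own implementation, not the particular calculator used here, but it should agree to the precision shown:

```python
from math import exp, lgamma, log, log1p

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms summed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    return sum(
        exp(lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
            + j * log(p) + (n - j) * log1p(-p))
        for j in range(k + 1)
    )

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion k/n."""
    def solve(k_, target):
        # binom_cdf(k_, n, p) decreases as p grows; bisect for the crossing point
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k_, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper

# 10 relevant documents in a 4,000-document sample of the culled set
lo, hi = clopper_pearson(10, 4000)
print(f"FractionMissed: {lo:.2%} to {hi:.2%}")  # about 0.12% to 0.46%

# Feed the endpoints through eRecall = 1 - FractionMissed / Prevalence
prevalence = 0.01
print(f"eRecall: {1 - hi / prevalence:.0%} to {1 - lo / prevalence:.0%}")  # about 54% to 88%
```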
If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:
FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)
Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.
For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall). Using the calculator with a sample size of 400 and 300 relevant (“relevant” here meaning correctly identified by the TAR system) gives a recall range of 70.5% to 79.2%.
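Both of these intervals can be checked the same way. The block below applies an exact-interval computation (a self-contained Clopper-Pearson sketch via bisection; my own code, not the article’s calculator) to the 40,000-sample elusion estimate and to the Direct Recall sample:

```python
from math import exp, lgamma, log, log1p

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms summed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    return sum(
        exp(lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
            + j * log(p) + (n - j) * log1p(-p))
        for j in range(k + 1)
    )

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion k/n."""
    def solve(k_, target):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisect; the CDF decreases as p grows
            mid = (lo + hi) / 2
            if binom_cdf(k_, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper

# eRecall: 100 relevant in a 40,000-document sample of the culled set
lo, hi = clopper_pearson(100, 40000)
prevalence = 0.01
print(f"eRecall: {1 - hi / prevalence:.0%} to {1 - lo / prevalence:.0%}")  # about 70% to 80%

# Direct Recall: 300 of 400 sampled relevant docs were predicted relevant
lo, hi = clopper_pearson(300, 400)
print(f"Direct recall: {lo:.1%} to {hi:.1%}")  # about 70.5% to 79.2%
```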
So, the number of samples required for eRecall is about the same as for the Direct Recall method if you require a comparable amount of certainty in the result. There’s no free lunch to be found here.