Measuring recall to within +/- 5% to demonstrate that a production is defensible can require reviewing a substantial number of random documents. For a case of modest size, the review required to measure recall can exceed the review required to actually find the responsive documents with predictive coding. This article describes a new method that requires much less document review to demonstrate that adequate recall has been achieved. It is a brief overview of a more detailed paper I’ll be presenting at the DESI VII Workshop on June 12th (slides available here).
The proportion of a population having some property can be estimated to within +/- 5% by measuring the proportion on a random sample of 400 documents (you’ll also see the number 385 used, but 400 makes the examples easier to follow). To measure recall we need to know what proportion of the responsive documents were produced, so we need a random sample of 400 responsive documents. Since we don’t know which documents in the population are responsive, we have to select documents randomly and review them until 400 responsive ones are found. If prevalence is 10% (10% of the population is responsive), that means reviewing roughly 4,000 random documents to find the 400 responsive ones needed to estimate recall. If prevalence is 1%, it means reviewing roughly 40,000 random documents. This can be quite a burden.
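To make the arithmetic concrete, here is a small sketch (my own, not from the paper). The +/- 5% figure is the worst-case 95% margin of error for a proportion, 1.96·sqrt(p(1-p)/n) with p = 0.5, which is also where the 385 comes from (1.96² · 0.25 / 0.05² ≈ 384). The expected review burden is just the sample size divided by prevalence:

```python
import math

def margin_of_error(n, p=0.5):
    """95% margin of error for a proportion estimated from a sample of size n."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

def expected_review(responsive_needed, prevalence):
    """Expected number of random documents reviewed to find the target number
    of responsive ones (the mean of a negative binomial)."""
    return responsive_needed / prevalence

print(f"+/- {margin_of_error(400):.1%}")   # about +/- 4.9%, i.e. roughly 5%
print(expected_review(400, 0.10))          # 4000.0 documents at 10% prevalence
print(expected_review(400, 0.01))          # 40000.0 documents at 1% prevalence
```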
Once recall is measured, a decision must be made about whether it is high enough. Suppose you decide that the production is acceptable if at least 300 of the 400 random responsive documents were produced (75%). For any actual level of recall, the probability of accepting the production can be computed (see figure to right). The probability of accepting a production whose actual recall is below 70% will be very low, and the probability of rejecting a production whose actual recall is above 80% will also be low; this follows from the fact that a sample of 400 responsive documents is sufficient to measure recall to within +/- 5%.
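As a sketch of where that curve comes from: if the actual recall is r, the number of produced documents among the 400 sampled responsive ones is approximately Binomial(400, r), so the acceptance probability is a binomial tail. A minimal illustration (my own code, not from the paper):

```python
from scipy.stats import binom

def acceptance_probability(actual_recall, sample_size=400, accept_threshold=300):
    """P(at least accept_threshold of sample_size sampled responsive
    documents were produced), when the true recall is actual_recall."""
    # sf(k, n, p) is P(X > k), so sf(threshold - 1, ...) is P(X >= threshold)
    return binom.sf(accept_threshold - 1, sample_size, actual_recall)

for r in (0.65, 0.70, 0.75, 0.80, 0.85):
    print(f"actual recall {r:.0%}: P(accept) = {acceptance_probability(r):.3f}")
```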
The idea behind the new method is to achieve the same probability profile for accepting/rejecting a production using a multi-stage acceptance test. The multi-stage test gives the possibility of stopping the process and declaring the production accepted/rejected long before reviewing 400 random responsive documents. The procedure is shown in the flowchart to the right (click to enlarge). A decision may be reached after reviewing only enough documents to find 25 random responsive ones. If no decision is reached after 25 responsive documents, review continues until 50 responsive documents are found and another test is applied. At worst, documents will be reviewed until 400 responsive documents are found (the same as the traditional direct recall estimation method).
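To make the structure concrete, here is a minimal Python sketch of such a procedure. The checkpoint counts match the description above, but the accept/reject boundary values below are placeholders I made up for illustration; the real boundaries come from the tables in the full paper.

```python
# (checkpoint, reject if produced <= this, accept if produced >= this)
# Boundary values are HYPOTHETICAL placeholders, not the paper's tables.
BOUNDARIES = [
    (25, 13, 24),
    (50, 31, 45),
    (100, 68, 85),
    (200, 142, 163),
    (400, 299, 300),  # final stage: a decision is always forced here
]

def multi_stage_test(is_produced):
    """Sequential accept/reject test.

    is_produced: zero-argument callable that simulates drawing the next
    random responsive document and returns True if it was produced.
    Returns (decision, responsive_reviewed, produced_count).
    """
    reviewed = 0
    produced = 0
    for checkpoint, reject_at, accept_at in BOUNDARIES:
        while reviewed < checkpoint:
            reviewed += 1
            produced += int(is_produced())
        if produced >= accept_at:
            return "accept", reviewed, produced
        if produced <= reject_at:
            return "reject", reviewed, produced
    # Unreachable: the final stage's boundaries force a decision.
```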
The figure to the right shows six examples of the multi-stage acceptance test being applied when the actual recall is 85%. Since 85% is well above the 80% upper bound of the 75% +/- 5% range, we expect this production to be accepted virtually every time. The figure shows that acceptance can occur long before a full 400 random responsive documents have been reviewed. The number of random responsive documents reviewed is shown on the vertical axis. Toward the bottom of the graph the sample is very small, and the percentage of the sample that has been produced may deviate greatly from the true value of 85%. As you move up, the sample grows and the proportion of the sample that was produced is expected to settle toward 85%. When a path touches a green decision boundary, the production is accepted as having sufficiently high recall, and the remainder of the path is drawn in yellow; the yellow portion represents the document review that is avoided by using the multi-stage acceptance method, since a traditional direct recall measurement would continue all the way to 400 responsive documents. As you can see, when the actual recall is 85% the number of random responsive documents that must be reviewed is often 50 or 100, not 400.
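The same early-stopping behavior is easy to reproduce in simulation, reusing the multi_stage_test sketch above (so the exact tallies reflect my placeholder boundaries, not the paper's):

```python
import random
from collections import Counter

random.seed(7)
stops = Counter()
for _ in range(10_000):
    decision, reviewed, _ = multi_stage_test(lambda: random.random() < 0.85)
    stops[(decision, reviewed)] += 1

# Tally of where runs stopped; most accept well before 400 reviewed.
for (decision, reviewed), count in sorted(stops.items(), key=lambda kv: kv[0][1]):
    print(f"{decision:>6} at {reviewed:3d} responsive reviewed: {count}")
```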
The figure to the right shows the average number of documents that must be reviewed using the multi-stage acceptance procedure from the earlier flowchart. The amount of review required can be much less than 400 random responsive documents. In fact, the further the actual recall is above or below the 75% target (called the “splitting recall” in the paper), the less document review is required (on average) to reach a conclusion about whether the production’s recall is high enough. This gives the producing party an incentive to aim for recall well above the minimum acceptable level, since doing so is rewarded with a reduced amount of document review to confirm that the result is adequate.
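Again reusing the sketch above with its placeholder boundaries, a quick simulation shows the qualitative shape: the average number of responsive documents reviewed peaks near the splitting recall and falls off on both sides of it.

```python
import random

random.seed(11)
for r in (0.55, 0.65, 0.75, 0.85, 0.95):
    runs = [multi_stage_test(lambda: random.random() < r) for _ in range(2_000)]
    avg_reviewed = sum(reviewed for _, reviewed, _ in runs) / len(runs)
    accept_rate = sum(d == "accept" for d, _, _ in runs) / len(runs)
    print(f"actual recall {r:.0%}: avg responsive reviewed = {avg_reviewed:5.1f}, "
          f"accept rate = {accept_rate:.2f}")
```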
It is important to note that the multi-stage procedure provides an accept/reject result, not a recall estimate. If you follow the procedure until an accept/reject boundary is hit and then use the proportion of the sample that was produced as a recall estimate, that estimate will be biased. (The word “unbiased” in the paper’s title refers to the sampling being done on the full population rather than on a subset, such as the discard set, which would introduce bias due to inconsistency in the review of different subsets.)
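A quick illustration of that bias, once more with the placeholder-boundary sketch above: averaging the stopped proportion produced/reviewed across many runs does not recover the true recall, because stopping is triggered by samples that happen to look extreme at a checkpoint.

```python
import random

random.seed(3)
true_recall = 0.85
estimates = []
for _ in range(20_000):
    _, reviewed, produced = multi_stage_test(lambda: random.random() < true_recall)
    estimates.append(produced / reviewed)

# The mean stopped estimate drifts away from the true 0.85, since early
# stops are disproportionately triggered by extreme-looking samples.
print(sum(estimates) / len(estimates))
```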
You may want to use a splitting recall other than 75% for the accept/reject decision — the full paper provides tables of values necessary for doing that.