Substantial Reduction in Review Effort Required to Demonstrate Adequate Recall

Measuring the recall achieved to within +/- 5% to demonstrate that a production is defensible can require reviewing a substantial number of random documents.  For a case of modest size, the amount of review required to measure recall can be larger than the amount of review required to actually find the responsive documents with predictive coding.  This article describes a new method requiring much less document review to demonstrate that adequate recall has been achieved.  This is a brief overview of a more detailed paper I’ll be presenting at the DESI VII Workshop on June 12th (slides available here).

The proportion of a population having some property can be estimated to within +/- 5% (at 95% confidence) by measuring the proportion on a random sample of 400 documents (you’ll also see the number 385 being used, but using 400 will make it easier to follow the examples).  To measure recall we need to know what proportion of responsive documents are produced, so we need a sample of 400 random responsive documents.  Since we don’t know which documents in the population are responsive, we have to select documents randomly and review them until 400 responsive ones are found.  If prevalence is 10% (10% of the population is responsive), that means reviewing roughly 4,000 documents to find 400 that are responsive so that recall can be estimated.  If prevalence is 1%, it means reviewing roughly 40,000 random documents to measure recall.  This can be quite a burden.
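The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the sample-size formula assumes the usual normal approximation at 95% confidence (z = 1.96) with the worst-case proportion of 50%, which is where the 385 figure comes from.

```python
import math

def sample_size(margin=0.05, z=1.96, p=0.5):
    """Worst-case sample size to estimate a proportion to within +/- margin.

    Assumes the normal approximation; z = 1.96 corresponds to 95% confidence.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def expected_review(responsive_needed, prevalence):
    """Rough number of random documents to review to find the responsive ones."""
    return responsive_needed / prevalence

print(sample_size())                      # 385
print(round(expected_review(400, 0.10)))  # 4000
print(round(expected_review(400, 0.01)))  # 40000
```

The review burden scales inversely with prevalence, which is why low-prevalence cases are the painful ones.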

Once recall is measured, a decision must be made about whether it is high enough.  Suppose you decide that if at least 300 of the 400 random responsive documents were produced (75%) the production is acceptable.  For any actual level of recall, the probability of accepting the production can be computed (see figure to right).  The probability of accepting a production where the actual recall is less than 70% will be very low, and the probability of rejecting a production where the actual recall is greater than 80% will also be low.  This follows from the fact that a sample of 400 responsive documents is sufficient to measure recall to within +/- 5%.
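For the single-stage test just described, the acceptance probability is a binomial tail sum. The sketch below assumes each sampled responsive document is produced independently with probability equal to the actual recall; the function name is chosen for this example only.

```python
from math import comb

def accept_probability(recall, n=400, threshold=300):
    """P(at least `threshold` of `n` random responsive documents were produced)."""
    return sum(
        comb(n, k) * recall ** k * (1 - recall) ** (n - k)
        for k in range(threshold, n + 1)
    )

for recall in (0.65, 0.70, 0.75, 0.80, 0.85):
    print(f"actual recall {recall:.0%}: P(accept) = {accept_probability(recall):.4f}")
```

Running it shows the profile described above: acceptance is unlikely at 70% recall, close to a coin flip at 75%, and nearly certain at 80%.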

The idea behind the new method is to achieve the same probability profile for accepting/rejecting a production using a multi-stage acceptance test.  The multi-stage test gives the possibility of stopping the process and declaring the production accepted/rejected long before reviewing 400 random responsive documents.  The procedure is shown in the flowchart to the right.  A decision may be reached after reviewing enough documents to find just 25 random documents that are responsive.  If a decision isn’t made after reviewing 25 responsive documents, review continues until 50 responsive documents are found and another test is applied.  At worst, documents will be reviewed until 400 responsive documents are found (the same as the traditional direct recall estimation method).
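The control flow of a multi-stage test can be sketched as follows. Important caveat: the stage sizes beyond 25, 50 and 400 and all of the early accept/reject thresholds in `STAGES` are hypothetical placeholders invented for this illustration. They are not the values from the flowchart or the paper, and they have not been calibrated to reproduce the 400-document probability profile. Only the final rule (accept if at least 300 of 400 were produced) comes from the text.

```python
from itertools import islice

# (responsive docs reviewed so far, reject if produced <= this, accept if produced >= this)
# HYPOTHETICAL thresholds for illustration only; use the paper's tables in practice.
STAGES = [
    (25, 14, 24),
    (50, 31, 45),
    (100, 67, 85),
    (200, 142, 160),
    (400, 299, 300),  # final stage: at least 300 of 400 means accept
]

def multistage_test(outcomes, stages=STAGES):
    """Run the staged accept/reject test.

    `outcomes` yields True/False for each random responsive document,
    indicating whether that document was produced.
    Returns (decision, number of responsive documents reviewed).
    """
    it = iter(outcomes)
    produced = reviewed = 0
    for n, reject_max, accept_min in stages:
        # Review only the additional documents needed to reach this stage.
        for was_produced in islice(it, n - reviewed):
            produced += bool(was_produced)
            reviewed += 1
        if reviewed < n:
            raise ValueError("not enough sampled documents to reach the next stage")
        if produced >= accept_min:
            return "accept", reviewed
        if produced <= reject_max:
            return "reject", reviewed
    raise ValueError("the final stage must force a decision")

print(multistage_test([True] * 400))                       # ('accept', 25)
print(multistage_test([False] * 400))                      # ('reject', 25)
print(multistage_test([True, True, True, False] * 100))    # ('accept', 400)
```

The last call has exactly 75% of documents produced at every stage, so no early boundary is hit and the decision is deferred to the full 400.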

The figure to the right shows six examples of the multi-stage acceptance test being applied when the actual recall is 85%.  Since 85% is well above the 80% upper bound of the 75% +/- 5% range, we expect this production to virtually always be accepted.  The figure shows that acceptance can occur long before reviewing a full 400 random responsive documents.  The number of random responsive documents reviewed is shown on the vertical axis.  Toward the bottom of the graph the sample is very small and the percentage of the sample that has been produced may deviate greatly from the true value of 85%.  As you go up the sample gets larger and the proportion of the sample that is produced is expected to get closer to 85%.  When a green decision boundary is touched, causing the production to be accepted as having sufficiently high recall, the color of the remainder of the path is changed to yellow.  The yellow part represents the document review that is avoided by using the multi-stage acceptance method (since the traditional direct recall measurement would involve going all the way to 400 responsive documents).  As you can see, when the actual recall is 85% the number of random responsive documents that must be reviewed is often 50 or 100, not 400.

The figure to the right shows the average number of documents that must be reviewed using the multi-stage acceptance procedure from the earlier flowchart.  The amount of review required can be much less than 400 random responsive documents.  In fact, the further above/below the 75% target (called the “splitting recall” in the paper) the actual recall is, the less document review is required (on average) to come to a conclusion about whether the production’s recall is high enough.  This creates an incentive for the producing party to aim for recall that is well above the minimum acceptable level since it will be rewarded with a reduced amount of document review to confirm the result is adequate.
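The shape of that effort curve can be explored with a small Monte Carlo simulation. As a loud caveat, the stage table below uses hypothetical thresholds made up for this sketch (only the final 300-of-400 rule is from the text), so the numbers it prints will not match the figure or the paper. The qualitative pattern is the point: average effort should peak near the 75% splitting recall and drop on either side.

```python
import random

# (responsive docs reviewed so far, reject if produced <= this, accept if produced >= this)
# HYPOTHETICAL thresholds for illustration only; not the paper's values.
STAGES = [(25, 14, 24), (50, 31, 45), (100, 67, 85), (200, 142, 160), (400, 299, 300)]

def responsive_docs_needed(recall, rng, stages=STAGES):
    """Simulate one run; return how many responsive documents were reviewed."""
    produced = reviewed = 0
    for n, reject_max, accept_min in stages:
        # Each newly reviewed responsive document was produced with probability `recall`.
        produced += sum(rng.random() < recall for _ in range(n - reviewed))
        reviewed = n
        if produced >= accept_min or produced <= reject_max:
            break
    return reviewed

def average_effort(recall, trials=1000, seed=0):
    """Average number of responsive documents reviewed over many simulated runs."""
    rng = random.Random(seed)
    return sum(responsive_docs_needed(recall, rng) for _ in range(trials)) / trials

for recall in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(f"actual recall {recall:.0%}: about {average_effort(recall):.0f} responsive documents")
```

Note that this counts responsive documents only. The total number of documents reviewed is larger by roughly a factor of one over the prevalence.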

It is important to note that the multi-stage procedure provides an accept/reject result, not a recall estimate.  If you follow the procedure until an accept/reject boundary is hit and then use the proportion of the sample that was produced as a recall estimate, that estimate will be biased.  (The “unbiased” in the paper title refers to the sampling being done on the full population rather than on a subset, such as the discard set, which would cause a bias due to inconsistency in review of different subsets.)

You may want to use a splitting recall other than 75% for the accept/reject decision — the full paper provides tables of values necessary for doing that.

5 thoughts on “Substantial Reduction in Review Effort Required to Demonstrate Adequate Recall”

  1. Matthew

    Fascinating insights, Bill. I hope you’ll provide some thoughts on the DESI VII workshop for those of us who won’t be able to attend.

    A few queries:
    1) In your model, is the assumption that the producing party is conducting the sampling?
    2) Is it also assumed that the sampling is against responsive but not sensitive (i.e., not privileged, not confidential and unredacted) documents?
    3) Is there also an assumption that an entire family of documents (i.e., host email and attachments) is produced? For example, what happens if a non-responsive attachment is produced because its host is responsive, and then that non-responsive attachment (but not its responsive host) is included in the sample set?
    4) Does the type/form of document (email, spreadsheet, image, document, presentation) have an influence on the sample size – my point being I have a suspicion that different types of engines that are used perform differently depending upon the type of document.


    1. Bill Dimm


      I’ll post some sort of summary from DESI, but I’ll probably only cover things that are not already covered in the papers posted on the DESI site. Regarding your other questions:

      1) The math doesn’t make any assumptions about who is doing the sampling, but as a practical matter it would probably be the producing party since it would involve review of a lot of non-responsive documents (in order to find the responsive ones) that the producing party would probably not want to disclose.

      2) You could do it either way. If you exclude priv docs (I’m not sure redacted should be lumped in with priv here), you’re testing the percentage of docs you were required to produce that you actually produced, which is certainly sensible. If you include priv docs that are responsive, you’re testing the percentage of all responsive (even if priv) docs that were correctly identified (e.g., via predictive coding or keyword search) as being responsive, which is perhaps slightly less relevant. The two percentages should be the same unless there is correlation between priv and your ability to identify responsive documents.

      3) This is really a matter of choice — you could either sample individual documents or you could sample individual families (treat the family as being produced if any document within the family is responsive). In other words, you could be testing recall at the document level or at the family level. My terminology is perhaps a little confusing because I talk about recall being the proportion of responsive documents that were produced — I used “produced” rather than “predicted to be responsive” to avoid implying that this method only applies to predictive coding (or TAR) — it could also be used if documents were selected based on keyword search. Unfortunately, “produced” is probably not the clearest term to use if you’re talking about families. If you want to work with families, just think of sampling as choosing families instead of choosing documents.

      4) This depends on what your goal is. If your goal is “I want to produce at least X% of responsive documents in total, regardless of type,” then you just ignore the document type when sampling. On the other hand, if your goal is “I want to produce at least Y% of responsive documents of each type” in order to avoid a production that is potentially over-representing some document types at the expense of others, then you have to apply the procedure to each document type separately (which means more document review). You could argue that it is reasonable for Y to be somewhat smaller than X above, which could reduce the burden somewhat (since the amount of sampling will decrease the farther above the target you are).

      1. Matthew

        Thanks for the speedy responses, Bill.

        When I read the paper, I’d assumed that it was an additional model to be put in place after production, hence my question about which party would be using it. So you’re envisaging that this is put in place as a final ‘sanity check’ against the proposed production, and so may be put in place on top of other rounds of TAR.

      2. Bill Dimm


        Yes, this is something you would do after identifying the documents that will be produced (via TAR or other method) in order to confirm that the recall you will achieve is high enough. I would avoid calling it a model since that could be taken to imply that it’s leveraging some set of assumptions (the model) to improve efficiency. That’s not the case — this is not making any assumptions beyond those that are made when you estimate recall by taking a random sample of 400 responsive docs. It is, however, doing something slightly different — instead of estimating recall and making an accept/reject decision based on the estimate, it skips directly to the accept/reject decision.
