
Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review

A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding.  This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.

The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review.  It recaps the Da Silva Moore v. Publicis Groupe case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.

Paragraph 33 enumerates ten reasons for approving predictive coding.  The second reason on the list is:

There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.

The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results.  If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled “H.” TAR systems analyzed by Grossman and Cormack are “UW” and “H5.” Error bars are 95% confidence intervals.

If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents with relevance scores above an appropriately calculated cutoff.  Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents, or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the many non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes).  If it is possible, with enough effort, to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results?  TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery).  Effectively, participants were told to choose their relevance score cutoffs to balance the desire for high recall against the desire for high precision.  If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.
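
To make the cutoff trade-off concrete, here is a minimal sketch with purely synthetic relevance scores (the score distributions, prevalence, and population size are my own assumptions, not TREC data).  It compares the cutoff that maximizes F1 with the cutoff needed to reach 75% recall for a weak system:

```python
# Synthetic illustration: for a weak system, the max-F1 cutoff gives low recall,
# while reaching 75% recall means producing or reviewing far more documents.
import numpy as np

rng = np.random.default_rng(0)
n_docs, prevalence = 100_000, 0.05
relevant = rng.random(n_docs) < prevalence

# Weakly performing system: the relevant and non-relevant score distributions overlap heavily.
scores = np.where(relevant, rng.normal(0.60, 0.25, n_docs), rng.normal(0.45, 0.25, n_docs))

order = np.argsort(-scores)                      # rank documents, highest score first
hits = np.cumsum(relevant[order])                # relevant documents found at each cutoff depth
depth = np.arange(1, n_docs + 1)
recall = hits / relevant.sum()
precision = hits / depth
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

best_f1 = int(f1.argmax())                       # the cutoff a TREC 2009 participant would aim for
target = int(np.searchsorted(recall, 0.75))      # shallowest cutoff reaching 75% recall
print(f"max-F1 cutoff:     {depth[best_f1]:>7,} docs above cutoff, recall {recall[best_f1]:.0%}")
print(f"75%-recall cutoff: {depth[target]:>7,} docs above cutoff, precision {precision[target]:.0%}")
```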

The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen).  The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article).  Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around.  You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall.  The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.

The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor.  Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search.  If we assume the high end of the price range applies to 3 million documents, that works out to $0.22 per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.
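
As a quick sanity check on that per-document figure (the exchange rate is my assumption, roughly the GBP/USD rate around the time of the decision):

```python
# High end of the quoted software cost, spread over the full document population.
cost_gbp, docs, usd_per_gbp = 469_049, 3_000_000, 1.4   # exchange rate is an assumption
print(f"£{cost_gbp / docs:.3f} (~${cost_gbp / docs * usd_per_gbp:.2f}) per document")   # ≈ £0.156 (~$0.22)
```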

TAR 3.0 Performance

This article reviews TAR 1.0, 2.0, and the new TAR 3.0 workflow.  It then compares performance on seven categorization tasks of varying prevalence and difficulty.  You may find it useful to read my article on gain curves before reading this one.

In some circumstances it may be acceptable to produce documents without reviewing all of them.  Perhaps it is expected that there are no privileged documents among the custodians involved, or maybe it is believed that potentially privileged documents will be easy to find via some mechanism like analyzing email senders and recipients.  Maybe there is little concern that trade secrets or evidence of bad acts unrelated to the litigation will be revealed if some non-relevant documents are produced.  In such situations you are faced with a dilemma when choosing a predictive coding workflow.  The TAR 1.0 workflow allows documents to be produced without review, so there is potential for substantial savings if TAR 1.0 works well for the case in question, but TAR 1.0 sometimes doesn’t work well, especially when prevalence is low.  TAR 2.0 doesn’t really support producing documents without reviewing them, but it is usually much more efficient than TAR 1.0 if all documents that are predicted to be relevant will be reviewed, especially if the task is difficult or prevalence is low.

TAR 1.0 involves a fair amount of up-front investment in reviewing control set documents and training documents before you can tell whether it is going to work well enough to produce a substantial number of documents without reviewing them.  If you find that TAR 1.0 isn’t working well enough to avoid reviewing documents that will be produced (too many non-relevant documents would slip into the production) and you resign yourself to reviewing everything that is predicted to be relevant, you’ll end up reviewing more documents with TAR 1.0 than you would have with TAR 2.0.  Switching from TAR 1.0 to TAR 2.0 midstream is less efficient than starting with TAR 2.0. Whether you choose TAR 1.0 or TAR 2.0, it is possible that you could have done less document review if you had made the opposite choice (if you know up front that you will have to review all documents that will be produced due to the circumstances of the case, TAR 2.0 is almost certainly the better choice as far as efficiency is concerned).

TAR 3.0 solves the dilemma by providing high efficiency regardless of whether or not you end up reviewing all of the documents that will be produced.  You don’t have to guess which workflow to use and suffer poor efficiency if you are wrong about whether or not producing documents without reviewing them will be feasible.  Before jumping into the performance numbers, here is a summary of the workflows (you can find some related animations and discussion in the recording of my recent webinar):

TAR 1.0 involves a training phase followed by a review phase with a control set being used to determine the optimal point when you should switch from training to review.  The system no longer learns once the training phase is completed.  The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant.  The control set documents are not used to train the system.  They are used to assess the system’s predictions so training can be terminated when the benefits of additional training no longer outweigh the cost of additional training.  Training can be with randomly selected documents, known as Simple Passive Learning (SPL), or it can involve documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).
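
Here is a toy, runnable sketch of the TAR 1.0 (SPL) structure.  Everything in it is a stand-in for illustration: the data is synthetic, human review is simulated by looking up ground-truth labels, and the control-set stopping rule is reduced to stopping when F1 on the control set stops improving.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(20_000, 20))                                   # synthetic document features
truth = X[:, :3].sum(axis=1) + rng.normal(0, 1.5, 20_000) > 2.2     # simulated relevance

# Control set: random documents reviewed once, used to measure the model, never to train it.
control = rng.choice(len(X), 2_000, replace=False)
pool = np.setdiff1d(np.arange(len(X)), control)                     # candidates for training

train_idx, best, stalls = [], -1.0, 0
while stalls < 2:                                   # stop training when the control set stops improving
    batch = rng.choice(pool, 200, replace=False)    # SPL: training documents chosen at random
    pool = np.setdiff1d(pool, batch)
    train_idx.extend(batch.tolist())
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], truth[train_idx])
    score = f1_score(truth[control], model.predict(X[control]))
    stalls, best = (stalls + 1, best) if score <= best else (0, score)

# Review phase: learning has stopped; review (or produce) everything predicted relevant.
reviewed = set(train_idx) | set(control.tolist())
review_phase = [i for i in np.flatnonzero(model.predict(X)) if i not in reviewed]
print(f"control + training reviewed: {len(reviewed):,}   review phase: {len(review_phase):,}")
```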

TAR 2.0 uses an approach called Continuous Active Learning (CAL), meaning that there is no separation between training and review–the system continues to learn throughout.  While many approaches may be used to select documents for review, a significant component of CAL is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions.  Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low.  Since there is no separation between training and review, TAR 2.0 does not require a control set.  Generating a control set can involve reviewing a large (especially when prevalence is low) number of non-relevant documents, so avoiding control sets is desirable.
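
And a similarly toy sketch of the CAL loop (again synthetic data and simulated review; the batch size, the single-document seed, and the stopping rule are arbitrary choices for illustration, not any product’s implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(20_000, 20))                                   # synthetic document features
truth = X[:, :3].sum(axis=1) + rng.normal(0, 1.0, 20_000) > 3.0     # simulated relevance, low prevalence

labels = {int(np.flatnonzero(truth)[0]): True}      # seed set: a single relevant document
model = None
while True:
    idx = np.array(list(labels))
    y = np.array([labels[i] for i in idx])
    if len(set(y.tolist())) > 1:                    # need both classes before training
        model = LogisticRegression(max_iter=1000).fit(X[idx], y)
        scores = model.predict_proba(X)[:, 1]
    else:
        scores = rng.random(len(X))                 # bootstrap with a random batch
    scores[idx] = -1.0                              # never re-review a document
    batch = np.argsort(-scores)[:100]               # review the 100 most-likely-relevant documents
    labels.update({int(i): bool(truth[i]) for i in batch})   # reviewers' calls become new training data
    if model is not None and sum(truth[batch]) < 5: # simple stopping rule: the batches have gone dry
        break

print(f"reviewed {len(labels):,} docs, found {sum(labels.values()):,} of {truth.sum():,} relevant")
```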

TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space.  It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents are reviewed.  Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population.  There is no need for a control set–the system is well-trained when no additional relevant cluster centers can be found. Analysis of the cluster centers that were reviewed provides an estimate of the prevalence and the number of non-relevant documents that would be produced if documents were produced based purely on the predictions without human review.  The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or he/she can decide to review documents that have too much risk of being non-relevant (which can be used as additional training for the system, i.e., CAL).  The key point is that the user has the info he/she needs to make a decision about how to proceed after completing review of the cluster centers that are likely to be relevant, and nothing done before that point becomes invalidated by the decision (compare to starting with TAR 1.0, reviewing a control set, finding that the predictions aren’t good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless).
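
Finally, a rough, runnable approximation of the TAR 3.0 idea, with heavier caveats: k-means centroids stand in for the high-quality conceptual clustering described above, all cluster centers are simply reviewed rather than run through a CAL loop over the centers, and the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 20))                                   # synthetic document features
truth = X[:, :3].sum(axis=1) + rng.normal(0, 1.0, 20_000) > 3.0     # simulated relevance

# Step 1: cluster, and take the document nearest each centroid as that cluster's "center".
km = KMeans(n_clusters=400, n_init=3, random_state=0).fit(X)
centers = pairwise_distances_argmin(km.cluster_centers_, X)         # one document index per cluster

# Step 2: review the cluster centers (simulated by ground truth).  They are a diverse
# training set and also support a prevalence estimate.
center_labels = truth[centers]
print(f"reviewed {len(centers)} cluster centers; estimated prevalence {center_labels.mean():.1%}")

# Step 3: train on the reviewed centers and see what an unreviewed production would contain.
model = LogisticRegression(max_iter=1000).fit(X[centers], center_labels)
predicted = model.predict(X)
print(f"producing without review: {predicted.sum():,} docs, "
      f"{(predicted & ~truth).sum():,} of them non-relevant")
# If that is too dirty, continue with CAL (previous sketch), starting from the
# reviewed centers as the training set -- nothing reviewed so far is wasted.
```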

The table below shows the amount of document review required to reach 75% recall for seven categorization tasks with widely varying prevalence and difficulty.  Performance differences between CAL and non-CAL approaches tend to be larger if a higher recall target is chosen.  The document population is 100,000 news articles without dupes or near-dupes.  “Min Total Review” is the number of documents requiring review (training documents and control set if applicable) if all documents predicted to be relevant will be produced without review.  “Max Total Review” is the number of documents requiring review if all documents predicted to be relevant will be reviewed before production.  None of the results include review of statistical samples used to measure recall, which would be the same for all workflows.

Task                          1        2        3        4        5        6        7
Prevalence                    6.9%     4.1%     2.9%     1.1%     0.68%    0.52%    0.32%

TAR 1.0 SPL
  Control Set                 300      500      700      1,800    3,000    3,900    6,200
  Training (Random)           1,000    300      6,000    3,000    1,000    4,000    12,000
  Review Phase                9,500    4,400    9,100    4,400    900      9,800    2,900
  Min Total Review            1,300    800      6,700    4,800    4,000    7,900    18,200
  Max Total Review            10,800   5,200    15,800   9,200    4,900    17,700   21,100

TAR 3.0 SAL
  Training (Cluster Centers)  400      500      600      300      200      500      300
  Review Phase                8,000    3,000    12,000   4,200    900      8,000    7,300
  Min Total Review            400      500      600      300      200      500      300
  Max Total Review            8,400    3,500    12,600   4,500    1,100    8,500    7,600

TAR 3.0 CAL
  Training (Cluster Centers)  400      500      600      300      200      500      300
  Training + Review           7,000    3,000    6,700    2,400    900      3,300    1,400
  Total Review                7,400    3,500    7,300    2,700    1,100    3,800    1,700
Min Total Review for each task and workflow (chart).

Producing documents without review with TAR 1.0 sometimes results in much less document review than using TAR 2.0 (which requires reviewing everything that will be produced), but sometimes TAR 2.0 requires less review.

Max Total Review for each task and workflow (chart).

The size of the control set for TAR 1.0 was chosen so that it would contain approximately 20 relevant documents, so low prevalence requires a large control set.  Note that the control set size was chosen based on the assumption that it would be used only to measure changes in prediction quality.  If the control set will be used for other things, such as recall estimation, it needs to be larger.
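
The control-set sizes in the table follow from that rule of thumb, roughly 20 divided by prevalence:

```python
# Control set sized to contain roughly 20 relevant documents: about 20 / prevalence documents.
for prevalence in (0.069, 0.041, 0.029, 0.011, 0.0068, 0.0052, 0.0032):
    print(f"prevalence {prevalence:.2%}: about {20 / prevalence:,.0f} control-set documents")
# e.g. 20 / 0.0052 ≈ 3,846, which matches the 3,900 shown for task 6 after rounding.
```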

The number of random training documents used in TAR 1.0 was chosen to minimize the Max Total Review result (see my article on gain curves for related discussion).  This minimizes total review cost if all documents predicted to be relevant will be reviewed and if the cost of reviewing a document is the same in the training phase and the review phase.  If training documents will be reviewed by an expensive subject matter expert and the review phase will be performed by less expensive reviewers, the optimal amount of training will be different.  If documents predicted to be relevant won’t be reviewed before production, the optimal amount of training will also be different (and more subjective), but I kept the training the same when computing Min Total Review values.

The optimal number of training documents for TAR 1.0 varied greatly for different tasks, ranging from 300 to 12,000.  This should make it clear that there is no magic number of training documents that is appropriate for all projects.  This is also why TAR 1.0 requires a control set–the optimal amount of training must be measured.

The results labeled TAR 3.0 SAL come from terminating learning once the review of cluster centers is complete, which is appropriate if documents will be produced without review (Min Total Review).  The Max Total Review value for TAR 3.0 SAL tells you how much review would be required if you reviewed all documents predicted to be relevant but did not allow the system to learn from that review, which is useful to compare to the TAR 3.0 CAL result where learning is allowed to continue throughout.  In some cases where the categorization task is relatively easy (tasks 2 and 5) the extra learning from CAL has no benefit unless the target recall is very high.  In other cases CAL reduces review significantly.

I have not included TAR 2.0 in the table because the efficiency of TAR 2.0 with a small seed set (a single relevant document is enough) is virtually indistinguishable from the TAR 3.0 CAL results that are shown.  Once you start turning the CAL crank the system will quickly head toward the relevant documents that are easiest for the classification algorithm to identify, and feeding those documents back in for training quickly floods out the influence of the seed set you started with.  The only way to change the efficiency of CAL, aside from changing the software’s algorithms, is to waste time reviewing a large seed set that is less effective for learning than the documents that the algorithm would have chosen itself.  The training done by TAR 3.0 with cluster centers is highly effective for learning, so there is no wasted effort in reviewing those documents.

To illustrate the dilemma I pointed out at the beginning of the article, consider task 2.  The table shows that prevalence is 4.1%, so there are 4,100 relevant documents in the population of 100,000 documents.  To achieve 75% recall, we would need to find 3,075 relevant documents.  Some of the relevant documents will be found in the control set and the training set, but most will be found in the review phase.  The review phase involves 4,400 documents.  If we produce all of them without review, most of the produced documents will be relevant (3,075 out of a little more than 4,400).  TAR 1.0 would require review of only 800 documents for the training and control sets.  By contrast, TAR 2.0 (I’ll use the Total Review value for TAR 3 CAL as the TAR 2.0 result) would produce 3,075 relevant documents with no non-relevant ones (assuming no mistakes by the reviewer), but it would involve reviewing 3,500 documents.  TAR 1.0 was better than TAR 2.0 in this case (if producing over a thousand non-relevant documents is acceptable).  TAR 3.0 would have been an even better choice because it required review of only 500 documents (cluster centers) and it would have produced fewer non-relevant documents since the review phase would involve only 3,000 documents.

Next, consider task 6.  If all 9,800 documents in the review phase of TAR 1.0 were produced without review, most of the production would be non-relevant documents since there are only 520 relevant documents (prevalence is 0.52%) in the entire population!  That shameful production would occur after reviewing 7,900 documents for training and the control set, assuming you didn’t recognize the impending disaster and abort before getting that far.  Had you started with TAR 2.0, you could have had a clean (no non-relevant documents) production after reviewing just 3,800 documents.  With TAR 3.0 you would realize that producing documents without review wasn’t feasible after reviewing 500 cluster center documents and you would proceed with CAL, reviewing a total of 3,800 documents to get a clean production.

Task 5 is interesting because production without review is feasible (but not great) with respect to the number of non-relevant documents that would be produced, but TAR 1.0 is so inefficient when prevalence is low that you would be better off using TAR 2.0.  TAR 2.0 would require reviewing 1,100 documents for a clean production, whereas TAR 1.0 would require reviewing 3,000 documents for just the control set!  TAR 3.0 beats them both, requiring review of just 200 cluster centers for a somewhat dirty production.

It is worth considering how the results might change with a larger document population.  If everything else remained the same (prevalence and difficulty of the categorization task), the size of the control set required would not change, and the number of training documents required would probably not change very much, but the number of documents involved in the review phase would increase in proportion to the size of the population, so the cost savings from being able to produce documents without reviewing them would be much larger.

In summary, TAR 1.0 gives the user the option to produce documents without reviewing them, but its efficiency is poor, especially when prevalence is low.  Although the number of training documents required for TAR 1.0 when prevalence is low can be reduced by using active learning (not examined in this article) instead of randomly chosen training documents, TAR 1.0 is still stuck with the albatross of the control set dragging down efficiency.  In some cases (tasks 5, 6, and 7) the control set by itself requires more review labor than the entire document review using CAL.  TAR 2.0 is vastly more efficient than TAR 1.0 if you plan to review all of the documents that are predicted to be relevant, but it doesn’t provide the option to produce documents without reviewing them.  TAR 3.0 borrows some of the best aspects of both TAR 1.0 and TAR 2.0.  When all documents that are candidates for production will be reviewed, TAR 3.0 with CAL is just as efficient as TAR 2.0 and has the added benefits of providing a prevalence estimate and a diverse early view of relevant documents.  When it is permissible to produce some documents without reviewing them, TAR 3.0 provides that capability with much better efficiency than TAR 1.0 due to its efficient training and elimination of the control set.

If you like graphs, the gain curves for all seven tasks are shown below.  Documents used for training are represented by solid lines, and documents not used for training are shown as dashed lines.  Dashed lines represent documents that could be produced without review if that is appropriate for the case.  A green dot is placed at the end of the review of cluster centers–this is the point where the TAR 3.0 SAL and TAR 3.0 CAL curves diverge, but sometimes they are so close together that it is hard to distinguish them without the dot.  Note that review of documents for control sets is not reflected in the gain curves, so the TAR 1.0 results require more document review than is implied by the curves.

Task 1. Prevalence is 6.9%.

Task 2. Prevalence is 4.1%.

Task 3. Prevalence is 2.9%.

Task 4. Prevalence is 1.1%.

Task 5. Prevalence is 0.68%.

Task 6. Prevalence is 0.52%.

Task 7. Prevalence is 0.32%.

Disclosing Seed Sets and the Illusion of Transparency

There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding.  This article explains why disclosing seed sets may provide far less transparency than people think.

The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.

Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling).  Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated.  If the requesting party is a competitor, the non-relevant training documents may reveal information that helps them compete.  Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.

Other people are far more qualified to debate the legal and strategic issues than I am.  Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think.  Some reasons why it does not:

  1. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies.  A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).”  Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”?  I doubt it.
  2. You cannot tell whether the seed set is sufficient if you don’t have access to the full document population.  There could be substantial pockets of important documents that are not represented in the seed set–how would you know?  The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on the number of relevant documents found, not their importance) of the predictions the training set will produce.  The requesting party cannot do that–they have no way of assessing the adequacy of the training set other than wild guessing.
  3. You cannot tell whether the seed set is biased just by looking at it.  Again, if you don’t have access to the full population, how could you know whether some topic or some particular set of keywords is under- or over-represented?  If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance.  Would you notice that all the relevant training documents happen to contain the word “Friday”?  I doubt it.  (A toy illustration of this point follows the list.)
  4. Suppose you see an important document in the seed set that was correctly tagged as being relevant.  Can you rest assured that similar documents will be produced?  Maybe not.  Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document.  I described how that could happen in this article.  How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
  5. Poor training doesn’t necessarily mean that relevant documents will be missed.  If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important.  Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query.  Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced.  A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough.  If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall.  Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
  6. One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents.  Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population?  Is the requesting party going to complain that there is no smoking gun in the training set?  The producing party can simply respond that there are no smoking guns in the full population.
  7. The seed set may have virtually no impact on the final result.  To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion).  If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training.  In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article).  On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used.  With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions.  This is iterated many times.  So all documents that are reviewed are used as training documents.  The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way.  Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
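
To illustrate point 3 above, here is a toy sketch; the mini-corpus and the scikit-learn model are entirely invented, not anyone’s actual seed set.  When every relevant seed was found with “shred AND Friday,” the trained model treats “Friday” as just as strong a relevance signal as “shred,” and nothing visible in the seed set gives the bias away.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set: the relevant documents were all found with "shred AND Friday".
seed_set = [
    ("please shred the documents on friday", 1),
    ("shred all of the old records this friday", 1),
    ("we should shred those files friday afternoon", 1),
    ("quarterly sales figures are attached", 0),
    ("lunch is rescheduled to noon tomorrow", 0),
    ("the server maintenance window is tonight", 0),
]
texts, labels = zip(*seed_set)
vec = CountVectorizer()
model = LogisticRegression().fit(vec.fit_transform(texts), labels)

weights = dict(zip(vec.get_feature_names_out(), model.coef_[0]))
print(f"weight for 'shred':  {weights['shred']:+.3f}")
print(f"weight for 'friday': {weights['friday']:+.3f}")   # indistinguishable from 'shred'
# A relevant document like "destroy the evidence on Monday" contains neither word and
# scores poorly, while harmless Friday chatter scores suspiciously well.
```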

Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand.  It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly.  Recall isn’t the whole story–it measures the number of relevant documents found, not their importance.  It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found.  The point is that you should judge the production by analyzing the system’s output, not the training data that was input.

Comments on Rio Tinto v. Vale and Sample Size

Judge Peck recently issued an opinion in Rio Tinto PLC v. Vale SA, et al, Case 1:14-cv-03042-RMB-AJP where he spent some time reflecting on the state of court acceptance of technology-assisted review (a.k.a. predictive coding).  The quote that will surely grab headlines is on page 2: “In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”  He lists the relevant cases and talks a bit about transparency and disclosing seed sets.  It is certainly worth reading.

Both parties in Rio Tinto v. Vale have agreed to disclose all non-privileged documents, including non-responsive documents, from their control sets, seed sets, and training sets. Judge Peck accepts their protocol because they both agree to it, but hints that disclosing seed sets may not really be necessary (p. 6, “…requesting parties can insure that training and review was done appropriately by other means…”).

I find one other aspect of the protocol the litigants proposed to be worthy of comment.  They make a point of defining a “Statistically Valid Sample” on p. 11 to be one that gives +/- 2% margin of error at 95% confidence, and even provide an equation to compute the sample size in footnote 2.  Their equation gives a sample size of at most 2,395 documents, depending on prevalence.  They then use the “Statistically Valid Sample” term in contexts where it isn’t (as they’ve defined it) directly appropriate.  I don’t know if this is just sloppiness (missing details about what they actually plan to do) or a misunderstanding of statistics.

For example, section 4.a.ii on p. 13 contemplates culling before application of predictive coding, and says they will “Review a Statistically Valid Sample from the Excluded Documents.”  Kudos to them for actually measuring how many relevant documents they are culling instead of just assuming that keyword search results should be good enough without any analysis, but 2,395 documents is not the right sample size.  The more documents you are culling, the more precisely you need to know what proportion of them were relevant in order to have a reasonably precise value for the number of relevant documents culled, which is what matters for computing recall.  In other words, a +/- 2% measurement on the culled set does not mean +/- 2% for recall.  I described a similar situation in more detail in my Predictive Coding Confusion article under the heading “Beware small percentages of large numbers.”  My eRecall: No Free Lunch article also discusses similar issues.
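
To see why, here is a hypothetical worked example (all of the numbers are invented): a million culled documents, 60,000 relevant documents found by the TAR process, and a culled-set sample that comes back at 3% relevant with a +/- 2% margin of error.

```python
# Hypothetical numbers: +/- 2% on the culled set is multiplied by the size of the
# culled set, so the resulting recall estimate is anything but +/- 2%.
culled, found = 1_000_000, 60_000
for culled_rate in (0.01, 0.03, 0.05):                   # sample says 3% relevant, +/- 2%
    missed = culled_rate * culled                        # relevant documents lost to the culling
    recall = found / (found + missed)
    print(f"culled-set rate {culled_rate:.0%}: ~{missed:,.0f} relevant culled, recall ~{recall:.0%}")
# The nominally +/- 2% sample leaves recall anywhere from roughly 55% to 86%.
```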

Section 4.b on p. 13 says that the control set will be a Statistically Valid Sample that will be used to measure prevalence.  They explain in a separate letter to Judge Peck on p. 9 that the control set will be used to track progress by estimating precision and recall. Do they intend to use 2,395 (or fewer) documents for the control set?  Suppose only one of the 2,395 documents is actually relevant.  That would give a prevalence estimate of 0.0011% to 0.2321% with 95% confidence (via this calculator), which is certainly better than the required +/- 2%, but it is useless for tracking progress because the uncertainty is huge compared to the value itself.  If they had a million documents, the estimate would tell them that somewhere between 11 and 2,321 of them are relevant.  So, if they found 11 relevant documents with their predictive coding software they would estimate that they achieved somewhere between 0.5% and 100% recall.  To look at it a little differently, if they looked at their system’s prediction for the control set they would find that it either correctly predicted that the one relevant document was relevant (100% recall) or they would find that it was predicted incorrectly (0% recall), with dumb luck being a big factor in which result they got.
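
The interval quoted above can be reproduced with an exact (Clopper-Pearson) binomial confidence interval; here is a sketch of that calculation using scipy (my own calculation, not the calculator linked above):

```python
from scipy.stats import beta

x, n = 1, 2395                                           # 1 relevant document in the sample
lower = beta.ppf(0.025, x, n - x + 1)                    # Clopper-Pearson 95% lower bound
upper = beta.ppf(0.975, x + 1, n - x)                    # Clopper-Pearson 95% upper bound
print(f"prevalence: {lower:.4%} to {upper:.4%}")         # roughly 0.001% to 0.23%
print(f"in 1,000,000 docs: {lower * 1e6:,.0f} to {upper * 1e6:,.0f} relevant")   # roughly 11 to 2,300
```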

Maybe they intended that the control set contain 2,395 relevant documents, which would give a recall estimate accurate to +/- 2% with 95% confidence (more precise than really seems worthwhile for a control set) by measuring the percentage of relevant documents in the control set that are predicted correctly.  If prevalence is 10%, the control set would need to contain about 23,950 documents to have 2,395 that are relevant.  If prevalence is 1%, the control set would require about 239,500 documents.  That sure seems like a lot of documents to review just to create a control set.  The point is that it is the number of relevant documents in the control set, not the number of documents, that determines how precisely the control set can measure recall.  Their protocol does say that the requesting party will have ten business days to check the control set if it is more than 4,000 documents, so it does seem that they’ve contemplated the possibility of using more than 2,395 documents in the control set, but the details of what they are really planning to do are missing.  Of course, the control set is there to help the producing party optimize their process, so it is their loss if they get it wrong (assuming there is separate testing that would detect the problem, as described in section 4.f).

Finally, section 4.f on p. 16 talks about taking a Statistically Valid Sample from the documents that are predicted to be non-relevant to estimate the number of relevant documents that were missed by predictive coding, leading to a recall estimate.  This has the same problem as the culling in section 4.a.ii — the size of the sample that is required to achieve a desired level of uncertainty in the recall depends on the size of the set of documents being culled, whether the culling is due to keyword searching before applying predictive coding or whether the culling is due to discarding documents that the predictive coding system predicts are non-relevant.

If the goal is to arrive at a reasonably precise estimate of recall (and, I’m certainly not arguing that +/- 2% should be required), it is important to keep track of how the uncertainty from each sample propagates through to the final recall result (e.g. it may be multiplied by some large number of culled documents) when choosing an appropriate sample size.  I may be nitpicking, but it strikes me as odd to lay out a specific formula for calculating sample size and then not mention that it cannot be applied directly for the sampling that is actually being contemplated.