Monthly Archives: March 2015

Comments on Rio Tinto v. Vale and Sample Size

Judge Peck recently issued an opinion in Rio Tinto PLC v. Vale SA, et al., Case 1:14-cv-03042-RMB-AJP, in which he spent some time reflecting on the state of court acceptance of technology-assisted review (a.k.a. predictive coding).  The quote that will surely grab headlines is on page 2: “In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”  He lists the relevant cases and talks a bit about transparency and disclosing seed sets.  It is certainly worth reading.

Both parties in Rio Tinto v. Vale have agreed to disclose all non-privileged documents, including non-responsive documents, from their control sets, seed sets, and training sets. Judge Peck accepts their protocol because they both agree to it, but hints that disclosing seed sets may not really be necessary (p. 6, “…requesting parties can insure that training and review was done appropriately by other means…”).

I find one other aspect of the protocol the litigants proposed to be worthy of comment.  They make a point of defining a “Statistically Valid Sample” on p. 11 to be one that gives +/- 2% margin of error at 95% confidence, and even provide an equation to compute the sample size in footnote 2.  Their equation gives a sample size of at most 2,395 documents, depending on prevalence.  They then use the “Statistically Valid Sample” term in contexts where it isn’t (as they’ve defined it) directly appropriate.  I don’t know if this is just sloppiness (missing details about what they actually plan to do) or a misunderstanding of statistics.
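For reference, the textbook calculation behind this kind of sample size can be sketched in a few lines of Python.  This is an assumption about the general approach, not a reproduction of the protocol's footnote 2: the function name, the normal approximation, and the optional finite-population correction are illustrative choices, so the result need not match the 2,395 figure exactly.

```python
import math
from statistics import NormalDist

def sample_size(margin, confidence=0.95, prevalence=0.5, population=None):
    """Sample size for estimating a proportion to within +/- margin.

    Hypothetical helper using the standard normal approximation;
    the protocol's own formula may differ in its details.
    """
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    if population is not None:
        # Optional finite-population correction (an assumption here).
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size(0.02))                  # worst case: prevalence = 50%
print(sample_size(0.02, prevalence=0.1))  # smaller when prevalence is far from 50%
```

Under these assumptions the worst case comes out at about 2,401 documents.  The required size is largest at 50% prevalence and shrinks as prevalence moves toward 0% or 100%, which is why the figure is an upper bound.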

For example, section 4.a.ii on p. 13 contemplates culling before application of predictive coding, and says they will “Review a Statistically Valid Sample from the Excluded Documents.”  Kudos to them for actually measuring how many relevant documents they are culling instead of just assuming that keyword search results should be good enough without any analysis, but 2,395 documents is not the right sample size.  The more documents you are culling, the more precisely you need to know what proportion of them were relevant in order to have a reasonably precise value for the number of relevant documents culled, which is what matters for computing recall.  In other words, a +/- 2% measurement on the culled set does not mean +/- 2% for recall.  I described a similar situation in more detail in my Predictive Coding Confusion article under the heading “Beware small percentages of large numbers.”  My eRecall: No Free Lunch article also discusses similar issues.
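A small sketch makes the point concrete.  The numbers are hypothetical (50,000 relevant documents found in the retained set, 1,000,000 documents culled, and 1% of the sampled culled documents found to be relevant), and the function name is made up for illustration.  The sketch simply pushes the two ends of the +/- 2% interval through the recall calculation, treating the count of relevant documents found as exact.

```python
def recall_range(found, culled_size, sampled_rate, margin=0.02):
    """Return (low, point, high) recall estimates.

    found        -- relevant documents in the retained set (assumed exact, > 0)
    culled_size  -- number of documents culled
    sampled_rate -- proportion of the culled sample that was relevant
    margin       -- margin of error on sampled_rate (e.g. 0.02 for +/- 2%)
    """
    def recall(rate):
        missed = culled_size * rate  # estimated relevant documents culled
        return found / (found + missed)

    low_rate = max(0.0, sampled_rate - margin)
    high_rate = min(1.0, sampled_rate + margin)
    # More relevant documents culled means lower recall, so the high
    # end of the rate interval gives the low end of the recall interval.
    return recall(high_rate), recall(sampled_rate), recall(low_rate)

low, point, high = recall_range(found=50_000, culled_size=1_000_000,
                                sampled_rate=0.01)
print(f"recall between {low:.1%} and {high:.1%} (point estimate {point:.1%})")
```

With these made-up inputs, the +/- 2% on the culled set becomes anywhere from 0 to 30,000 relevant documents culled, so recall could be anywhere from 62.5% to 100%.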

Section 4.b on p. 13 says that the control set will be a Statistically Valid Sample that will be used to measure prevalence.  They explain in a separate letter to Judge Peck on p. 9 that the control set will be used to track progress by estimating precision and recall. Do they intend to use 2,395 (or fewer) documents for the control set?  Suppose only one of the 2,395 documents is actually relevant.  That would give a prevalence estimate of 0.0011% to 0.2321% with 95% confidence (via this calculator), which is certainly better than the required +/- 2%, but it is useless for tracking progress because the uncertainty is huge compared to the value itself.  If they had a million documents, the estimate would tell them that somewhere between 11 and 2,321 of them are relevant.  So, if they found 11 relevant documents with their predictive coding software they would estimate that they achieved somewhere between 0.5% and 100% recall.  To look at it a little differently, if they looked at their system’s prediction for the control set they would find that it either correctly predicted that the one relevant document was relevant (100% recall) or they would find that it was predicted incorrectly (0% recall), with dumb luck being a big factor in which result they got.
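For readers who want to check an interval like this without an online calculator, here is a minimal sketch of an exact binomial (Clopper-Pearson) interval using only the Python standard library.  The choice of method is an assumption, since the calculator's method is not stated here, and the function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k + 1):
        # Work in logs so large binomial coefficients do not overflow.
        log_coef = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1))
        total += math.exp(log_coef + i * math.log(p)
                          + (n - i) * math.log1p(-p))
    return total

def exact_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k relevant documents out of n sampled."""
    alpha = 1 - confidence

    def solve(k_max, target):
        # The CDF decreases as p increases, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(k_max, n, mid) > target:
                lo = mid  # CDF still too large, so p must be larger
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper

lower, upper = exact_interval(1, 2395)
print(f"prevalence between {lower:.4%} and {upper:.4%}")
```

For one relevant document out of 2,395 this gives an interval of roughly 0.001% to 0.23%, in line with the figures quoted above.  The upper end is about two hundred times the lower end, which is the real problem.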

Maybe they intended that the control set contain 2,395 relevant documents, which would give a recall estimate accurate to +/- 2% with 95% confidence (more precise than really seems worthwhile for a control set) by measuring the percentage of relevant documents in the control set that are predicted correctly.  If prevalence is 10%, the control set would need to contain about 23,950 documents to have 2,395 that are relevant.  If prevalence is 1%, the control set would require about 239,500 documents.  That sure seems like a lot of documents to review just to create a control set.  The point is that it is the number of relevant documents in the control set, not the number of documents, that determines how precisely the control set can measure recall.  Their protocol does say that the requesting party will have ten business days to check the control set if it is more than 4,000 documents, so it does seem that they’ve contemplated the possibility of using more than 2,395 documents in the control set, but the details of what they are really planning to do are missing.  Of course, the control set is there to help the producing party optimize their process, so it is their loss if they get it wrong (assuming there is separate testing that would detect the problem, as described in section 4.f).
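The arithmetic can be sketched as follows.  The helper names are hypothetical, and the margin calculation assumes the usual normal approximation with the worst case of 50% recall.

```python
import math
from statistics import NormalDist

def recall_margin(relevant_in_control, confidence=0.95, recall=0.5):
    """Approximate margin of error on recall measured from a control set.

    Only the relevant documents in the control set count toward the
    sample size; recall=0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(recall * (1 - recall) / relevant_in_control)

def control_set_size(relevant_needed, prevalence):
    """Expected number of documents to review to find relevant_needed."""
    return round(relevant_needed / prevalence)

print(f"{recall_margin(2395):.2%}")    # roughly +/- 2%
print(control_set_size(2395, 0.10))    # 10% prevalence
print(control_set_size(2395, 0.01))    # 1% prevalence
```

The margin depends only on the count of relevant documents, so at low prevalence the total review effort grows in inverse proportion to prevalence.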

Finally, section 4.f on p. 16 talks about taking a Statistically Valid Sample from the documents that are predicted to be non-relevant to estimate the number of relevant documents that were missed by predictive coding, leading to a recall estimate.  This has the same problem as the culling in section 4.a.ii.  The sample size required to achieve a desired level of uncertainty in the recall depends on the size of the set of documents being discarded, whether they are discarded by keyword searching before predictive coding is applied or by the predictive coding system predicting that they are non-relevant.

If the goal is to arrive at a reasonably precise estimate of recall (and I’m certainly not arguing that +/- 2% should be required), it is important to keep track of how the uncertainty from each sample propagates through to the final recall result (e.g., it may be multiplied by some large number of culled documents) when choosing an appropriate sample size.  I may be nitpicking, but it strikes me as odd to lay out a specific formula for calculating sample size and then not mention that it cannot be applied directly for the sampling that is actually being contemplated.
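One rough way to choose a sample size with the propagation in mind is to search for the smallest sample whose uncertainty, once carried through to recall, is within the target.  The sketch below is only an illustration under several assumptions: a normal approximation for the sampled proportion, a guessed prevalence in the discarded set, an exact count of relevant documents found, and no finite-population correction.  The function names and input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def recall_half_width(n, found, discarded_size, rate, confidence=0.95):
    """Half-width of the recall interval implied by a sample of size n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(rate * (1 - rate) / n)
    low_rate = max(0.0, rate - margin)
    high_rate = min(1.0, rate + margin)
    best = found / (found + discarded_size * low_rate)
    worst = found / (found + discarded_size * high_rate)
    return (best - worst) / 2

def sample_size_for_recall(target, found, discarded_size, rate,
                           confidence=0.95):
    """Smallest n whose recall half-width is at most target."""
    args = (found, discarded_size, rate, confidence)
    hi = 1
    while recall_half_width(hi, *args) > target:
        hi *= 2       # grow until the target is met
    lo = hi // 2      # known to be too small (or zero)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if recall_half_width(mid, *args) <= target:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical inputs: 50,000 relevant documents found, 1,000,000
# documents discarded, and a guess that 1% of them are relevant.
n = sample_size_for_recall(0.02, found=50_000, discarded_size=1_000_000,
                           rate=0.01)
print(n)
```

With these made-up inputs, the sample needed for roughly +/- 2% on recall comes out well above 2,395 documents, and it grows as the discarded set gets larger relative to the number of relevant documents found.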