Monthly Archives: February 2016

Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review

A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding.  This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.

The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review.  It recaps the Da Silva Moore v. Publicis Group case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.

Paragraph 33 enumerates ten reasons for approving predictive coding.  The second reason on the list is:

There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.

The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results.  If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled "H." TAR systems analyzed by Grossman and Cormack are "UW" and "H5." Error bars are 95% confidence intervals.

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled “H.” TAR systems analyzed by Grossman and Cormack are “UW” and “H5.” Error bars are 95% confidence intervals.

If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents having relevance scores above an appropriately calculated cutoff.  Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the large number of non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes).  If it is possible (through enough effort) to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results?  TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery).  Effectively, participants were told to choose their relevance score cutoffs in a way that tried to balance the desire for high recall with other concerns (high precision).  If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.

The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen).  The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article).  Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around.  You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall.  The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.

The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor.  Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search.  If we assume the high end of the price range applies to 3 million documents, that works out to $0.22 per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.