Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review

A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding.  This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.

The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review.  It recaps the Da Silva Moore v. Publicis Groupe case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.

Paragraph 33 enumerates ten reasons for approving predictive coding.  The second reason on the list is:

There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.

The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results.  If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled "H." TAR systems analyzed by Grossman and Cormack are "UW" and "H5." Error bars are 95% confidence intervals.

If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents having relevance scores above an appropriately calculated cutoff.  Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the large number of non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes).  If it is possible (through enough effort) to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results?  TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery).  Effectively, participants were told to choose their relevance score cutoffs in a way that tried to balance the desire for high recall with other concerns (high precision).  If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.
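The mechanics of choosing a cutoff can be sketched in a few lines of Python. The scores and labels below are made up for illustration (they are not TREC 2009 data), but they show the tension described above: the F1-maximizing cutoff can leave recall well below a target like 80%, and pushing the cutoff low enough to hit that target pulls in more non-relevant documents that must then be reviewed or produced.

```python
# Illustration with hypothetical relevance scores and ground-truth labels
# (made-up data, not TREC 2009 results): compare a recall-targeted cutoff
# with the F1-maximizing cutoff that TREC 2009 participants were told to
# aim for.

def recall_precision(scores, labels, cutoff):
    """Recall and precision if all docs with score >= cutoff are produced."""
    produced = [lab for s, lab in zip(scores, labels) if s >= cutoff]
    total_relevant = sum(labels)
    found = sum(produced)
    recall = found / total_relevant if total_relevant else 0.0
    precision = found / len(produced) if produced else 0.0
    return recall, precision

def cutoff_for_recall(scores, labels, target):
    """Highest cutoff whose recall meets the target (push low enough)."""
    for c in sorted(set(scores), reverse=True):
        r, _ = recall_precision(scores, labels, c)
        if r >= target:
            return c
    return min(scores)

def cutoff_max_f1(scores, labels):
    """Cutoff maximizing F1 = 2*P*R/(P+R); recall may end up low."""
    best_c, best_f1 = None, -1.0
    for c in sorted(set(scores)):
        r, p = recall_precision(scores, labels, c)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:
            best_c, best_f1 = c, f1
    return best_c

# Relevant docs (label 1) concentrated at the top, with a few stragglers
# buried among low-scoring non-relevant docs.
scores = [0.9, 0.85, 0.8, 0.75, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15]
labels = [1,   1,    1,   0,    0,   0,    1,   0,    0,   1,    0,   0]

c80 = cutoff_for_recall(scores, labels, target=0.8)
cf1 = cutoff_max_f1(scores, labels)
print(recall_precision(scores, labels, c80))  # 80% recall, lower precision
print(recall_precision(scores, labels, cf1))  # best F1, only 60% recall
```

Here the F1-optimal cutoff produces only 3 of the 5 relevant documents (60% recall), while hitting 80% recall requires dropping the cutoff far enough to sweep in several non-relevant documents, i.e., more review effort.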

The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen).  The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article).  Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around.  You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall.  The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.

The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor.  Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search.  If we assume the high end of the price range applies to 3 million documents, that works out to $0.22 per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.
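The per-document figure follows from simple arithmetic on the decision's numbers. The GBP-to-USD exchange rate below (roughly the 2016 rate) is my assumption, not a figure from the decision:

```python
# Per-document cost implied by the decision's figures. The exchange
# rate (~1.4 USD/GBP, roughly the 2016 rate) is an assumption, not a
# number from the decision.
high_end_cost_gbp = 469_049   # top of the estimated software cost range
num_documents = 3_000_000     # "over 3 million documents"
usd_per_gbp = 1.4             # assumed exchange rate

cost_per_doc_usd = high_end_cost_gbp * usd_per_gbp / num_documents
print(f"${cost_per_doc_usd:.2f} per document")  # → $0.22 per document
```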

6 thoughts on “Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review”

  1. ESC

    Hi Bill
    Do you know which program they will be using in the Pyrrho Investments v. MWB Property UK case? From the approved judgment document by Master Matthews dated 2 February 2016, it sounds like he describes a TAR 1.0 protocol on page 9, items #19 and #20. He talks about “Then a representative sample of the ‘included’ documents is used to ‘train’ the software….” Would you like to comment on the presumed protocol and methodology? Thanks!

    1. Bill Dimm

      I don’t know what they are using, but it does sound like TAR 1.0. When writing the article I contemplated mentioning that the Grossman & Cormack study used continuous active learning (TAR 2.0) for 4 of the 5 tasks analyzed (the “UW” results), and used a non-predictive coding TAR method for the other task (the “H5” result), making it especially odd to claim that those results justify the very different protocol that they’re describing for this case. I decided to leave that out to reduce confusion.

  2. ESC

    I’m also kind of surprised that the UK courts aren’t more hip to the advancements in predictive coding technology. Haven’t they been reading your blog? 🙂

  3. Pingback: Reacting to the reactions to the Pyrrho predictive coding judgment | eDisclosure Information Project

  4. ESC

    Apparently the word is that they are using Relativity (Assisted Review ergo Content Analyst CAAT analytics engine?) in Pyrrho. Does this mean a control set – based on a random sample, training rounds, simple passive learning -> latent semantic indexing? TAR 1.0?

    1. Bill Dimm

      I don’t know the details of what Content Analyst does, or whether it even enforces a particular workflow (it could be that it provides basic tools and vendors build what they want with it).
