# ISO approves eDiscovery standards development

According to the Infosecurity article ISO approves eDiscovery standards development, an international standard is being developed that “addresses terminology, provides an overview of eDiscovery and ESI, and then addresses a range of technological and process challenges.”  The first working draft is due in early July, with comments on the draft due in early September.  A linked article in Enterprise Communications elaborates that  “The standard will also look to clarify any issues that aren’t directly dealt with in the Federal Rules of Civil Procedure.”

This is certainly an interesting development.  Since it is an international standard, it will be challenging to make it compatible with many different bodies of law.  Are they really doing something different from what EDRM already does?  Will it get much buy-in from the courts and practitioners in the field?

# Highlights from “Are Controversial Changes to the eDiscovery Rules Regarding Preservation and Proportionality on the Horizon?”

Here are some brief highlights from today’s webinar on proposed changes to the Federal Rules of Civil Procedure held by the eDiscovery Education Center and sponsored by LexisNexis.  The speakers were Michael R. Arkfeld, Judge John M. Facciola, Craig D. Ball, and Sean R. Gallagher.  The webinar was an hour long and packed with information and opinions — this is just a small sample.  The full webinar will be available for purchase within two weeks here.

The proposed changes will be open for public comment August through February.  The earliest that the new rules could go into effect is December 2015.  Gallagher encouraged people to get involved, and said the rules are still in play.

Proposed changes to 37(e) on failure to preserve discoverable information attempt to provide more clarity on the issue of sanctions.  Facciola said that fear of sanctions caused people to preserve everything.  The changes contemplate curative measures and provide factors for assessing the party’s conduct before imposing serious sanctions.  Ball said nobody has been seriously sanctioned in the past for making a good faith effort that went awry, but the new rules will put and end to hand wringing over the possibility of severe sanctions over negligence without willful misconduct.  Gallagher said the really egregious cases get the most attention, but the challenge is to create rules for more ordinary cases.  Litigants will find more comfort in having national rules instead of applying results from cases from different jurisdictions.

Ball raised the point that the new rules don’t seem to provide for sanctions if a party tries to destroy evidence but is unsuccessful in doing so.  Arkfeld questioned whether the rules would make sanctions more complicated to determine.  Facciola didn’t think so since the factors are easy to apply, and he said making them explicit eases the pain for everyone.  He said there have been some complaints because “anticipation of litigation” isn’t defined, but that would be too hard to do.  Gallagher said there have been similar concerns about “willful.”  Arkfeld asked whether the proposed change would give a pass to organizations so they don’t have to invest in controlling their ESI.  Ball said there is some concern, but the new rule doesn’t fundamentally change what judges were already doing.

Proposed changes to rule 26(b)(1) involve addition of language about proportionality and removal of language about discovery of inadmissible information that might reasonably be expected to lead to discovery of admissible evidence.   Facciola said this change is extremely important.  Ball was somewhat less excited about it, and raised concerns about whether meta data and information about companies’ IT systems would no longer be discoverable.  Gallagher was surprised that there hasn’t been more comment on the changes to rule 26, and thought there would be complaints in the future about the part that was removed.

Facciola said the language about proportionality was a technical change where things were moved around to make it clear that all discovery is subject to proportionality.  Gallagher thought that change would be significant to litigants, with more litigation over the meaning of proportionality and how it is applied, but that’s not necessarily a bad thing since things will become more efficient over the long-term.  Ball agreed and expressed concern about the possibility of litigants using their own interpretation of proportionality to limit the amount of preservation they do.  He warned that proportionality should be based on the opinion of a judge who is well-informed about the case, not the producing party’s opinion.

# Predictive Coding Performance and the Silly F1 Score

This article describes how to measure the performance of predictive coding algorithms for categorizing documents.  It describes the precision and recall metrics, and explains why the F1 score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant.  “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents).  Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list.  If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant.  Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant.  High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list.  The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents.  Have we found 80% of the relevant documents, or only 10%?  If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents.  As we increase n, the recall will increase each time we encounter another document that is truly relevant.  The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about.  We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values — the relationship is monotonic but not linear).  Click the graph to view the full-sized version:

The graph shows hypothetical results for three different algorithms.  Focus first on the blue curve representing the first algorithm.  At 10% recall it shows a precision of 69%.  So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant.  If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green.  For all values of recall it has a lower precision than the first algorithm.  For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall.  Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag.  For low recall (left side of graph) it has higher precision than any of the other algorithms.  For high recall (right side of graph) it has the worst precision of the three algorithms.  If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page.  E-Discovery is very different from web search in that regard — you need to find most of the relevant documents, not just the 10 or 20 best ones.  Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face.  Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall.  Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall.  In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant.  If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values.  What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.

As alluded to earlier, high precision is desirable to reduce the amount of manual document review.  Let’s make that idea more precise.  Suppose you are the producing party in a case.  You need to produce a large percentage of the responsive documents to satisfy your duty to the court.  You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive.  You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall?  Here is the formula (excluding training and validation sets):

$\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}$

The recall is the desired level you want to reach, and the precision is measured at that recall level.  The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall.  Here is a graph based on the precision vs. recall relationships from earlier:

If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least 25% of the documents ranked most highly by algorithm 1.  Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1.  If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents.  To achieve 70% recall you would need to review twice as many documents with algorithm 3 compared to algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.

Notice how the graph slopes upward more and more rapidly as you aim for higher recall because it becomes harder and harder to find a relevant document as more and more of the low hanging fruit gets picked.  So, what recall should you aim for in an actual case?  This is where you need to discuss the issue of proportionality with the court.  Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.”  The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall.  People often quote another metric, the F1 score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

$F_1 = \frac{1}{\frac{1}{2}(\frac{1}{\text{recall}}+\frac{1}{\text{precision}})} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The F1 score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F1 score are only possible if both the precision and recall are large.

Before explaining why the F1 score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense.  Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps).  The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic).  The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%.  Which doctor is better at identifying patients in need of additional testing?  The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4).  We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable.  Since the F1 score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F1 scores for the two doctors.  To compute F1 we need to compute the recall, which means that we need to know how many of the patients actually have cancer.  If 5 have cancer, the F1 scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher.  If 15 have cancer, the F1 scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher.  Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F1 score.