# Misleading Metrics and Irrelevant Research (Accuracy and F1)

If one algorithm achieved 98.2% accuracy while another had 98.6% for the same task, would you be surprised to find that the first algorithm required ten times as much document review to reach 75% recall compared to the second algorithm?  This article explains why some performance metrics don’t give an accurate view of performance for ediscovery purposes, and why that makes a lot of research utilizing such metrics irrelevant for ediscovery.

The key performance metrics for ediscovery are precision and recall.  Recall, R, is the percentage of all relevant documents that have been found.  High recall is critical to defensibility.  Precision, P, is the percentage of documents predicted to be relevant that actually are relevant.  High precision is desirable to avoid wasting time reviewing non-relevant documents (if documents will be reviewed to confirm relevance and check for privilege before production).  In other words, precision is related to cost.  Specifically, 1/P is the average number of documents you’ll have to review per relevant document found.  When using technology-assisted review (predictive coding), documents can be sorted by relevance score and you can choose any point in the sorted list and compute the recall and precision that would be achieved by treating documents above that point as being predicted to be relevant.  One can plot a precision-recall curve by doing precision and recall calculations at various points in the sorted document list.

The precision-recall curve to the right compares two different classification algorithms applied to the same task.  To do a sensible comparison, we should compare precision values at the same level of recall.  In other words, we should compare the cost of reaching equally good (same recall) productions.  Furthermore, the recall level where the algorithms are compared should be one that is sensible for for ediscovery — achieving high precision at a recall level a court wouldn’t accept isn’t very useful.  If we compare the two algorithms at R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%.  In other words, if you sort by relevance score with the two algorithms and review documents from top down until 75% of the relevant documents are found, you would review 15.2 documents per relevant document found with 1-NN and 1.4 documents per relevant document found with 40-NN.  The 1-NN algorithm would require over ten times as much document review as 40-NN.  1-NN has been used in some popular TAR systems.  I explained why it performs so badly in a previous article.

There are many other performance metrics, but they can be written as a mixture of precision and recall (see Chapter 7 of the current draft of my book).  Anything that is a mixture of precision and recall should raise an eyebrow — how can you mix together two fundamentally different things (defensibility and cost) into a single number and get a useful result?  Such metrics imply a trade-off between defensibility and cost that is not based on reality.  Research papers that aren’t focused on ediscovery often use such performance measures and compare algorithms without worrying about whether they are achieving the same recall, or whether the recall is high enough to be considered sufficient for ediscovery.  Thus, many conclusions about algorithm effectiveness simply aren’t applicable for ediscovery because they aren’t based on relevant metrics.

One popular metric is accuracy, which is the percentage of predictions that are correct.  If a system predicts that none of the documents are relevant and prevalence is 10% (meaning 10% of the documents are relevant), it will have 90% accuracy because its predictions were correct for all of the non-relevant documents.  If prevalence is 1%, a system that predicts none of the documents are relevant achieves 99% accuracy.  Such incredibly high numbers for algorithms that fail to find anything!  When prevalence is low, as it often is in ediscovery, accuracy makes everything look like it performs well, including algorithms like 1-NN that can be a disaster at high recall.  The graph to the right shows the accuracy-recall curve that corresponds to the earlier precision-recall curve (prevalence is 2.633% in this case), showing that it is easy to achieve high accuracy with a poor algorithm by evaluating it at a low recall level that would not be acceptable for ediscovery.  The maximum accuracy achieved by 1-NN in this case was 98.2% and the max for 40-NN was 98.6%.  In case you are curious, the relationship between accuracy, precision, and recall is:
$ACC = 1 - \rho (1 - R) - \rho R (1 - P) / P$
where $\rho$ is the prevalence.

Another popular metric is the F1 score.  I’ve criticized its use in ediscovery before.  The relationship to precision and recall is:
$F_1 = 2 P R / (P + R)$
The F1 score lies between the precision and the recall, and is closer to the smaller of the two.  As far as F1 is concerned, 30% recall with 90% precision is just as good as 90% recall with 30% precision (both give F1 = 0.45) even though the former probably wouldn’t be accepted by a court and the latter would.   F1 cannot be large at small recall, unlike accuracy, but it can be moderately high at modest recall, making it possible to achieve a decent F1 score even if performance is disastrously bad at the high recall levels demanded by ediscovery.  The graph to the right shows that 1-NN manages to achieve a maximum F1 of 0.64, which seems pretty good compared to the 0.73 achieved by 40-NN, giving no hint that 1-NN requires ten times as much review to achieve 75% recall in this example.

Hopefully this article has convinced you that it is important for research papers to use the right metric, specifically precision (or review effort) at high recall, when making algorithm comparisons that are useful for ediscovery.

# Webinar: 10 Years Forward and Back: Automation in eDiscovery

George Socha, Doug Austin, David Horrigan, Bill Dimm, and Bill Speros will give presentations in this webinar on the history and future of ediscovery moderated by Mary Mack on December 1, 2016.  Bill Dimm will talk about the evolution of predictive coding technologies and our understanding of best practices, including recall estimation, the evil F1 score, research efforts, pre-culling, and the TAR 1.0, 2.0, and 3.0 workflows.  CLICK HERE FOR RECORDING OF WEBINAR, SLIDES, AND LINKS TO RELATED RESOURCES.

# Using Extrapolated Precision for Performance Measurement

This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Precision,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015 (slides now available here).  The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall (e.g., some predictive coding systems produce a binary yes/no prediction instead of a relevance score, so the user cannot select the recall that will be achieved).

High recall (finding most of the relevant documents) is important in e-discovery for defensibility.  High precision is desirable to ensure that there aren’t a lot of non-relevant documents mixed in with the relevant ones (i.e., high precision reduces the cost of review for responsiveness and privilege).  Making judgments about the relative performance of two predictive coding systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F1 score that combine precision and recall into a single number to allow performance comparisons.  Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for e-discovery (I’ve written about problems with the  F1 score before).  To understand the problem, it is useful to look at how F1 varies as a function of the recall where it is measured.  Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F1 score corresponding to each point on the precision-recall curve superimposed:

If we pick a single point from the precision-recall curve and compute the value of F1 for that point, the resulting F1 is very sensitive to the precision-recall point we choose.  F1 is maximized at 46% recall in the graph on the right, which means that the trade-off between precision and recall that F1 deems to be reasonable implies that it is not worthwhile to produce more than 46% of the relevant documents for that task because precision suffers too much when you push to higher recall.  That is simply not compatible with the needs of e-discovery.  In e-discovery, the trade-off  between precision (cost) and recall required should be dictated by proportionality, not by some performance measure that is oblivious to the value of the case.  Other problems with the F1 score are detailed in the paper.

The strong dependence that F1 has on recall as we move along the precision-recall curve means that it is easy to draw wrong conclusions about which system is performing better when performance is measured at different levels of recall.  This strong dependence on recall occurs because the contours of equal F1 are not shaped like precision-recall curves, so a precision-recall curve will cut across many contours.   In order to have the freedom to measure performance at recall levels that are relevant for e-discovery (e.g., 75% or higher) without drawing wrong conclusions about which system is performing best, the paper proposes a performance measure that has constant-performance contours that are shaped like precision-recall curves, so the performance measure depends much less on the recall level where the measurement is made than F1 does. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made.  This graph compares the constant-performance contours for F1 to the measure proposed in the paper:

Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall.  This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:

Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:

The blue performance curves are much flatter than the red F1 curves from the first figure, so the value is much less sensitive to the recall level where it is measured.  As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the document review needed (excluding training and testing) to reach 75% recall.  For more details, read the paper or attend my talk at DESI VI.

# Predictive Coding Performance and the Silly F1 Score

This article describes how to measure the performance of predictive coding algorithms for categorizing documents.  It describes the precision and recall metrics, and explains why the F1 score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant.  “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents).  Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list.  If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant.  Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant.  High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list.  The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents.  Have we found 80% of the relevant documents, or only 10%?  If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents.  As we increase n, the recall will increase each time we encounter another document that is truly relevant.  The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about.  We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values — the relationship is monotonic but not linear).  Click the graph to view the full-sized version:

The graph shows hypothetical results for three different algorithms.  Focus first on the blue curve representing the first algorithm.  At 10% recall it shows a precision of 69%.  So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant.  If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green.  For all values of recall it has a lower precision than the first algorithm.  For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall.  Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag.  For low recall (left side of graph) it has higher precision than any of the other algorithms.  For high recall (right side of graph) it has the worst precision of the three algorithms.  If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page.  E-Discovery is very different from web search in that regard — you need to find most of the relevant documents, not just the 10 or 20 best ones.  Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face.  Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall.  Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall.  In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant.  If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values.  What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.

As alluded to earlier, high precision is desirable to reduce the amount of manual document review.  Let’s make that idea more precise.  Suppose you are the producing party in a case.  You need to produce a large percentage of the responsive documents to satisfy your duty to the court.  You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive.  You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall?  Here is the formula (excluding training and validation sets):

$\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}$

The recall is the desired level you want to reach, and the precision is measured at that recall level.  The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall.  Here is a graph based on the precision vs. recall relationships from earlier:

If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least 25% of the documents ranked most highly by algorithm 1.  Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1.  If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents.  To achieve 70% recall you would need to review twice as many documents with algorithm 3 compared to algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.

Notice how the graph slopes upward more and more rapidly as you aim for higher recall because it becomes harder and harder to find a relevant document as more and more of the low hanging fruit gets picked.  So, what recall should you aim for in an actual case?  This is where you need to discuss the issue of proportionality with the court.  Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.”  The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall.  People often quote another metric, the F1 score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

$F_1 = \frac{1}{\frac{1}{2}(\frac{1}{\text{recall}}+\frac{1}{\text{precision}})} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The F1 score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F1 score are only possible if both the precision and recall are large.

Before explaining why the F1 score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense.  Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps).  The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic).  The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%.  Which doctor is better at identifying patients in need of additional testing?  The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4).  We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable.  Since the F1 score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F1 scores for the two doctors.  To compute F1 we need to compute the recall, which means that we need to know how many of the patients actually have cancer.  If 5 have cancer, the F1 scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher.  If 15 have cancer, the F1 scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher.  Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F1 score.