Tag Archives: recall

Webinar: 10 Years Forward and Back: Automation in eDiscovery

George Socha, Doug Austin, David Horrigan, Bill Dimm, and Bill Speros will give presentations in this webinar on the history and future of ediscovery moderated by Mary Mack on December 1, 2016.  Bill Dimm will talk about the evolution of predictive coding technologies and our understanding of best practices, including recall estimation, the evil F1 score, research efforts, pre-culling, and the TAR 1.0, 2.0, and 3.0 workflows.  CLICK HERE FOR RECORDING OF WEBINAR, SLIDES, AND LINKS TO RELATED RESOURCES.

Gain Curves

You may already be familiar with the precision-recall curve, which describes the performance of a predictive coding system.  Unfortunately, the precision-recall curve doesn’t (normally) display any information about the cost of training the system, so it isn’t convenient when you want to compare the effectiveness of different training methodologies.  This article looks at the gain curve, which is better suited for that purpose.

The gain curve shows how the recall achieved depends on the number of documents reviewed (slight caveat to that at the end of the article).  Recall is the percentage of all relevant documents that have been found.  High recall is important for defensibility.  Here is an example of a gain curve (click to enlarge):


The first 12,000 documents reviewed in this example are randomly selected documents used to train the system.  Prevalence is very low in this case (0.32%), so finding relevant documents using random selection is hard.  The system needs to be exposed to a large enough number of relevant training documents for it to learn what they look like so it can make good predictions for the relevance of the remaining documents.

After the 12,000 training documents are reviewed the system orders the remaining documents to put the ones that are most likely to be relevant (based on patterns detected during training) at the top of the list.  To distinguish the training phase from the review phase I’ve shown the training phase as a solid line and review phase as a dashed line.  Review of the remaining documents starts at the top of the sorted list.  The gain curve is very steep at the beginning of the review phase because most of the documents being reviewed are relevant, so they have a big impact on recall.  As the review progresses the gain curve becomes less steep because you end up reviewing documents that are less likely to be relevant.  Review proceeds until a desired level of recall, such as 75% (the horizontal dotted line), is achieved.  The goal is to find the system and workflow that achieves the recall target at the lowest cost (i.e., the one that crosses the dotted line farthest to the left, with some caveats below).

What is the impact of using the same system with a larger or smaller number of randomly selected training documents?  This figure shows the gain curves for 9,000 and 15,000 training documents in addition to the 12,000 training document curve seen earlier:


If the goal is to reach 75% recall, 12,000 is the most efficient option among the three considered because it crosses the horizontal dotted line with the least document review.  If the target was a lower level of recall, such as 70%, 9,000 training documents would be a better choice.  A larger number of training documents usually leads to better predictions (the gain curve stays steep longer during the review phase), but there is a point where the improvement in the predictions isn’t worth the cost of reviewing additional training documents.

The discussion above assumed that the cost of reviewing a document during the training phase is the same as the cost of reviewing a document during the review phase.  That will not be the case if expensive subject matter experts are used to review the training documents and low-cost contract reviewers are used for the review phase.  In that situation, the optimal result is less straightforward to identify from the gain curve.

In some situations it may be possible produce documents without reviewing them if there is no concern about disclosing privileged documents (because there are none or because they are expected to be easy to identify by looking at things like the sender/recipient email address) or non-relevant documents (because there is no concern about them containing trade secrets or evidence of bad acts not covered by the current litigation).  When it is okay to produce documents without reviewing them, the document review associated with the dashed part of the curve can be eliminated in whole or in part.  For example, documents predicted to be relevant with high confidence may be produced without review (unless they are identified as potential privileged), whereas documents with a lower likelihood of being relevant might be reviewed to avoid disclosing too many non-relevant documents.  Again, the gain curve would not show the optimal choice in a direct way–you would need to balance the potential harm (even if small) of producing non-relevant documents against the cost of additional training.

The predictive coding process described in this article, random training documents followed by review (with no additional learning by the algorithm), is sometimes known as Simple Passive Learning (SPL), which is one example of a TAR 1.0 workflow.  To determine the optimal point to switch from training to review with TAR 1.0, a random set of documents known as a control set is reviewed and used to monitor learning progress by comparing the predictions for the control set documents to their actual relevance tags.  Other workflows and analysis of their efficiency via gain curves will be the subject of my next article.

Using Extrapolated Precision for Performance Measurement

This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Precision,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015 (slides now available here).  The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall (e.g., some predictive coding systems produce a binary yes/no prediction instead of a relevance score, so the user cannot select the recall that will be achieved).

High recall (finding most of the relevant documents) is important in e-discovery for defensibility.  High precision is desirable to ensure that there aren’t a lot of non-relevant documents mixed in with the relevant ones (i.e., high precision reduces the cost of review for responsiveness and privilege).  Making judgments about the relative performance of two predictive coding systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F1 score that combine precision and recall into a single number to allow performance comparisons.  Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for e-discovery (I’ve written about problems with the  F1 score before).  To understand the problem, it is useful to look at how F1 varies as a function of the recall where it is measured.  Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F1 score corresponding to each point on the precision-recall curve superimposed:

f1_compare2If we pick a single point from the precision-recall curve and compute the value of F1 for that point, the resulting F1 is very sensitive to the precision-recall point we choose.  F1 is maximized at 46% recall in the graph on the right, which means that the trade-off between precision and recall that F1 deems to be reasonable implies that it is not worthwhile to produce more than 46% of the relevant documents for that task because precision suffers too much when you push to higher recall.  That is simply not compatible with the needs of e-discovery.  In e-discovery, the trade-off  between precision (cost) and recall required should be dictated by proportionality, not by some performance measure that is oblivious to the value of the case.  Other problems with the F1 score are detailed in the paper.

The strong dependence that F1 has on recall as we move along the precision-recall curve means that it is easy to draw wrong conclusions about which system is performing better when performance is measured at different levels of recall.  This strong dependence on recall occurs because the contours of equal F1 are not shaped like precision-recall curves, so a precision-recall curve will cut across many contours.   In order to have the freedom to measure performance at recall levels that are relevant for e-discovery (e.g., 75% or higher) without drawing wrong conclusions about which system is performing best, the paper proposes a performance measure that has constant-performance contours that are shaped like precision-recall curves, so the performance measure depends much less on the recall level where the measurement is made than F1 does. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made.  This graph compares the constant-performance contours for F1 to the measure proposed in the paper:


Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall.  This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:


Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:


The blue performance curves are much flatter than the red F1 curves from the first figure, so the value is much less sensitive to the recall level where it is measured.  As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the document review needed (excluding training and testing) to reach 75% recall.  For more details, read the paper or attend my talk at DESI VI.

Can You Really Compete in TREC Retroactively?

I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football).  I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.

One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years.  Still, how could you do the comparison if you wanted to?  The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails.  Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.

Stratified sampling means that the documents are separated into mutually-exclusive buckets called “strata.”  To the degree that stratification manages to put similar things into the same stratum, it can produce better statistical estimates (smaller uncertainty for a given amount of document review).  The TREC Legal Track for 2009 created a stratum containing documents that all participants agreed were relevant.  It also created four strata containing documents that all but one participant predicted were relevant (there were four participants, so one stratum for each dissenting participant).  There were six strata where two participants agreed on relevance, and four strata where only one of the four participants predicted the documents were relevant.  Finally, there was one stratum containing documents that all participants predicted were non-relevant, which was called the “All-N” stratum.  So, for each stratum a particular participant either predicted that all of the documents were relevant or they predicted that all of the documents were non-relevant.  You can view details about the strata in table 21 on page 39 here.  Here is an example of what a stratification might look like for just two participants (the number of documents shown and percentage that are relevant may differ from the actual data):


A random subset of documents from each stratum was chosen and reviewed so that the percentage of the documents in the stratum that were relevant could be estimated.  Multiplying that percentage by the number of documents in the stratum gives an estimate for the number of relevant documents in the stratum.  Combining the results for the various strata allows precision and recall estimates to be computed for each participant.  How could this be done for a team that didn’t participate?  Before presenting some ideas, it will be useful to have some notation:

N[i] = number of documents in stratum i
n[i] = num docs in i that were assessed by TREC
n+[i] = num docs in i that TREC assessed as relevant
V[i] = num docs in i that vendor predicted were relevant
v[i] = num docs in i that vendor predicted were relevant and were assessed by TREC
v+[i] = num docs in i that vendor predicted were relevant and assessed as relevant by TREC


To make some of the discussion below more concrete, I’ll provide formulas for computing the number of true positives (TP), false positives (FP), and false negatives (FN).  The recall and precision can then be computed from:

R = TP / (TP + FN)
P = TP / (TP + FP)

Here are some ideas I came up with:

1) They could have checked to see which strata the documents they predicted to be relevant fell into and applied the percentages TREC computed to their data.  The problem is that since they probably didn’t identify all of the documents in a stratum as being relevant the percentage of documents that were estimated to be relevant for the stratum by TREC wouldn’t really be applicable.  If their system worked really well, they may have only predicted that the truly relevant documents from the stratum were relevant.  If their system worked badly, their system may have predicted that only the truly non-relevant documents from the stratum were relevant.  This approach could give estimates that are systematically too low or too high.  Here are the relevant formulas (summing over strata, i):

TP = Sum{ V[i] * n+[i] / n[i] }
FP = Sum{ V[i] * (1 – n+[i]/n[i]) }
FN = Sum{ (N[i] – V[i]) * n+[i] / n[i] }

2) Instead of using the percentages computed by TREC, they could have computed their own percentages by looking at only the documents in the stratum that they predicted were relevant and were reviewed by TREC to give a relevance determination.  This would eliminate the possible bias from approach (1), but it also means that the percentages would be computed from a smaller sample, so the uncertainty in the percentage that are relevant would be bigger.  The vendor didn’t provide confidence intervals for their results.  Here is how the computation would go:

TP = Sum{ V[i] * v+[i] / v[i] }
FP = Sum{ V[i] * (1 – v+[i]/v[i]) }
FN = Sum{ (N[i] – V[i]) * (n+[i] – v+[i]) / (n[i] – v[i]) }

It’s possible that for some strata there would be no overlap between the documents TREC assessed and the documents the vendor predicted to be relevant since TREC typically assessed only about 4% of each stratum for Topic 207 (except the All-N stratum, where they assessed only 0.46%).  This approach wouldn’t work for those strata since v[i] would be 0.  For strata where v[i] is 0, one might use approach (1) and hope it isn’t too wrong.

3) A more sophisticated tweak on (2) would be to use the ratio n+[i]/n[i] from (1) to generate a Bayesian prior probability distribution for the proportion of documents predicted by the vendor to be relevant that actually are relevant, and then use v+[i] and v[i] to compute a posterior distribution for that proportion and use the mean of that distribution instead of v+[i]/v[i] in the computation. The idea is to have a smooth interpolation between using n+[i]/n[i] and using v+[i]/v[i] as the proportion of documents estimated to be relevant, where the interpolation would be closer to v+[i]/v[i] if v[i] is large (i.e., if there is enough data for v+[i]/v[i] to be reasonably accurate).  The result would be sensitive to choices made in creating the Bayesian prior (i.e., how much variance to give the probability distribution), however.

4) They could have ignored all of the documents that weren’t reviewed in TREC (over 500,000 of them) and just performed their predictions and analysis on the 3,709 documents that had relevance assessments (training documents should come from the set TREC didn’t assess and should be reviewed by the vendor to simulate actual training at TREC being done by the participants).  It would be very important to weight the results to compensate for the fact that those 3,709 documents didn’t all have the same probability of being selected for review.  TREC oversampled the documents that were predicted to be relevant compared to the remainder (i.e., the number of documents sampled from a stratum was not simply proportional to the number of documents in the stratum), which allowed their stratification scheme to do a good job of comparing the participating teams to each other at the expense of having large uncertainty for some quantities like the total number of relevant documents.  The prevalence of relevant documents in the full population was 1.5%, but 9.0% of the documents having relevance assessments were relevant.  Without weighting the results to compensate for the uneven sampling, you would be throwing away over half a million non-relevant documents without giving the system being tested the opportunity to incorrectly predict that some of them are relevant, which would lead to an inflated precision estimate.  The expression “shooting fish in a barrel” comes to mind.  Weighting would be accomplished by dividing by the probability of the document having been chosen (after this article was published I learned that this is called the Horvitz-Thompson estimator, and it is what the TREC evaluation toolkit uses), which is just n[i]/N[i], so the computation would be:

TP = Sum{ (N[i]/n[i]) * v+[i] }
FP = Sum{ (N[i]/n[i]) * (v[i] – v+[i]) }
FN = Sum{ (N[i]/n[i]) * (n+[i] – v+[i]) }

Note that if N[i]/n[i] is equal to V[i]/v[i], which is expected to be approximately true since the subset of a stratum chosen for assessment by TREC is random, the result would be equal to that from (2).  If N[i]/n[i] is not equal to V[i]/v[i] for a stratum, we would have the disturbing result that the estimate for TP+FP for that stratum would not equal the number of documents the vendor predicted to be relevant for that stratum, V[i].

5) The vendor could have ignored the TREC relevance determinations, simply doing their own.  That would be highly biased in the vendor’s favor because there would be a level of consistency between relevance determinations for the training data and testing data that did not exist for TREC participants.  At TREC the participants made their own relevance determinations to train their systems and a separate set of Topic Authorities made the final relevance judgments that determined the performance numbers.  To the degree that participants came to different conclusions about relevance compared to the Topic Authorities, their performance numbers would suffer.  A more subtle problem with this approach is that the vendor’s interpretation of the relevance criteria would inevitably be somewhat different from that of TREC assessors (studies have shown poor agreement between different review teams), which could make the classification task either easier or harder for a computer.  As an extreme example, if the vendor took all documents containing the word “football” to be relevant and all other documents to be non-relevant, it would be very easy for a predictive coding system to identify that pattern and achieve good performance numbers.

Approaches (1)-(4) would all give the same results for the original TREC participants because for each stratum they would either have V[i]=0 (so v[i]=0 and v+[i]=0) or they would have V[i]=N[i] (so v[i]=n[i] and v+[i]=n+[i]).  The approaches differ in how they account for the vendor predicting that only a subset of a stratum is relevant.  None of the approaches described are great.  Is there a better approach that I missed? TREC designed their strata to make the best possible comparisons between the participants.  It’s hard to imagine how an analysis could be as accurate for a system that was not taken into account in the stratification process.  If a vendor is tempted to make such comparisons, they should at least disclose their methodology and provide confidence intervals on their results so prospective clients can determine whether the performance numbers are actually meaningful.

eRecall: No Free Lunch

There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.

Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant

Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant

It requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR involves using only random sampling for training, you may have a very good estimate of the prevalence of relevant documents in the full population by simply measuring it on your (potentially large) training set, so you multiply the prevalence by the total number of documents to get TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents to measure the percentage of them that were relevant, i.e. FractionMissed (commonly known as the false omission rate or elusion). How many?

To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities.  Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents.  For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents.  With this approximation, we have:

eRecall = 1 – FractionMissed / Prevalence

It is the very small prevalence value in the denominator that is the killer–it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.

Let’s try some specific numbers.  Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%.  Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result.  If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):

FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)

Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence.  Not a very tight error bar!

If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:

FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)

Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.

For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall).  Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.

So, the number of samples required for eRecall is about the same as the Direct Recall method if you require a comparable amount of certainty in the result.  There’s no free lunch to be found here.

Fair Comparison of Predictive Coding Performance

Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well.  It is important for users of such software to avoid drawing wrong conclusions from performance numbers.

Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):precision_recall_for_diff_tasks

The one on the left is incredibly good, with 97% precision at 90% recall.  The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).

Why are the two curves so different?  They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample of documents used for training.  The only difference is the categorization task being attempted, i.e. what type of document we consider to be relevant.  Both tasks have nearly the same prevalence of relevant documents (0.986% for the left and 1.131% for the right), but the task on the left is very easy and the one on the right is a lot harder.  So, when a vendor quotes performance numbers, you need to keep in mind that they are only meaningful for the specific document set and task that they came from.  Performance for a different task or document set may be very different.  Comparing a vendor’s performance numbers to those from another source computed for a different categorization task on a different document set would be comparing apples to oranges.

Fair comparison of different predictive coding approaches is difficult, and one must be careful not to extrapolate results from any study too far.  As an analogy, consider performing experiments to determine whether fertilizer X works better than fertilizer Y.  You might plant marigolds in each fertilizer, apply the same amount of water and sunlight, and measure plant growth.  In other words, keep everything the same except the fertilizer.  That would give a result that applies to marigolds with the specific amount of sunlight and water used.  Would the same result occur for carrots?  You might take several different types of plants and apply the same experiment to each to see if there is a consistent winner.  What if more water was used?  Maybe fertilizer X works better for modest watering (it absorbs and retains water better) and fertilizer Y works better for heavy watering.  You might want to present results for different amounts of water so people could choose the optimal fertilizer for the amount of rainfall in their locations.  Or, you might determine the optimal amount of water for each, and declare the fertilizer that gives the most growth with its optimal amount of water the winner, which is useful only if gardeners/farmers can adjust water delivery.  The number of experiments required to cover every possibility grows exponentially with the number of parameters that can be adjusted.

Predictive coding is more complicated because there are more interdependent parts that can be varied.  Comparing classification algorithms on one document set may give a result that doesn’t apply to others, so you might test on several document sets (some with long documents, some with short, some with high prevalence, some with low, etc.), much like testing fertilizer on several types of plants, but that still doesn’t guarantee that a consistent winner will perform best on some untested set of documents.  Does a different algorithm win if the amount of training data is higher/lower, similar to a different fertilizer winning if the amount of water is changed?  What if the nature of the training data (e.g., random sample vs. active learning) is changed?  The training approach can impact different classification algorithms differently (e.g., an active learning algorithm can be optimized for a specific classification algorithm), making the results from a study on one classification algorithm inapplicable to a different algorithm.  When comparing two classification algorithms where one is known to perform poorly for high-dimensional data, should you use feature selection techniques to reduce the dimensionality of the data for that algorithm under the theory that that is how it would be used in practice, but knowing that any poor performance may come from removing an important feature rather than from a failure of the classification algorithm itself?

What you definitely should not do is plant a cactus in fertilizer X and a sunflower in fertilizer Y and compare the growth rates to draw a conclusion about which fertilizer is better.  Likewise, you should not compare predictive coding performance numbers that came from different document sets or categorization tasks.

Predictive Coding Performance and the Silly F1 Score

This article describes how to measure the performance of predictive coding algorithms for categorizing documents.  It describes the precision and recall metrics, and explains why the F1 score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant.  “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents).  Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list.  If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant.  Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant.  High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list.  The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents.  Have we found 80% of the relevant documents, or only 10%?  If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents.  As we increase n, the recall will increase each time we encounter another document that is truly relevant.  The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about.  We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values — the relationship is monotonic but not linear).  Click the graph to view the full-sized version:

graph_precisionThe graph shows hypothetical results for three different algorithms.  Focus first on the blue curve representing the first algorithm.  At 10% recall it shows a precision of 69%.  So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant.  If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green.  For all values of recall it has a lower precision than the first algorithm.  For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall.  Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag.  For low recall (left side of graph) it has higher precision than any of the other algorithms.  For high recall (right side of graph) it has the worst precision of the three algorithms.  If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page.  E-Discovery is very different from web search in that regard — you need to find most of the relevant documents, not just the 10 or 20 best ones.  Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face.  Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall.  Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall.  In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant.  If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values.  What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.

As alluded to earlier, high precision is desirable to reduce the amount of manual document review.  Let’s make that idea more precise.  Suppose you are the producing party in a case.  You need to produce a large percentage of the responsive documents to satisfy your duty to the court.  You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive.  You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall?  Here is the formula (excluding training and validation sets):

\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}

The recall is the desired level you want to reach, and the precision is measured at that recall level.  The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall.  Here is a graph based on the precision vs. recall relationships from earlier:


If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least 25% of the documents ranked most highly by algorithm 1.  Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1.  If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents.  To achieve 70% recall you would need to review twice as many documents with algorithm 3 compared to algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.

Notice how the graph slopes upward more and more rapidly as you aim for higher recall because it becomes harder and harder to find a relevant document as more and more of the low hanging fruit gets picked.  So, what recall should you aim for in an actual case?  This is where you need to discuss the issue of proportionality with the court.  Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.”  The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall.  People often quote another metric, the F1 score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

F_1 = \frac{1}{\frac{1}{2}(\frac{1}{\text{recall}}+\frac{1}{\text{precision}})} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

The F1 score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F1 score are only possible if both the precision and recall are large.

Before explaining why the F1 score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense.  Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps).  The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic).  The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%.  Which doctor is better at identifying patients in need of additional testing?  The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4).  We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable.  Since the F1 score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F1 scores for the two doctors.  To compute F1 we need to compute the recall, which means that we need to know how many of the patients actually have cancer.  If 5 have cancer, the F1 scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher.  If 15 have cancer, the F1 scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher.  Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F1 score.

The harmonic mean is commonly used to combine rates.  For example, you should be able to convince yourself that the appropriate way to compute the average MPG fuel efficiency rating for a fleet of cars is to take the harmonic mean (not the arithmetic mean) of the MPG values of the individual cars.  But, the F1 score is the harmonic mean of two rates having different meanings, not the same rate measured for two different objects.  It’s like adding the length of your foot to the length of your arm.  They are both lengths, but does the result from adding them really make any sense?  A 10% change in the length of your arm would have much more impact than a 10% change in the length of your foot, so maybe you should add two times the length of your foot to your arm.  Or, maybe add three times the length of your foot to your arm.  The relative weighting of your foot and arm lengths is rather arbitrary since the sum you are calculating doesn’t have any specific use that could nail down the appropriate weighting.  The weighting of precision vs. recall in the F1 score is similarly arbitrary.  If you want to weight the recall more heavily, there is a metric called F2 that does that.  If you want to weight the precision more heavily, F0.5 does that.  In fact, there is a whole spectrum of F measures offering any weighting you want — you can find the formula in Wikipedia.  In our example of doctors screening for cancer, what is the right weighting to appropriately balance the potential loss of life by missing a cancer (low recall) against the cost and suffering of doing additional testing on many more patients that don’t have cancer (higher recall obtained at the expense of lower precision)?  I don’t know the answer, but it is almost certainly not F1.  Likewise, what is the appropriate weighting for predictive coding?  Probably not F1.

Why did we turn to the F1 score when comparing doctors doing cancer screenings?  We did it because we had two different recall values for the doctors, so we couldn’t compare precision values directly.  We used the F1 score to adjust for the tradeoff between precision and recall, but we did so with a weighting that was arbitrary (sometimes pronounced “wrong”).  Why were we stuck with two different recall values for the two doctors?  Unlike a predictive coding algorithm, we can’t ask a doctor to rank a group of patients based on how likely he/she thinks it is that each of them has cancer.  The doctor either feels a lump, or he/she doesn’t.  We might expand the doctor’s options to include “maybe” in addition to “yes” and “no,” but we can’t expect the doctor to say that one patient is a 85.39 score for cancer while another is a 79.82 so we can get a definite ordering. We don’t have that problem (normally) when we want to compare predictive coding algorithms — we can choose whatever recall level we are interested in and measure the precision of all algorithms at that recall, so we can compare apples to apples instead of apples to oranges.

Furthermore, a doctor’s ability to choose an appropriate threshold for sending people for additional testing is part of the measure of his/her ability, so we should allow him/her to decide how many people to send for additional testing, not just which people, and measure whether his/her choice strikes the right balance to achieve the best outcomes, which necessitates comparing different levels of recall for different doctors.  In predictive coding it is not the algorithm’s job to decide when we should stop looking for additional relevant documents — that is dictated by proportionality.  If the litigation is over a relatively small amount of money, a modest target recall may be accepted to keep review costs reasonable relative to the amount of money at stake.  If a great deal of money is at stake, pushing for a high recall that will require reviewing a lot of irrelevant documents may be warranted.  The point is that the appropriate tradeoff between low recall with high precision and high recall with lower precision depends on the economics of the case, so it cannot be captured by a statistic with fixed (arbitrary) weight like the F1 score.

Here is a graph of the F1 score for the three algorithms we’ve been looking at:


Remember that the F1 score can only be large if both the recall and precision are large.  At the left edge of the chart the recall is low, so the F1 score is small.  At the right edge the recall is high but the precision is typically low, so the F1 score is small.  Note that algorithm 1 has its maximum F1 score of 0.264 at 62% recall, while algorithm 3 has its maximum F1 score of 0.242 at 44% recall.  Comparing maximum F1 scores to identify the best algorithm is really an apples to oranges comparison (comparing values at different recall levels), and in this case it would lead you to conclude that algorithm 3 is the second best algorithm when we know that it is by far the worst algorithm at high recall.  Of course, you might retort that algorithms should be compared by comparing F1 scores at the same recall level instead of comparing maximum F1 scores, but the F1 score would really serve no purpose in that case — we could just compare precision values.

In summary, recall and precision are metrics that relate very directly to important aspects of document review — the need to identify a substantial portion of the relevant documents (recall), and the need to keep costs down by avoiding review of irrelevant documents (precision).  A predictive coding algorithm orders the document list to put the documents that are expected to have the best chance of being relevant at the top.  As the reviewer works his/her way down the document list recall will increase (more relevant documents found), but precision will typically decrease (increasing percentage of documents are not relevant).  The F1 score attempts to combine the precision and recall in a way that allows comparisons at different levels of recall by balancing increasing recall against decreasing precision, but it does so with a weighting between the two quantities that is arbitrary rather than reflecting the economics of the case.  It is better to compare algorithm performance by comparing precision at the same level of recall, with the recall chosen to be reasonable for a case.

Note:  You can read about an alternative to F1 that is more appropriate for e-discovery here.