Tag Archives: recall

eRecall: No Free Lunch

There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.

Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant

Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant

The calculation requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR involves using only random sampling for training, you may already have a very good estimate of the prevalence of relevant documents in the full population by simply measuring it on your (potentially large) training set, so you multiply the prevalence by the total number of documents to get TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents to measure the percentage of them that were relevant, i.e. FractionMissed (commonly known as the false omission rate or elusion). How many documents would that sample need to contain?

To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities.  Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents.  For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents.  With this approximation, we have:

eRecall = 1 – FractionMissed / Prevalence

It is the very small prevalence value in the denominator that is the killer: it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.
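To make the amplification explicit, here is a short worked version of the simplification. It relies on the same assumptions as the text (TotalRelevant is known exactly and equals Prevalence × TotalDocuments, and TotalDocumentsCulled is approximately TotalDocuments). The symbol Δ, meaning the half-width of an error bar, is notation introduced here for illustration.

$$
\text{eRecall} = 1 - \frac{\text{FractionMissed} \times \text{TotalDocumentsCulled}}{\text{Prevalence} \times \text{TotalDocuments}} \approx 1 - \frac{\text{FractionMissed}}{\text{Prevalence}}
$$

$$
\Delta\text{eRecall} \approx \frac{\Delta\text{FractionMissed}}{\text{Prevalence}}
$$

With a prevalence of 1%, the error bar on FractionMissed is multiplied by 100, so an uncertainty of ±0.05 percentage points in FractionMissed becomes roughly ±5 percentage points in eRecall.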

Let’s try some specific numbers.  Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%.  Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result.  If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):

FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)

Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence.  Not a very tight error bar!

If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:

FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)

Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.

For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall).  Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.

So, the number of samples required for eRecall is about the same as for the Direct Recall method if you require a comparable amount of certainty in the result.  There’s no free lunch to be found here.
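The intervals quoted above can be checked with a short script. This is only a sketch: it assumes the "exact" interval is the Clopper-Pearson interval (the article does not say which method its calculator uses), it treats the 1% prevalence as known exactly, and the function names are invented for illustration. It uses only the Python standard library and finds the interval endpoints by bisection.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def bisect_root(f, lo, hi, iterations=100):
    """Root of an increasing function f on [lo, hi], where f(lo) < 0 < f(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(hits, n, confidence=0.95):
    """Clopper-Pearson interval for a proportion of `hits` out of `n` samples."""
    tail = (1 - confidence) / 2
    # Lower bound: the p at which seeing `hits` or more has probability `tail`.
    lower = 0.0 if hits == 0 else bisect_root(
        lambda p: (1 - binom_cdf(hits - 1, n, p)) - tail, 0.0, hits / n)
    # Upper bound: the p at which seeing `hits` or fewer has probability `tail`.
    upper = 1.0 if hits == n else bisect_root(
        lambda p: tail - binom_cdf(hits, n, p), hits / n, 1.0)
    return lower, upper

def erecall_interval(relevant_in_sample, sample_size, prevalence):
    """Map the FractionMissed interval through eRecall = 1 - FractionMissed / Prevalence."""
    fm_low, fm_high = exact_interval(relevant_in_sample, sample_size)
    return 1 - fm_high / prevalence, 1 - fm_low / prevalence

for relevant, size in ((10, 4000), (100, 40000)):
    low, high = erecall_interval(relevant, size, prevalence=0.01)
    print(f"eRecall, {size} culled documents sampled: {low:.0%} to {high:.0%}")

low, high = exact_interval(300, 400)
print(f"Direct Recall, 400 relevant documents found: {low:.1%} to {high:.1%}")
```

If the Clopper-Pearson assumption matches the calculator, the printed ranges should land close to the figures quoted in the article. Note that both the 40,000-document eRecall sample and the 40,000-document Direct Recall sample end up with an interval roughly ten percentage points wide.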

Fair Comparison of Predictive Coding Performance

Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well.  It is important for users of such software to avoid drawing wrong conclusions from performance numbers.

Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):

[Figure: precision-recall curves for two different categorization tasks]

The one on the left is incredibly good, with 97% precision at 90% recall.  The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).

Why are the two curves so different?  They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample of documents used for training.  The only difference is the categorization task being attempted, i.e. what type of document we consider to be relevant.  Both tasks have nearly the same prevalence of relevant documents (0.986% for the left and 1.131% for the right), but the task on the left is very easy and the one on the right is a lot harder.  So, when a vendor quotes performance numbers, you need to keep in mind that they are only meaningful for the specific document set and task that they came from.  Performance for a different task or document set may be very different.  Comparing a vendor’s performance numbers to those from another source computed for a different categorization task on a different document set would be comparing apples to oranges.

Fair comparison of different predictive coding approaches is difficult, and one must be careful not to extrapolate results from any study too far.  As an analogy, consider performing experiments to determine whether fertilizer X works better than fertilizer Y.  You might plant marigolds in each fertilizer, apply the same amount of water and sunlight, and measure plant growth.  In other words, keep everything the same except the fertilizer.  That would give a result that applies to marigolds with the specific amount of sunlight and water used.  Would the same result occur for carrots?  You might take several different types of plants and apply the same experiment to each to see if there is a consistent winner.  What if more water was used?  Maybe fertilizer X works better for modest watering (it absorbs and retains water better) and fertilizer Y works better for heavy watering.  You might want to present results for different amounts of water so people could choose the optimal fertilizer for the amount of rainfall in their locations.  Or, you might determine the optimal amount of water for each, and declare the fertilizer that gives the most growth with its optimal amount of water the winner, which is useful only if gardeners/farmers can adjust water delivery.  The number of experiments required to cover every possibility grows exponentially with the number of parameters that can be adjusted.

Predictive coding is more complicated because there are more interdependent parts that can be varied.  Comparing classification algorithms on one document set may give a result that doesn’t apply to others, so you might test on several document sets (some with long documents, some with short, some with high prevalence, some with low, etc.), much like testing fertilizer on several types of plants, but that still doesn’t guarantee that a consistent winner will perform best on some untested set of documents.  Does a different algorithm win if the amount of training data is higher/lower, similar to a different fertilizer winning if the amount of water is changed?  What if the nature of the training data (e.g., random sample vs. active learning) is changed?  The training approach can impact different classification algorithms differently (e.g., an active learning algorithm can be optimized for a specific classification algorithm), making the results from a study on one classification algorithm inapplicable to a different algorithm.  When comparing two classification algorithms where one is known to perform poorly for high-dimensional data, should you use feature selection techniques to reduce the dimensionality of the data for that algorithm under the theory that that is how it would be used in practice, but knowing that any poor performance may come from removing an important feature rather than from a failure of the classification algorithm itself?

What you definitely should not do is plant a cactus in fertilizer X and a sunflower in fertilizer Y and compare the growth rates to draw a conclusion about which fertilizer is better.  Likewise, you should not compare predictive coding performance numbers that came from different document sets or categorization tasks.

Predictive Coding Performance and the Silly F1 Score

This article describes how to measure the performance of predictive coding algorithms for categorizing documents.  It describes the precision and recall metrics, and explains why the F1 score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant.  “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents).  Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list.  If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant.  Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant.  High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list.  The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents.  Have we found 80% of the relevant documents, or only 10%?  If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents.  As we increase n, the recall will increase each time we encounter another document that is truly relevant.  The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about.  We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values; the relationship is monotonic but not linear).

[Figure: precision vs. recall for three hypothetical algorithms]

The graph shows hypothetical results for three different algorithms.  Focus first on the blue curve representing the first algorithm.  At 10% recall it shows a precision of 69%.  So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant.  If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green.  For all values of recall it has a lower precision than the first algorithm.  For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall.  Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag.  For low recall (left side of graph) it has higher precision than any of the other algorithms.  For high recall (right side of graph) it has the worst precision of the three algorithms.  If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page.  E-Discovery is very different from web search in that regard — you need to find most of the relevant documents, not just the 10 or 20 best ones.  Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face.  Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall.  Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall.  In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant.  If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values.  What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.
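The precision and recall calculations described above can be sketched in a few lines of Python. The function name and the 20-document ranking are hypothetical, chosen only so that the prevalence matches the 15% used in the example. The ranking does not come from any of the three algorithms.

```python
def precision_recall_points(ranked_labels):
    """Return (n, precision, recall) for every cutoff n down a ranked list.

    ranked_labels holds True for relevant and False for not relevant documents,
    ordered from the algorithm's highest-ranked document to its lowest.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("recall is undefined when no documents are relevant")
    points = []
    found = 0
    for n, is_relevant in enumerate(ranked_labels, start=1):
        found += bool(is_relevant)
        # Precision: share of the top n that is relevant.
        # Recall: share of all relevant documents that is in the top n.
        points.append((n, found / n, found / total_relevant))
    return points

# Hypothetical ranking of 20 documents, 3 of them relevant (15% prevalence).
ranked = [True, False, True, False, False, False, False, True] + [False] * 12
for n, precision, recall in precision_recall_points(ranked):
    if n in (1, 3, 8, 20):
        print(f"n={n:2d}  precision={precision:.2f}  recall={recall:.2f}")
```

At the final cutoff every document has been selected, so the precision equals the prevalence (3/20 = 15%) and the recall is 100%. That is why every curve ends at the same point regardless of how good the ranking is.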

As alluded to earlier, high precision is desirable to reduce the amount of manual document review.  Let’s make that idea more precise.  Suppose you are the producing party in a case.  You need to produce a large percentage of the responsive documents to satisfy your duty to the court.  You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive.  You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall?  Here is the formula (excluding training and validation sets):

$$\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}$$

The recall is the desired level you want to reach, and the precision is measured at that recall level.  The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall.  Here is a graph based on the precision vs. recall relationships from earlier:

[Figure: fraction of the document set to review vs. recall for the three algorithms]

If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least the top-ranked 25% of the document set as ordered by algorithm 1.  Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1.  If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents.  To achieve 70% recall you would need to review twice as many documents with algorithm 3 as with algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.
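The arithmetic in this example is easy to check in code. The sketch below simply evaluates the review-fraction formula given earlier. The function name is made up for illustration, and the 21% precision for algorithm 3 is not read from the graph but inferred from the statement that it is half of algorithm 1’s 42%.

```python
def fraction_to_review(prevalence, recall, precision):
    """Fraction of the document set reviewed to reach `recall` at the given precision."""
    # prevalence * recall is the share of the whole set that is relevant and found;
    # dividing by precision adds the non-relevant documents reviewed along the way.
    return prevalence * recall / precision

prevalence = 0.15  # 15% of the document set is responsive in the example

print(f"algorithm 1: {fraction_to_review(prevalence, 0.70, 0.42):.0%}")
# Assumed precision: half of algorithm 1's 42%, as the text states.
print(f"algorithm 3: {fraction_to_review(prevalence, 0.70, 0.21):.0%}")
```

Because prevalence and target recall are fixed for a given case, halving the precision doubles the review effort.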

Notice how the graph slopes upward more and more rapidly as you aim for higher recall because it becomes harder and harder to find a relevant document as more and more of the low-hanging fruit gets picked.  So, what recall should you aim for in an actual case?  This is where you need to discuss the issue of proportionality with the court.  Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.”  The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall.  People often quote another metric, the F1 score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

$$F_1 = \frac{1}{\frac{1}{2}\left(\frac{1}{\text{recall}}+\frac{1}{\text{precision}}\right)} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The F1 score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F1 score are only possible if both the precision and recall are large.

Before explaining why the F1 score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense.  Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps).  The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic).  The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%.  Which doctor is better at identifying patients in need of additional testing?  The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4).  We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable.  Since the F1 score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F1 scores for the two doctors.  To compute F1 we need to compute the recall, which means that we need to know how many of the patients actually have cancer.  If 5 have cancer, the F1 scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher.  If 15 have cancer, the F1 scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher.  Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F1 score.
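The doctors’ F1 scores can be reproduced directly from the counts given in this example. The sketch below uses a hypothetical helper, `f1_score`, that takes the number of cancers found, the number of patients flagged for further testing, and the assumed total number of cancers.

```python
def f1_score(cancers_found, patients_flagged, total_cancers):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = cancers_found / patients_flagged
    recall = cancers_found / total_cancers
    return 2 * precision * recall / (precision + recall)

for total_cancers in (5, 15):
    doctor_1 = f1_score(3, 50, total_cancers)
    doctor_2 = f1_score(4, 70, total_cancers)
    better = "first" if doctor_1 > doctor_2 else "second"
    print(f"{total_cancers} cancers: F1 = {doctor_1:.4f} vs {doctor_2:.4f}, "
          f"{better} doctor scores higher")
```

Nothing about either doctor’s screening changes between the two scenarios. Only the assumed number of cancers changes, and that alone is enough to flip the ranking.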

The harmonic mean is commonly used to combine rates.  For example, you should be able to convince yourself that the appropriate way to compute the average MPG fuel efficiency rating for a fleet of cars is to take the harmonic mean (not the arithmetic mean) of the MPG values of the individual cars, assuming each car is driven the same distance.  But, the F1 score is the harmonic mean of two rates having different meanings, not the same rate measured for two different objects.  It’s like adding the length of your foot to the length of your arm.  They are both lengths, but does the result from adding them really make any sense?  A 10% change in the length of your arm would have much more impact than a 10% change in the length of your foot, so maybe you should add two times the length of your foot to your arm.  Or, maybe add three times the length of your foot to your arm.  The relative weighting of your foot and arm lengths is rather arbitrary since the sum you are calculating doesn’t have any specific use that could nail down the appropriate weighting.  The weighting of precision vs. recall in the F1 score is similarly arbitrary.  If you want to weight the recall more heavily, there is a metric called F2 that does that.  If you want to weight the precision more heavily, F0.5 does that.  In fact, there is a whole spectrum of F measures offering any weighting you want; you can find the formula in Wikipedia.  In our example of doctors screening for cancer, what is the right weighting to appropriately balance the potential loss of life by missing a cancer (low recall) against the cost and suffering of doing additional testing on many more patients that don’t have cancer (higher recall obtained at the expense of lower precision)?  I don’t know the answer, but it is almost certainly not F1.  Likewise, what is the appropriate weighting for predictive coding?  Probably not F1.
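To see how much the choice of weighting matters, here is a small sketch using the standard F-beta definition, F_β = (1 + β²) × precision × recall / (β² × precision + recall), which gives F1 at β = 1, F2 at β = 2 and F0.5 at β = 0.5. It reuses the two doctors’ made-up numbers and assumes the scenario in which 5 patients have cancer. The function name is illustrative.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (precision, recall) for each doctor, assuming 5 patients actually have cancer.
doctor_1 = (3 / 50, 3 / 5)
doctor_2 = (4 / 70, 4 / 5)

for beta in (0.5, 1, 2):
    score_1 = f_beta(*doctor_1, beta)
    score_2 = f_beta(*doctor_2, beta)
    better = "first" if score_1 > score_2 else "second"
    print(f"beta={beta}: {score_1:.4f} vs {score_2:.4f}, {better} doctor scores higher")
```

With these numbers, F0.5 and F1 favor the first doctor while F2 favors the second, so the “winner” depends entirely on a weighting that nothing in the problem pins down.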

Why did we turn to the F1 score when comparing doctors doing cancer screenings?  We did it because we had two different recall values for the doctors, so we couldn’t compare precision values directly.  We used the F1 score to adjust for the tradeoff between precision and recall, but we did so with a weighting that was arbitrary (sometimes pronounced “wrong”).  Why were we stuck with two different recall values for the two doctors?  Unlike a predictive coding algorithm, we can’t ask a doctor to rank a group of patients based on how likely he/she thinks it is that each of them has cancer.  The doctor either feels a lump, or he/she doesn’t.  We might expand the doctor’s options to include “maybe” in addition to “yes” and “no,” but we can’t expect the doctor to say that one patient scores 85.39 for cancer while another scores 79.82 so we can get a definite ordering.  We don’t have that problem (normally) when we want to compare predictive coding algorithms: we can choose whatever recall level we are interested in and measure the precision of all algorithms at that recall, so we can compare apples to apples instead of apples to oranges.

Furthermore, a doctor’s ability to choose an appropriate threshold for sending people for additional testing is part of the measure of his/her ability, so we should allow him/her to decide how many people to send for additional testing, not just which people, and measure whether his/her choice strikes the right balance to achieve the best outcomes, which necessitates comparing different levels of recall for different doctors.  In predictive coding it is not the algorithm’s job to decide when we should stop looking for additional relevant documents; that is dictated by proportionality.  If the litigation is over a relatively small amount of money, a modest target recall may be accepted to keep review costs reasonable relative to the amount of money at stake.  If a great deal of money is at stake, pushing for a high recall that will require reviewing a lot of irrelevant documents may be warranted.  The point is that the appropriate tradeoff between low recall with high precision and high recall with lower precision depends on the economics of the case, so it cannot be captured by a statistic with a fixed (arbitrary) weighting like the F1 score.

Here is a graph of the F1 score for the three algorithms we’ve been looking at:

[Figure: F1 score vs. recall for the three algorithms]

Remember that the F1 score can only be large if both the recall and precision are large.  At the left edge of the chart the recall is low, so the F1 score is small.  At the right edge the recall is high but the precision is typically low, so the F1 score is small.  Note that algorithm 1 has its maximum F1 score of 0.264 at 62% recall, while algorithm 3 has its maximum F1 score of 0.242 at 44% recall.  Comparing maximum F1 scores to identify the best algorithm is really an apples to oranges comparison (comparing values at different recall levels), and in this case it would lead you to conclude that algorithm 3 is the second best algorithm when we know that it is by far the worst algorithm at high recall.  Of course, you might retort that algorithms should be compared by comparing F1 scores at the same recall level instead of comparing maximum F1 scores, but the F1 score would really serve no purpose in that case — we could just compare precision values.

In summary, recall and precision are metrics that relate very directly to important aspects of document review: the need to identify a substantial portion of the relevant documents (recall), and the need to keep costs down by avoiding review of irrelevant documents (precision).  A predictive coding algorithm orders the document list to put the documents that are expected to have the best chance of being relevant at the top.  As the reviewer works his/her way down the document list, recall will increase (more relevant documents found), but precision will typically decrease (an increasing percentage of the documents reviewed are not relevant).  The F1 score attempts to combine the precision and recall in a way that allows comparisons at different levels of recall by balancing increasing recall against decreasing precision, but it does so with a weighting between the two quantities that is arbitrary rather than reflecting the economics of the case.  It is better to compare algorithm performance by comparing precision at the same level of recall, with the recall chosen to be reasonable for a case.

Note:  You can read more about performance measures here, and there is an article on an alternative to F1 that is more appropriate for e-discovery here.