Tag Archives: predictive coding


Predictive coding software analyzes documents that a human reviewer has tagged as relevant or not relevant to learn to identify relevant documents. The software may produce a binary yes/no relevance prediction for each unreviewed document, or it may produce a relevance score. This article aims to clear up some confusion about what the relevance score measures, which should make its importance clear.

Ralph Losey’s recent article “Relevancy Ranking is the Key Feature of Predictive Coding Software” generated some debate and controversy reflected in the readers’ comments. To appreciate the value in producing a relevance score or ranking, rather than just a yes/no relevance prediction for each document, it is critical to understand what the relevance score really measures.

Predictive coding software that produces a relevance score allows documents to be ordered, or ranked, while software that produces a yes/no relevance prediction only allows documents to be separated into two unordered sets. What does the ordering generated by the relevance score mean? What causes a document to float to the top when predictive coding is applied? The documents at the top are not the most relevant documents, contrary to misconceptions encouraged by sloppy language on the subject. Rather, they are the documents that the algorithm thinks are most likely to be relevant. If there is a “hot document” or a “smoking gun” in the document set, there is no reason to believe that it will be the first, or even the second document in the ordered list. The algorithm will put the document it is most confident is relevant at the top of the list.

What gives an algorithm confidence that a document is relevant, causing it to move to the top of the list? That depends on the algorithm. An algorithm may have high confidence if the document is highly similar to one of the human-reviewed documents from the training set. If the training set doesn’t contain any hot documents, it is unlikely that any hot documents in the rest of the document set will get high scores. An algorithm may have high confidence if the document contains a particular word with high frequency (number of occurrences divided by the number of words in the document). If presence of the word “fraud” is seen as a good indicator that the document is relevant, an algorithm may take a document using the word “fraud” frequently to have a high probability of relevance, and therefore assign a high score. It will assign a higher score to a document saying “Fraud is not tolerated here. Fraud will be reported to the police.” than to a document saying “If we get caught, we’re going to go to jail for fraud.”
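The word-frequency example above can be made concrete with a small sketch. The function name `term_frequency` and the simple tokenizer are assumptions made for illustration; a real predictive coding product would use its own tokenization and many more features than a single word.

```python
import re


def term_frequency(text, word):
    """Occurrences of `word` divided by the number of words in `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(word.lower()) / len(words)


doc_policy = "Fraud is not tolerated here. Fraud will be reported to the police."
doc_admission = "If we get caught, we're going to go to jail for fraud."

tf_policy = term_frequency(doc_policy, "fraud")
tf_admission = term_frequency(doc_admission, "fraud")

# An algorithm that scores purely on this frequency ranks the harmless
# policy statement above the incriminating admission.
print(f"policy: {tf_policy:.3f}  admission: {tf_admission:.3f}")
```

With this tokenizer both documents contain 12 words, so the policy statement (two occurrences) gets twice the frequency of the admission (one occurrence), even though the admission is the document a litigator would care about.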

This diagram gives a reasonable, but simplified, picture of what a training set might look like to a predictive coding classification algorithm:

The dots along the top represent documents that were reviewed by a human, with relevant documents shown in orange and non-relevant documents shown in blue. As you move from left to right some feature that is a good indicator of relevance (meaning that it helps in distinguishing relevant documents from non-relevant documents) increases in frequency. For example, the frequency of a word like “fraud” might increase as you go to the right. The vertical position of the documents doesn’t matter — you can think of vertical position as specifying the frequency of a feature that is not a good indicator of relevance (e.g. the frequency of the word “office”). The algorithm is commonly given only yes/no values for relevance as input, so it has no way of knowing whether some of the orange documents are more important than others. What it does know is that a document has a higher probability of being relevant, assuming that the training set is representative of the document set in general, if it is farther to the right. The algorithm might even estimate the probability of a document being relevant based on how far to the right it is, as shown by the red curve below the dots.

When the algorithm is asked to make predictions for a document that has not been reviewed by a human, it can measure the frequency of the indicator word (e.g. “fraud”) for the document and spit out the probability estimate that was generated from the training data. Rather than the actual probability estimate, it could use any monotonically increasing function of the probability estimate (i.e. any quantity that increases when the probability increases) as the relevance score, since that yields the same ordering of the documents.
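Here is a rough sketch of that idea, assuming a made-up training set and a very crude probability model (the fraction of relevant training documents in each frequency bin). The data, the bin edges, and all function names are hypothetical; real algorithms fit something far more sophisticated, but the point about ordering is the same.

```python
from bisect import bisect_right

# Hypothetical training set: (frequency of the indicator word, human tag).
training = [
    (0.00, False), (0.01, False), (0.01, False), (0.02, True),
    (0.03, False), (0.04, True), (0.05, True), (0.06, False),
    (0.07, True), (0.08, True), (0.09, True), (0.10, True),
]

# Assumed bin edges, giving three bins: [0, 0.03), [0.03, 0.06), [0.06, inf).
BIN_EDGES = [0.03, 0.06]


def bin_index(frequency):
    return bisect_right(BIN_EDGES, frequency)


def fit_bin_probabilities(examples):
    """Fraction of relevant training documents in each frequency bin."""
    relevant = [0] * (len(BIN_EDGES) + 1)
    total = [0] * (len(BIN_EDGES) + 1)
    for frequency, is_relevant in examples:
        i = bin_index(frequency)
        total[i] += 1
        relevant[i] += int(is_relevant)
    return [r / t for r, t in zip(relevant, total)]


probabilities = fit_bin_probabilities(training)


def probability_of_relevance(frequency):
    return probabilities[bin_index(frequency)]


def relevance_score(frequency):
    # Any increasing function of the probability would do; here, a 0-100 scale.
    return 100 * probability_of_relevance(frequency)


# Unreviewed documents, described only by their indicator-word frequency.
unreviewed = {"doc_a": 0.01, "doc_b": 0.045, "doc_c": 0.09}

by_probability = sorted(
    unreviewed, key=lambda d: probability_of_relevance(unreviewed[d]), reverse=True
)
by_score = sorted(
    unreviewed, key=lambda d: relevance_score(unreviewed[d]), reverse=True
)
print(by_probability, by_score)
```

Both orderings come out the same, which is why the scale of the score (0 to 1, 0 to 100, or anything else that increases with the probability) doesn't matter for ranking.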

Documents with a high relevance score are the documents that have the highest probability (as far as the algorithm can tell) of being relevant. They are the documents that the algorithm is confident are relevant. They are the low-hanging fruit if your goal is to find the largest number of relevant documents without wading through a lot of non-relevant documents. They are not necessarily the most informative or most relevant documents. Rather, they are the relevant documents that are easiest for the algorithm to find. Documents with a very low relevance score are the documents that have the lowest probability of being relevant. What about the documents between those extremes? Those are the documents where the algorithm just isn’t sure. The features that the algorithm uses to separate the documents that are relevant from those that aren’t simply don’t work well for those documents. Actual relevance for those documents may depend on words that the algorithm isn’t paying attention to, or it may depend on a very specific ordering of the words when the algorithm is ignoring word order, or it may depend on a subtle contextual detail that only a human could appreciate.

The relevance score indicates how well the algorithm understands the document based on experience with other similar documents in the training set, so it is clearly specific to the algorithm and is not an inherent property of the document. Different algorithms will do a better/worse job of finding features that separate relevant documents from non-relevant documents, and different algorithms will do a better/worse job of modeling the probability of a document being relevant based on experience with the training set. All of these things will impact the precision-recall curve.

At this point, the difference between an algorithm that produces a relevance score and an algorithm that produces a yes/no relevance prediction should be clear. An algorithm that produces a score “knows what it doesn’t know,” while an algorithm that produces a yes/no doesn’t (or it knows and refuses to tell you). An algorithm that produces a score tells you which documents it is uncertain about. You could always convert a score to a yes/no by picking an arbitrary cutoff for the score, say 50, and proclaiming that only documents reaching that threshold are predicted to be relevant. So, two documents with scores of 49 and 50, which might even be near-dupes, would be predicted to be non-relevant and relevant respectively. That should make the arbitrariness of a yes/no result clear. Whenever you produce a yes/no result, even if it doesn’t involve an explicit score and cutoff, there will be the possibility that two very similar documents will generate completely different outputs because a binary output does not allow the similarity of the inputs to be expressed.
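The cutoff example above takes only a few lines to sketch. The document names and the `predict_relevant` helper are hypothetical, and the cutoff of 50 is just the arbitrary value used in the text.

```python
def predict_relevant(score, cutoff=50):
    """Collapse a relevance score into a yes/no prediction."""
    return score >= cutoff


# Two near-duplicate documents whose scores differ by a single point.
near_duplicates = {"draft_v1": 49, "draft_v2": 50}

predictions = {
    name: predict_relevant(score) for name, score in near_duplicates.items()
}
print(predictions)
```

The scores say the two documents are almost indistinguishable; the yes/no output says they are opposites. Once the score is thrown away, there is no way to recover that the algorithm considered them nearly identical.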

If you are the producing party in a case, and you plan to review all documents before producing them, you can produce the largest number of responsive documents (one can certainly debate whether that is the most appropriate goal) at the lowest cost by ordering the documents by the relevance score to bring the low-hanging fruit to the top. As you work your way down the document list, you will have to review more and more non-responsive documents for each responsive document you hope to find (think about going from right to left in the diagram above). As described in my previous article, you can compute how many documents you’ll need to review to achieve a desired level of recall using the precision-recall curve, and the appropriate stopping point (recall level) for the case can be determined from proportionality and the cost curve. If the algorithm produces a yes/no prediction instead of a relevance score, you won’t have a precision-recall curve, just a single precision-recall point. The stopping point will be dictated by the software rather than proportionality and the value of the case.

Sidenote: The diagram shows the simplest possible example to illustrate the ideas in the article. It is, in fact, so simple that it would apply equally well to a keyword search for a single word if the search algorithm used word frequency to order the search results (many do). A more realistic example would involve many dimensions and relevance score contours that could not be achieved with a search engine, but the idea still holds — middling scores should occur for documents that reside in areas of the document space where relevant and non-relevant documents aren’t well-separated.

Predictive Coding Performance and the Silly F1 Score


This article describes how to measure the performance of predictive coding algorithms for categorizing documents. It describes the precision and recall metrics, and explains why the F₁ score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant. “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents). Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list. If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant. Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant. High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list. The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents. Have we found 80% of the relevant documents, or only 10%? If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents. As we increase n, the recall will increase each time we encounter another document that is truly relevant. The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about. We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values; the relationship is monotonic but not linear).

The graph shows hypothetical results for three different algorithms. Focus first on the blue curve representing the first algorithm. At 10% recall it shows a precision of 69%. So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant. If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green. For all values of recall it has a lower precision than the first algorithm. For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall. Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag. For low recall (left side of graph) it has higher precision than any of the other algorithms. For high recall (right side of graph) it has the worst precision of the three algorithms. If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page. E-Discovery is very different from web search in that regard: you need to find most of the relevant documents, not just the 10 or 20 best ones. Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face. Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall. Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall. In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant. If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values. What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.
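The precision and recall measurements described above, taken at every cutoff n in a ranked list, can be sketched as follows. The ten-document ranking is invented for illustration (it has 40% prevalence, not the 15% of the graphs), and `precision_recall_curve` is a hypothetical helper name.

```python
def precision_recall_curve(ranked_labels):
    """Precision and recall for every cutoff n in a ranked document list.

    `ranked_labels` holds the true relevance (True/False) of each document,
    ordered from highest relevance score to lowest.
    Returns a list of (n, precision, recall) tuples.
    """
    total_relevant = sum(ranked_labels)
    curve = []
    found = 0
    for n, is_relevant in enumerate(ranked_labels, start=1):
        found += int(is_relevant)
        curve.append((n, found / n, found / total_relevant))
    return curve


# Hypothetical ranking of 10 documents, 4 of which are truly relevant.
ranked = [True, True, False, True, False, False, True, False, False, False]

curve = precision_recall_curve(ranked)
for n, precision, recall in curve:
    print(f"n={n:2d}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy ranking, the top 4 documents give 75% precision and 75% recall. Reviewing all 10 documents gives 100% recall, but the precision falls to 40%, which is just the prevalence of relevant documents in the set.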

As alluded to earlier, high precision is desirable to reduce the amount of manual document review. Let’s make that idea more precise. Suppose you are the producing party in a case. You need to produce a large percentage of the responsive documents to satisfy your duty to the court. You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive. You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall? Here is the formula (excluding training and validation sets):

$\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}$

The recall is the desired level you want to reach, and the precision is measured at that recall level. The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall. Here is a graph based on the precision vs. recall relationships from earlier:

If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least 25% of the documents ranked most highly by algorithm 1. Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1. If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents. To achieve 70% recall you would need to review twice as many documents with algorithm 3 compared to algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.
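The arithmetic above is easy to check with a short sketch of the review-fraction formula. The function name is made up for illustration, and the 21% precision for algorithm 3 is inferred from the statement that its precision at 70% recall is half that of algorithm 1.

```python
def fraction_to_review(prevalence, recall, precision):
    """Fraction of the document set that must be reviewed to reach `recall`.

    `precision` is the precision measured at that recall level.
    Training and validation sets are excluded, as in the formula above.
    """
    return prevalence * recall / precision


prevalence = 0.15     # 15% of the document set is responsive
target_recall = 0.70  # we want to find 70% of the responsive documents

algorithm_1 = fraction_to_review(prevalence, target_recall, precision=0.42)
algorithm_3 = fraction_to_review(prevalence, target_recall, precision=0.21)

print(f"algorithm 1: review {algorithm_1:.0%} of the document set")
print(f"algorithm 3: review {algorithm_3:.0%} of the document set")
```

Halving the precision doubles the review effort for the same recall, which is the whole reason precision at the target recall is the number to compare.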

Notice how the graph slopes upward more and more rapidly as you aim for higher recall, because it becomes harder and harder to find a relevant document as more and more of the low-hanging fruit gets picked. So, what recall should you aim for in an actual case? This is where you need to discuss the issue of proportionality with the court. Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.” The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall. People often quote another metric, the F₁ score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

$F_1 = \frac{1}{\frac{1}{2}(\frac{1}{\text{recall}}+\frac{1}{\text{precision}})} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The F₁ score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F₁ score are only possible if both the precision and recall are large.

Before explaining why the F₁ score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense. Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps). The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic). The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%. Which doctor is better at identifying patients in need of additional testing? The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4). We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable. Since the F₁ score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F₁ scores for the two doctors. To compute F₁ we need to compute the recall, which means that we need to know how many of the patients actually have cancer. If 5 have cancer, the F₁ scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher. If 15 have cancer, the F₁ scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher. Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F₁ score.
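The F₁ values quoted for the two doctors can be reproduced with a few lines. This is only a sketch using the made-up numbers from the example; the function and variable names are hypothetical.

```python
def f1_score(true_positives, flagged, actual_positives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_positives / flagged
    recall = true_positives / actual_positives
    return 2 * precision * recall / (precision + recall)


# (cancers found, patients sent for further testing) for each doctor.
doctors = {"doctor 1": (3, 50), "doctor 2": (4, 70)}

results = {}
for total_cancers in (5, 15):
    for name, (found, flagged) in doctors.items():
        results[(name, total_cancers)] = f1_score(found, flagged, total_cancers)
        print(f"{total_cancers:2d} cancers, {name}: "
              f"F1 = {results[(name, total_cancers)]:.4f}")
```

The doctors' decisions are identical in both scenarios. Only the number of cancers that exist changes, and that alone is enough to flip which doctor has the higher F₁ score.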

The harmonic mean is commonly used to combine rates. For example, you should be able to convince yourself that the appropriate way to compute the average MPG fuel efficiency rating for a fleet of cars is to take the harmonic mean (not the arithmetic mean) of the MPG values of the individual cars. But, the F₁ score is the harmonic mean of two rates having different meanings, not the same rate measured for two different objects. It’s like adding the length of your foot to the length of your arm. They are both lengths, but does the result from adding them really make any sense? A 10% change in the length of your arm would have much more impact than a 10% change in the length of your foot, so maybe you should add two times the length of your foot to your arm. Or, maybe add three times the length of your foot to your arm. The relative weighting of your foot and arm lengths is rather arbitrary since the sum you are calculating doesn’t have any specific use that could nail down the appropriate weighting. The weighting of precision vs. recall in the F₁ score is similarly arbitrary. If you want to weight the recall more heavily, there is a metric called F₂ that does that. If you want to weight the precision more heavily, F₀.₅ does that. In fact, there is a whole spectrum of F measures offering any weighting you want (you can find the formula in Wikipedia). In our example of doctors screening for cancer, what is the right weighting to appropriately balance the potential loss of life by missing a cancer (low recall) against the cost and suffering of doing additional testing on many more patients that don’t have cancer (higher recall obtained at the expense of lower precision)? I don’t know the answer, but it is almost certainly not F₁. Likewise, what is the appropriate weighting for predictive coding? Probably not F₁.
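To see how much the choice of weighting matters, here is a sketch that applies the standard F-beta formula, (1 + β²) × precision × recall / (β² × precision + recall), to the two doctors from the earlier example, assuming 5 of the patients actually have cancer. The variable names are hypothetical.

```python
def f_beta(precision, recall, beta):
    """General F measure; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


total_cancers = 5  # assumed number of patients who actually have cancer

# (precision, recall) for each doctor, from the made-up screening example.
doctor_1 = (3 / 50, 3 / total_cancers)
doctor_2 = (4 / 70, 4 / total_cancers)

scores = {}
for beta in (0.5, 1, 2):
    scores[beta] = (f_beta(*doctor_1, beta), f_beta(*doctor_2, beta))
    better = "doctor 1" if scores[beta][0] > scores[beta][1] else "doctor 2"
    print(f"beta={beta}: {scores[beta][0]:.4f} vs {scores[beta][1]:.4f} "
          f"-> {better} scores higher")
```

With the same underlying results, F₀.₅ and F₁ favor the first doctor while F₂ favors the second, so the "winner" is decided by the weighting rather than by anything the doctors did.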

Why did we turn to the F₁ score when comparing doctors doing cancer screenings? We did it because we had two different recall values for the doctors, so we couldn’t compare precision values directly. We used the F₁ score to adjust for the tradeoff between precision and recall, but we did so with a weighting that was arbitrary (sometimes pronounced “wrong”). Why were we stuck with two different recall values for the two doctors? Unlike a predictive coding algorithm, we can’t ask a doctor to rank a group of patients based on how likely he/she thinks it is that each of them has cancer. The doctor either feels a lump, or he/she doesn’t. We might expand the doctor’s options to include “maybe” in addition to “yes” and “no,” but we can’t expect the doctor to say that one patient is an 85.39 score for cancer while another is a 79.82 so we can get a definite ordering. We don’t have that problem (normally) when we want to compare predictive coding algorithms: we can choose whatever recall level we are interested in and measure the precision of all algorithms at that recall, so we can compare apples to apples instead of apples to oranges.

Furthermore, a doctor’s ability to choose an appropriate threshold for sending people for additional testing is part of the measure of his/her ability, so we should allow him/her to decide how many people to send for additional testing, not just which people, and measure whether his/her choice strikes the right balance to achieve the best outcomes, which necessitates comparing different levels of recall for different doctors. In predictive coding it is not the algorithm’s job to decide when we should stop looking for additional relevant documents — that is dictated by proportionality. If the litigation is over a relatively small amount of money, a modest target recall may be accepted to keep review costs reasonable relative to the amount of money at stake. If a great deal of money is at stake, pushing for a high recall that will require reviewing a lot of irrelevant documents may be warranted. The point is that the appropriate tradeoff between low recall with high precision and high recall with lower precision depends on the economics of the case, so it cannot be captured by a statistic with fixed (arbitrary) weight like the F₁ score.

Here is a graph of the F₁ score for the three algorithms we’ve been looking at:

Remember that the F₁ score can only be large if both the recall and precision are large. At the left edge of the chart the recall is low, so the F₁ score is small. At the right edge the recall is high but the precision is typically low, so the F₁ score is small. Note that algorithm 1 has its maximum F₁ score of 0.264 at 62% recall, while algorithm 3 has its maximum F₁ score of 0.242 at 44% recall. Comparing maximum F₁ scores to identify the best algorithm is really an apples to oranges comparison (comparing values at different recall levels), and in this case it would lead you to conclude that algorithm 3 is the second best algorithm when we know that it is by far the worst algorithm at high recall. Of course, you might retort that algorithms should be compared by comparing F₁ scores at the same recall level instead of comparing maximum F₁ scores, but the F₁ score would really serve no purpose in that case — we could just compare precision values.

In summary, recall and precision are metrics that relate very directly to important aspects of document review: the need to identify a substantial portion of the relevant documents (recall), and the need to keep costs down by avoiding review of irrelevant documents (precision). A predictive coding algorithm orders the document list to put the documents that are expected to have the best chance of being relevant at the top. As the reviewer works his/her way down the document list, recall will increase (more relevant documents found), but precision will typically decrease (an increasing percentage of documents are not relevant). The F₁ score attempts to combine the precision and recall in a way that allows comparisons at different levels of recall by balancing increasing recall against decreasing precision, but it does so with a weighting between the two quantities that is arbitrary rather than reflecting the economics of the case. It is better to compare algorithm performance by comparing precision at the same level of recall, with the recall chosen to be reasonable for a case.

Note: You can read more about performance measures here, and there is an article on an alternative to F₁ that is more appropriate for e-discovery here.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.
