You may already be familiar with the precision-recall curve, which describes the performance of a predictive coding system. Unfortunately, the precision-recall curve doesn’t (normally) display any information about the cost of training the system, so it isn’t convenient when you want to compare the effectiveness of different training methodologies. This article looks at the gain curve, which is better suited for that purpose.
The gain curve shows how the recall achieved depends on the number of documents reviewed (slight caveat to that at the end of the article). Recall is the percentage of all relevant documents that have been found. High recall is important for defensibility. Here is an example of a gain curve:
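To make the definition concrete, here is a minimal sketch of the recall calculation. The function name and the example counts are my own, chosen only for illustration:

```python
def recall(found_relevant, total_relevant):
    """Recall: fraction of all relevant documents found so far."""
    return found_relevant / total_relevant

# e.g., if the collection contains 2,000 relevant documents and the
# review has surfaced 1,500 of them, recall is 75%
print(recall(1500, 2000))  # 0.75
```

The gain curve is simply this quantity plotted against the number of documents reviewed.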
The first 12,000 documents reviewed in this example are randomly selected documents used to train the system. Prevalence is very low in this case (0.32%), so finding relevant documents using random selection is hard. The system needs to be exposed to a large enough number of relevant training documents for it to learn what they look like so it can make good predictions for the relevance of the remaining documents.
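The arithmetic behind "finding relevant documents using random selection is hard" is worth spelling out. Using the prevalence figure from the article, a random training sample yields only a few dozen relevant examples (the variable names here are mine):

```python
prevalence = 0.0032      # 0.32% of the collection is relevant
training_docs = 12_000   # randomly selected training documents

# Expected number of relevant documents in the random training sample
expected_relevant = prevalence * training_docs
print(round(expected_relevant))  # roughly 38 relevant examples
```

So even 12,000 randomly chosen documents give the system fewer than 40 relevant examples to learn from, which is why low prevalence makes random training so expensive.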
After the 12,000 training documents are reviewed the system orders the remaining documents to put the ones that are most likely to be relevant (based on patterns detected during training) at the top of the list. To distinguish the training phase from the review phase I’ve shown the training phase as a solid line and review phase as a dashed line. Review of the remaining documents starts at the top of the sorted list. The gain curve is very steep at the beginning of the review phase because most of the documents being reviewed are relevant, so they have a big impact on recall. As the review progresses the gain curve becomes less steep because you end up reviewing documents that are less likely to be relevant. Review proceeds until a desired level of recall, such as 75% (the horizontal dotted line), is achieved. The goal is to find the system and workflow that achieves the recall target at the lowest cost (i.e., the one that crosses the dotted line farthest to the left, with some caveats below).
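The review-phase portion of the curve can be sketched in a few lines: sort the remaining documents by predicted relevance score, then accumulate recall as each one is reviewed. The scores and labels below are made up purely for illustration, not taken from the example above:

```python
def gain_curve(scores, is_relevant, total_relevant):
    """Cumulative recall after each document in score-ranked order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = 0
    curve = []
    for i in order:
        found += is_relevant[i]          # 1 if the document is relevant
        curve.append(found / total_relevant)
    return curve

scores      = [0.9, 0.2, 0.8, 0.1, 0.7]  # predicted relevance scores
is_relevant = [1,   0,   1,   0,   0]    # reviewer's actual tags
print(gain_curve(scores, is_relevant, total_relevant=2))
# [0.5, 1.0, 1.0, 1.0, 1.0]
```

Because the relevant documents sit at the top of the ranked list, recall climbs steeply at first and then flattens, which is exactly the shape the dashed part of the curve shows.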
What is the impact of using the same system with a larger or smaller number of randomly selected training documents? This figure shows the gain curves for 9,000 and 15,000 training documents in addition to the 12,000 training document curve seen earlier:
If the goal is to reach 75% recall, 12,000 is the most efficient option among the three considered because it crosses the horizontal dotted line with the least document review. If the target were a lower level of recall, such as 70%, 9,000 training documents would be a better choice. A larger number of training documents usually leads to better predictions (the gain curve stays steep longer during the review phase), but there is a point where the improvement in the predictions isn't worth the cost of reviewing additional training documents.
The discussion above assumed that the cost of reviewing a document during the training phase is the same as the cost of reviewing a document during the review phase. That will not be the case if expensive subject matter experts are used to review the training documents and low-cost contract reviewers are used for the review phase. In that situation, the optimal result is less straightforward to identify from the gain curve.
In some situations it may be possible to produce documents without reviewing them if there is no concern about disclosing privileged documents (because there are none or because they are expected to be easy to identify by looking at things like the sender/recipient email address) or non-relevant documents (because there is no concern about them containing trade secrets or evidence of bad acts not covered by the current litigation). When it is okay to produce documents without reviewing them, the document review associated with the dashed part of the curve can be eliminated in whole or in part. For example, documents predicted to be relevant with high confidence may be produced without review (unless they are identified as potentially privileged), whereas documents with a lower likelihood of being relevant might be reviewed to avoid disclosing too many non-relevant documents. Again, the gain curve would not show the optimal choice in a direct way; you would need to balance the potential harm (even if small) of producing non-relevant documents against the cost of additional training.
The predictive coding process described in this article, random training documents followed by review (with no additional learning by the algorithm), is sometimes known as Simple Passive Learning (SPL), which is one example of a TAR 1.0 workflow. To determine the optimal point to switch from training to review with TAR 1.0, a random set of documents known as a control set is reviewed and used to monitor learning progress by comparing the predictions for the control set documents to their actual relevance tags. Other workflows and analysis of their efficiency via gain curves will be the subject of my next article.
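The control-set idea can also be sketched briefly. The control set is a random sample reviewed up front; as training proceeds, the system's predictions for those documents are compared to their known tags to estimate how much of the relevant material the current model would find. The function and data below are hypothetical, just to show the comparison:

```python
def estimated_recall(predicted_relevant, actual_relevant):
    """Recall estimated from a control set: of the documents the
    reviewers tagged relevant, what fraction does the model also
    predict relevant at the current cutoff?"""
    hits = sum(1 for p, a in zip(predicted_relevant, actual_relevant) if p and a)
    total = sum(actual_relevant)
    return hits / total if total else 0.0

predicted = [1, 1, 0, 0, 1]   # model's prediction for each control doc
actual    = [1, 0, 1, 0, 1]   # reviewer's actual relevance tag
print(estimated_recall(predicted, actual))  # 2 of 3 relevant found
```

When this estimate stops improving as more training documents are added, that is a signal the optimal point to switch from training to review has been reached.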