Should proportionality arguments allow producing parties to get away with poor productions simply because they wasted a lot of effort due to an extremely bad algorithm? This article examines one such bad algorithm that has been used in major review platforms, and shows that it could be made vastly more effective with a very minor tweak. Are lawyers who use platforms lacking the tweak committing malpractice by doing so?
Last year I was moderating a panel on TAR (predictive coding) and I asked the audience what recall level they normally aim for when using TAR. An attendee responded that it was a bad question because proportionality only required a reasonable effort. Much of the audience expressed agreement. This should concern everyone. If quality of result (e.g., achieving a certain level of recall) is the goal, the requesting party really has no business asking how the result was achieved–any effort wasted by choosing a bad algorithm is born by the producing party. On the other hand, if the target is expenditure of a certain amount of effort, doesn’t the requesting party have the right to know and object if the producing party has chosen a methodology that is extremely inefficient?
The algorithm I’ll be picking on today is a classifier called 1-nearest neighbor, or 1-NN. You may be using it without ever having heard that name, so pay attention to my description of it and see if it sounds familiar. To predict whether a document is relevant, 1-NN finds the single most similar training document and predicts the relevance of the unreviewed document to be the same. If a relevance score is desired instead of a yes/no relevance prediction, the relevance score can be taken to be the similarity value if the most similar training document is relevant, and it can be taken to be the negative of the similarity value if the most similar training document is non-relevant. Here is a precision-recall curve for the 1-NN algorithm used in a TAR 1.0 workflow trained with randomly-selected documents:
The precision falls off a cliff above 60% recall. This is not due to inadequate training–the cliff shown above will not go away no matter how much training data you add. To understand the implications, realize that if you sort the documents by relevance score and review from the top down until you reach the desired level of recall, 1/P at that recall tells the average number of documents you’ll review for each relevant document you find. At 60% recall, precision is 67%, so you’ll review 1.5 documents (1/0.67 = 1.5) for each relevant document you find. There is some effort wasted in reviewing those 0.5 non-relevant documents for each relevant document you find, but it’s not too bad. If you keep reviewing documents until you reach 70% recall, things get much worse. Precision drops to about 8%, so you’ll encounter so many non-relevant documents after you get past 60% recall that you’ll end up reviewing 12.5 documents for each relevant document you find. You would surely be tempted to argue that proportionality says you should be able to stop at 60% recall because the small gain in result quality of going from 60% recall to 70% recall would cost nearly ten times as much review effort. But does it really have to be so hard to get to 70% recall?
It’s very easy to come up with an algorithm that can reach higher recall without so much review effort once you understand why the performance cliff occurs. When you sort the documents by relevance score with 1-NN, the documents where the most similar training document is relevant will be at the top of the list. The performance cliff occurs when you start digging into the documents where the most similar training document is non-relevant. The 1-NN classifier does a terrible job of determining which of those documents has the best chance of being relevant because it ignores valuable information that is available. Consider two documents, X and Y, that both have a non-relevant training document as the most similar training document, but document X has a relevant training document as the second most similar training document and document Y has a non-relevant training document as the second most similar. We would expect X to have a better chance of being relevant than Y, all else being equal, but 1-NN cannot distinguish between the two because it pays no attention to the second most similar training document. Here is the result for 2-NN, which takes the two most similar training document into account:
Notice that 2-NN easily reaches 70% recall (1/P is 1.6 instead of 12.5), but it does have a performance cliff of its own at a higher level of recall because it fails to make use of information about the third most similar training document. If we utilize information about the 40 most similar training documents we get much better performance as shown by the solid lines here:
It was the presence of non-relevant training documents that tripped up the 1-NN algorithm because the non-relevant training document effectively hid the existence of evidence (similar training documents that were relevant) that a document might be relevant, so you might think the performance cliff could be avoided by omitting non-relevant documents from the training. The result of doing that is shown with dashed lines in the figure above. Omitting non-relevant training documents does help 1-NN at high recall, though it is still far worse than 40-NN with the non-relevant training documents include (omitting the non-relevant training documents actually harms 40-NN, as shown by the red dashed line). A workflow that focuses on reviewing documents that are likely to be relevant, such as TAR 2.0, rather than training with random documents, will be less impacted by 1-NN’s shortcomings, but why would you ever suffer the poor performance of 1-NN when 40-NN requires such a minimal modification of the algorithm?
You might wonder whether the performance cliff shown above is just an anomaly. Here are precision-recall curves for several additional categorization tasks with 1-NN on the left and 40-NN on the right.
Sometimes the 1-NN performance cliff occurs at high enough recall to allow a decent production, but sometimes it keeps you from finding even half of the relevant documents. Should a court accept less than 50% recall when the most trivial tweak to the algorithm could have achieved much higher recall with roughly the same amount of document review?
Of course, there are many factors beyond the quality of the classifier, such as the choice of TAR 1.0 (SPL and SAL), TAR 2.0 (CAL), or TAR 3.0 workflows, that impact the efficiency of the process. The research by Grossman and Cormack that courts have relied upon to justify the use of TAR because it reaches recall that is comparable to or better than an exhaustive human review is based on CAL (TAR 2.0) with good classifiers, whereas some popular software uses TAR 1.0 (less efficient if documents will be reviewed before production) and poor classifiers such as 1-NN. If the producing party vows to reach high recall and bears the cost of choosing bad software and/or processes to achieve that, there isn’t much for the requesting party to complain about (though the producing party could have a bone to pick with an attorney or service provider who recommended an inefficient approach). On the other hand, if the producing party argues that low recall should be tolerated because decent recall would require too much effort, it seems that asking whether the algorithms used are unnecessarily inefficient would be appropriate.