Predictive coding tools are supposed to learn to separate relevant documents from non-relevant documents by analyzing examples in a training set provided by a human reviewer. If you tagged a training document as relevant and asked the software to predict whether an exact duplicate or a near-dupe of the very same document was relevant, how annoyed would you be if its answer was “no”? For many (but not all) algorithms, that’s entirely possible.
To see how that can happen, we’ll use the linear support vector machine (SVM) as an example, but the reasoning is similar for many classification algorithms. The documents are viewed as points in a high-dimensional space, where the position of a document is determined by the frequencies of its words (or whatever other features are being analyzed). This figure conveys the basic idea in two dimensions:
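To make that representation concrete, here is a minimal sketch of turning documents into word-frequency vectors. The two-document corpus and the whitespace tokenizer are hypothetical simplifications; real tools use much richer feature extraction.

```python
from collections import Counter

# A hypothetical mini-corpus; each document becomes a point whose
# coordinates are word counts over a shared vocabulary.
docs = [
    "merger agreement draft attached",
    "lunch order for the team",
]

# Build a fixed vocabulary so every document maps to the same dimensions.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Map a document to a point in word-frequency space."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
```

Each document is now a point in a 9-dimensional space (one dimension per vocabulary word); with a realistic corpus the space would have tens of thousands of dimensions.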
The linear SVM algorithm tries to find the best possible line (technically, a hyperplane when you have more than two dimensions) to separate relevant documents from non-relevant documents for the training set. The details of how it does that don’t matter for this discussion. The important point is that it will often be impossible to separate the relevant documents from the non-relevant documents perfectly with a line (or hyperplane) — there will be some training documents that end up on the wrong side of the line, like the orange dot toward the lower left corner of the figure above.
When the linear SVM model makes predictions for the remainder of the document set, those predictions are based solely on the line (hyperplane) that the algorithm fit to the training data. When you ask it to make a prediction for a document, the fact that the document is extremely similar, or even identical, to one of the training documents is not taken into account. All that matters for the prediction is where the document lies relative to the line. So, if orange dots represent relevant documents, the prediction for a document that is a dupe or near-dupe of the outlier orange document toward the lower left corner of the figure above will be that it is not relevant (the near-dupe will be at nearly the same location as the training document in the figure).

This is not limited to linear SVM — any algorithm that cannot fit the training data perfectly will not be able to make correct predictions for near-dupes of the training documents that it couldn’t fit. Effectively, the algorithm overrides the human reviewer’s decision for documents similar to the outlier. That’s perfectly reasonable when an algorithm is applied to noisy data, but is it appropriate for e-discovery?
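This behavior can be reproduced with a small sketch: a linear SVM trained by sub-gradient descent on the hinge loss, on toy 2-D data containing one relevant “outlier” inside the non-relevant cluster. The data, hyperparameters, and function names are all illustrative, not any particular product’s implementation.

```python
import numpy as np

# Toy 2-D "word frequency" coordinates. Label +1 = relevant, -1 = not relevant.
X = np.array([
    [8.0, 8.0], [7.5, 8.5], [8.5, 7.0], [9.0, 8.0],  # relevant cluster
    [1.0, 1.0],                                       # relevant OUTLIER
    [2.0, 2.0], [1.5, 2.5], [2.5, 1.5], [3.0, 2.0],  # non-relevant cluster
])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=2000):
    """Primal linear SVM via stochastic sub-gradient descent on the hinge loss."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:     # margin violated: hinge gradient step
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                             # margin satisfied: regularize only
                w = (1 - lr * lam) * w
    return w, b

w, b = train_linear_svm(X, y)

def predict(x):
    """Prediction depends only on which side of the hyperplane x falls."""
    return 1 if x @ w + b > 0 else -1

# The outlier was tagged relevant in training, yet the model predicts
# non-relevant for it -- and for a near-dupe sitting almost on top of it.
outlier_prediction = predict(np.array([1.0, 1.0]))
near_dupe_prediction = predict(np.array([1.1, 0.9]))
```

No line can put the outlier on the relevant side without also misclassifying several non-relevant documents, so the fitted hyperplane sacrifices the outlier, and every near-dupe of it inherits the wrong prediction.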
Of course, the outlier training document may have been categorized incorrectly by the human reviewer (it truly is “noise”), and the algorithm’s decision to ignore it in that case would give us a better result. On the other hand, the human reviewer may have categorized the document correctly — there is sometimes more subtlety to understanding documents than can be captured by looking at them as points in word-frequency space.
What does this mean for defensibility? Presumably, predictive coding users are testing the results to ensure that the number of mistakes (relevant documents that were missed) is not too large. If the number of mistakes is small, is that sufficient even if some of the mistakes are embarrassingly bad?
It is possible to paper over this problem. After training the classification algorithm, have it make predictions for the documents in the training set. Identify the relevant training documents that the algorithm predicts (incorrectly) are not relevant, and flag any documents in the full population that are near-dupes of them (or have high conceptual similarity to them) for human review.
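That remedy can be sketched in a few lines. Everything here is hypothetical scaffolding: `predict` stands in for the trained classifier, documents are plain word-count vectors, and cosine similarity with an arbitrary 0.8 cutoff stands in for whatever near-dupe or conceptual-similarity measure a real tool would use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_for_review(train_vecs, train_labels, predict, population, threshold=0.8):
    """Find relevant training docs the model gets wrong (false negatives),
    then flag population docs highly similar to any of them for human review."""
    misfits = [v for v, label in zip(train_vecs, train_labels)
               if label == 1 and predict(v) != 1]
    return [i for i, doc in enumerate(population)
            if any(cosine(doc, m) >= threshold for m in misfits)]

# Toy check: one relevant training doc ([1, 0]) that a deliberately wrong
# model predicts non-relevant; the similar population doc at index 0 gets flagged.
flags = flag_for_review([[1, 0], [0, 1]], [1, -1],
                        lambda v: -1, [[1.0, 0.1], [0.0, 1.0]])
```

The flagged indices would then be routed to human reviewers instead of trusting the model’s predictions for those documents.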
Note: Some have objected to my use of the words “paper over” in the previous paragraph. My reason for putting it that way is that the underlying reason the training document was categorized incorrectly is not being addressed, and any arbitrary cutoff (e.g. manual review of documents that are at least 80% near-dupes) is going to leave some documents (e.g. 79% near-dupes) uncorrected.