There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding. This article explains why disclosing seed sets may provide far less transparency than people think.
The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced. On that view, it is reasonable for the requesting party to ask for the seed set so it can be assured that it will get what it wanted, much as it might ask for a keyword search query to be disclosed.
Opponents of disclosure raise several objections. Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling). Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated. If the requesting party is a competitor, the non-relevant training documents may reveal information that helps it compete. Even if the producing party is not concerned about any of these issues, it may be reluctant to disclose the seed set for fear of establishing a precedent that it may not want to be stuck with in future cases with different circumstances.
Other people are far more qualified to debate the legal and strategic issues than I am. Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think. Some reasons why it does not:
- A seed set is much harder to audit than a keyword query. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies. A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).” Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”? I doubt it.
- You cannot tell whether the seed set is sufficient if you don’t have access to the full document population. There could be substantial pockets of important documents that are not represented in the seed set. How would you know? The producing party has access to the full population, so it can do statistical sampling to measure the quality (based on the number of relevant documents, not their importance) of the predictions the training set will produce. The requesting party cannot do that; it has no way of assessing the adequacy of the training set other than wild guessing.
- You cannot tell whether the seed set is biased just by looking at it. Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under- or over-represented? If training documents were selected by searching for “shred AND Friday,” the system would see both words in all (or most) of the relevant documents and would think both words are equally good indicators of relevance. Would you notice that all the relevant training documents happen to contain the word “Friday”? I doubt it.
- Suppose you see an important document in the seed set that was correctly tagged as being relevant. Can you rest assured that similar documents will be produced? Maybe not. Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document. I described how that could happen in this article. How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
- Poor training doesn’t necessarily mean that relevant documents will be missed. With keyword search, a relevant document that fails to match the query will be missed, so ensuring that the query is good is important. Most predictive coding systems, by contrast, generate a relevance score for each document, not just the binary yes/no result that a search query gives. Whether the predictive coding system produces a particular relevant document doesn’t depend solely on the training set; the producing party must also choose a cutoff point in the ranked document list that determines which documents will be produced. A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough. If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall. Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
- One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents. Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population? Is the requesting party going to complain that there is no smoking gun in the training set? The producing party can simply respond that there are no smoking guns in the full population.
- The seed set may have virtually no impact on the final result. To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion). If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training. In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article). On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used. With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions. This is iterated many times. So all documents that are reviewed are used as training documents. The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way. Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
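The point above about relevance score cutoffs can be made concrete with a small sketch. Everything in it is made up for illustration: the ten-document population, the two score lists, and the function name `recall_and_review_cost` are assumptions, not output from any real predictive coding system. It only shows how a poorly trained ranking can reach the same recall as a well-trained one if the cutoff is lowered, at the price of more review.

```python
def recall_and_review_cost(scores, labels, cutoff):
    """Return (recall, documents_reviewed) when every document scoring
    at or above `cutoff` is reviewed and the relevant ones are produced."""
    total_relevant = sum(labels)
    reviewed = [label for score, label in zip(scores, labels) if score >= cutoff]
    found = sum(reviewed)
    recall = found / total_relevant if total_relevant else 0.0
    return recall, len(reviewed)


# Hypothetical toy population: 1 = relevant, 0 = non-relevant.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical scores from a well-trained system: relevant documents rank first.
good_scores = [0.95, 0.90, 0.85, 0.80, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]

# Hypothetical scores from a poorly trained system: relevant documents are
# scattered among non-relevant ones.
poor_scores = [0.90, 0.60, 0.45, 0.25, 0.80, 0.70, 0.50, 0.30, 0.10, 0.05]

print(recall_and_review_cost(good_scores, labels, cutoff=0.75))  # (1.0, 4)
print(recall_and_review_cost(poor_scores, labels, cutoff=0.75))  # (0.25, 2)
print(recall_and_review_cost(poor_scores, labels, cutoff=0.25))  # (1.0, 8)
```

In this toy case the poorly trained system reaches full recall only after the cutoff is dropped to 0.25, which doubles the review from four documents to eight. That extra cost falls on the producing party, not on the completeness of the production.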
Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand. It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly. Recall isn’t the whole story, since it measures the number of relevant documents found, not their importance. It therefore also makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found. The point is that you should judge the production by analyzing the system’s output, not the training data that was input.
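As one illustration of what asking how recall was measured might cover, here is a minimal sketch of estimating recall from a random sample. It assumes the producing party drew a simple random sample from the full population, had it reviewed by humans, and recorded for each sampled document whether it was relevant and whether it ended up in the production. The Wilson score interval is my own choice for the uncertainty estimate, not something any particular protocol requires, and the names `wilson_interval` and `estimate_recall` and the sample counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def estimate_recall(sample):
    """`sample` is a list of (is_relevant, was_produced) pairs for randomly
    sampled, human-reviewed documents. Recall is estimated only from the
    sampled documents that reviewers judged relevant."""
    produced_flags = [was_produced for is_relevant, was_produced in sample if is_relevant]
    if not produced_flags:
        raise ValueError("sample contains no relevant documents")
    found = sum(produced_flags)
    return found / len(produced_flags), wilson_interval(found, len(produced_flags))


# Hypothetical sample of 1,000 documents: 100 relevant, of which 80 were produced.
sample = [(True, True)] * 80 + [(True, False)] * 20 + [(False, False)] * 900
point, (low, high) = estimate_recall(sample)
print(f"recall estimate {point:.2f}, interval {low:.2f} to {high:.2f}")
```

Note that the width of the interval depends on the number of relevant documents in the sample, not on the sample size as a whole, so a question worth asking is how many relevant documents the recall estimate was actually based on.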