I came across this article today, and I think it is important for everyone to be aware of it. It says that SSDs (solid-state drives), which are becoming increasingly popular for computer storage due to their fast access times and ability to withstand being dropped, “need consistent access to a power source in order for them to not lose data over time. There are a number of factors that influence the non-powered retention period that an SSD has before potential data loss. These factors include amount of use the drive has already experienced, the temperature of the storage environment, and the materials that comprise the memory chips in the drive.” Keep that risk in mind if computers are powered down during a legal hold. The article gives details about how long the drives are supposed to retain data while powered down.
Highlights from the NorCal eDiscovery & Information Governance Retreat 2015
The NorCal eDiscovery & Information Governance Retreat is part of the series of retreats
held by Chris La Cour’s company, Ing3nious. This one was held at the Meritage Resort & Spa in Napa, California. As always, the venue was beautiful, the food was good, and the talks were informative. You can find all of my photos from the retreat and the nearby Skyline Wilderness Park here. My notes below offer a few highlights from the sessions I attended. There were often two sessions occurring simultaneously, so I couldn’t attend everything.
Keynote: Only the Paranoid Survive: What eDiscovery Needs to Survive the Big Data Tsunami
The keynote was by Alex Ponce de Leon from Google. He made the point that there is a
difference between Big Data, which can be analyzed, and “lots and lots of data.” For information governance, lots of data is a problem. The excitement over Big Data (he showed this graph and this one) is turning people into digital hoarders–they are saving things that will never be useful, which causes problems for ediscovery. He mentioned that DuPont analyzed the documents they had to review for a case and found that 50% of them should have been discarded according to their retention policy, resulting in $12 million in document review that wouldn’t have been necessary if the retention policy had been followed (this article discusses it). Legal and ediscovery people need to take the lead in getting companies to not keep everything.
Establishing In-House eDiscovery Playbooks, Procedures, Tool Selection, and Implementation
There was some discussion about corporations acquiring e-discovery tools and whether that causes concern for outside counsel, who must sign off on what is being done. Ben Robbins of LinkedIn said they haven’t had significant problems with that. The panel emphasized the importance of documenting procedures and making sure that different types of matters were addressed individually.
Cybersecurity…it’s what’s for dinner. So, what’s the recipe and who’s the head chef?
I couldn’t attend this one.
A Look Back on Model eDiscovery Orders
Judge Rader’s e-discovery model order (here is a related article), which limits discovery to five custodians and five search terms per custodian, was discussed. It was motivated by a need to curtail patent trolls in the Eastern District of Texas who were using ediscovery costs as a weapon. It was mentioned that discovery of backups may become more feasible as people move away from using tape for backups. Producing reports rather than raw databases was discussed, with the point being made that standard reports are usually okay, but custom reports often don’t match the requesting party’s expectations and cause conflicts. Model orders go out the window when dealing with government agencies–many want everything.
Information Governance and Security: Keeping Security in Sight
I couldn’t attend this one.
How to Leverage Information Governance for Better eDiscovery
I couldn’t attend this one.
Avoiding Land Mines in TAR
I was on this panel, so I didn’t take notes.
Managing BYOC/D and Wearables in International eDiscovery and Investigations
I couldn’t attend this one.
Social Media – eDiscovery’s “friend”?
An employee may see a social media account as personal, but it must be preserved (possibly
for years). Need to remind the employee of the hold. Don’t friend represented opposition, but okay to friend witnesses if you are up front about why. Lawyers can friend judges, but not if they have a case before them. You should read your judge’s tweets to see if there is a sign of bias. Getting data from a social media company is difficult. Look to see if jurors are tweeting about the case.
Inside the Threat Matrix: Cyber Security Risks, Incident Response, and the Discovery Impact
I couldn’t attend this one.
Resolving the Transparency Paradox
TAR 1.0 has a lot of foreign concepts like “stabilization” (optimal training), whereas TAR 2.0 (continuous active learning) is more like traditional review. Hal Marcus of Recommind mentioned that when he surveyed the audience at another event, many said they had used predictive coding but few disclosed doing so. The panel discussed allowing the requesting party to provide a seed set to make them feel better about using TAR, or raising the possibility of using TAR early on to see if there is pushback. The Coalition of Technology Resources for Lawyers has a database of case law on predictive coding that was mentioned.
Judicial Panel
Judges now get ediscovery. They see a lack of communication. Responding parties object to everything. Judges are unlikely to interfere when the parties have a thought-out ediscovery plan. Inside counsel are taking more control to reduce costs. The RAND study “Where the Money Goes” was mentioned. Regarding cost shifting, an attorney may choose to pay to have more control.
The Single Seed Hypothesis
This article shows that it is often possible to find the vast majority of the relevant documents in a collection by starting with a single relevant seed document and using continuous active learning (CAL). This has important implications for making review efficient, making predictive coding practical for smaller document sets, and putting eyes on relevant documents as early as possible, perhaps leading to settlement before too much is spent on document review. It also means that meticulously constructing seed sets and arguing about them with opposing counsel is probably a waste of time if CAL is used.
In one of the sessions at the ACEDS 2014 conference, Bill Speros advocated using judgmental sampling (e.g., keyword search) to find relevant documents as training examples for predictive coding rather than using random sampling, which will not find many relevant documents if prevalence is low. While I agreed with the premise, I thought he should have warned about the possibility of bias and the fact that any probability estimates or relevance scores generated by the classification algorithm could be very distorted. To illustrate the problem of bias, I decided to do an experiment when I got home: start with a single relevant training document and see how many of the relevant documents the system could find. I expected it to find only a subset of the relevant documents that were similar to the single seed document, showing that a seed set lacking good coverage of the various relevant concepts for the case (i.e., a biased seed set) could miss pockets of relevant documents. I would have found what I expected, except that I started with continuous active learning (CAL) instead of simple passive learning (SPL). In other words, when I pulled a batch of documents predicted to be most likely relevant and reviewed them, I allowed the system to learn from the tags I applied so it could make better predictions when I pulled the next batch (that approach happened to be more convenient with our software at the time). What I found was that, as I pulled more and more batches of documents predicted to be relevant and allowed the system to update its predictions as I went, it continued to find relevant documents until I had well over 90% recall. It never “got stuck on an island” where it couldn’t reach the remaining relevant documents because they were too different from the documents it had already seen. It has taken me a year to get around to writing about this because I wanted to do more testing.
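The CAL procedure I followed can be sketched in a few lines of code. This is a toy version with a naive word-count scorer, purely for illustration–real predictive coding software uses far more sophisticated classifiers, and all of the names here are mine:

```python
# Toy sketch of continuous active learning (CAL) from a single relevant
# seed. The "classifier" is a simple word-evidence scorer; it is only
# meant to illustrate the batch/review/learn loop, not real software.

def score(doc_words, rel_counts, nonrel_counts):
    """Score a document by summing per-word evidence of relevance."""
    return sum(rel_counts.get(w, 0) - nonrel_counts.get(w, 0)
               for w in doc_words)

def cal(docs, labels, seed, batch_size=5):
    """docs: list of word sets; labels: true relevance (standing in for
    the human reviewer); seed: index of one known-relevant document.
    Returns the relevant documents in the order they were found."""
    rel_counts, nonrel_counts = {}, {}
    reviewed = {seed}
    for w in docs[seed]:
        rel_counts[w] = rel_counts.get(w, 0) + 1
    found = [seed]
    while len(reviewed) < len(docs):  # toy: review everything
        # Pull the unreviewed docs currently predicted most likely relevant.
        candidates = [i for i in range(len(docs)) if i not in reviewed]
        candidates.sort(key=lambda i: score(docs[i], rel_counts, nonrel_counts),
                        reverse=True)
        for i in candidates[:batch_size]:
            reviewed.add(i)
            counts = rel_counts if labels[i] else nonrel_counts
            for w in docs[i]:  # learn from every reviewed doc, relevant or not
                counts[w] = counts.get(w, 0) + 1
            if labels[i]:
                found.append(i)
    return found
```

The key feature is that every reviewed document, relevant or not, updates the model before the next batch is pulled–which is exactly what SPL does not do.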
Weak Seed
Could the single seed document I picked have been unusually potent? After achieving 90% recall I randomly selected one of the relevant documents the system didn’t find and started over using that document as the single seed document. If the system had so much trouble finding that document, it must surely be rather different from the other relevant documents and would serve as a very weak seed. I was able to hit 90% recall with CAL even with the single weak seed. The figure below compares the single random seed (left) to the single weak seed (right) for passive learning (top row) and continuous active learning (bottom row). Each bar represents a batch of documents, with the number of documents that were actually relevant represented in orange. Click the figure for a larger view.
The weak seed is seen to be quite ineffective with passive learning (upper right graph). It finds a modest number of relevant documents during the first three batches (no learning between batches). After three batches there are no relevant documents left that are remotely similar to the seed document, so the system only finds relevant documents by tripping over them at random (prevalence is 2.6%). The single random seed on the left did much better with passive learning than the weak seed, but it still ran out of gas while leaving many relevant documents undiscovered. Both seeds worked well with CAL. The first bar in the CAL chart for the weak seed shows it finding only a modest number of relevant documents (the same as SPL), but when it analyzes those documents and learns from them it is able to do much better with the second batch. The initial disadvantage of the weak seed is quickly erased.
Wrong Seed
If a weak seed works, what about a seed that is totally wrong? I picked a random document that was not remotely close to being relevant and tagged it as relevant. To make things a little more interesting, I also tagged the random seed document from the test above as non-relevant. So, my seed set consisted of two documents telling the system that non-relevant documents are relevant and relevant documents are non-relevant. When I ask the system to give me batches of documents that are predicted to be relevant it will surely give me a bunch of non-relevant documents, which I would then tag correctly. Will it be able to find its way to the relevant documents in spite of starting out in the completely wrong direction? Here is the result:
As expected, the first batch was purely non-relevant documents, but it was able to learn something from them–it learned what types of documents to avoid. In the second batch it managed to stumble across a single relevant (and correctly tagged!) document. The single seed hypothesis says that that single relevant document in the second batch should be enough to find virtually all of the relevant documents, and it was. It hit 97% recall in the graph above (before I decided to stop it). Also, the software warned me when the calculation was finished that the two seed documents appeared to be tagged incorrectly–my attempt at sabotage was detected!
Before proceeding with more experiments, I want to mention a point about efficiency. When the system has seen only a single relevant training document it doesn’t know which words in the document made it relevant, so the first batch of predictions may not be very good–it may pick documents because they contain certain words seen in the seed document when those words were not particularly important. As a result, it is more efficient to do a few small batches to allow it to sort out which words really matter before moving on to larger batches. Optimal batch size depends on the task, the document set, and the classification algorithm, but small is generally better than large.
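A ramp-up schedule like the one described could be sketched as follows (the specific sizes are arbitrary placeholders, not recommendations–optimal values depend on the task, the document set, and the algorithm):

```python
def batch_sizes(total, small=10, large=100, n_small=3):
    """Yield a few small batches first, so the model can sort out which
    words actually matter, then switch to full-sized batches until
    `total` documents have been requested."""
    issued = 0
    count = 0
    while issued < total:
        size = small if count < n_small else large
        size = min(size, total - issued)  # don't overshoot the budget
        yield size
        issued += size
        count += 1
```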
Disjoint Relevance
Maybe the relevant documents for the categorization task performed above were too homogeneous. Could it find everything if my definition of relevance included documents that were extremely different from each other? To test that I used a bunch of news articles and I defined a document to be relevant if it was about golf or biology. There were no articles that were about both golf and biology. If I seeded it with a single biology article, could it find the golf articles and vice versa? This figure shows the results:
It achieved 88% recall after reviewing 5.5% of the document population in both cases (prevalence was 3.6%). The top graph was seeded with a single random biology article, whereas the bottom one was seeded with a single random golf article. Golf articles are much easier for an algorithm to identify (once it has seen one in training) than biology articles. The bottom graph should serve as a warning to not give up too soon if it looks like the system is no longer finding relevant documents.
How did it go from finding articles about biology to articles about golf? The first golf article was found because it contained words like Watson, stole, Stanford, British, etc. None of the words would be considered very strong indicators of relevance for biology. When the low-hanging fruit has already been picked, the algorithm is going to start probing documents containing words that have been seen in relevant documents but whose importance is less certain. If that leads to a new relevant document, the system is exposed to new words that may be good indicators of relevance (e.g., golf-related words in this case), leading to additional relevant documents in later batches. If the documents it probes turn out to be non-relevant, it learns that those words weren’t good indicators of relevance and it heads in a different direction. You can think of it as being like the Six Degrees of Kevin Bacon game–the algorithm can get from the single seed document to virtually any relevant document by hopping from one relevant document to another via the words they have in common, discovering new words that allow connecting to different documents as it goes.
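The Kevin Bacon analogy can be made concrete by treating documents as nodes that are connected when they share a word: roughly speaking, CAL can reach any relevant document connected to the seed through such a chain. A simplified sketch (the real process is probabilistic, not a literal graph search):

```python
from collections import deque

def reachable(docs, start):
    """Breadth-first search over documents, treating two documents as
    connected if their word sets overlap. Returns the set of document
    indices reachable from `start` by word-sharing hops."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in range(len(docs)):
            if j not in seen and docs[i] & docs[j]:  # share at least one word
                seen.add(j)
                queue.append(j)
    return seen
```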
Performance
If it is possible to find most of the relevant documents from a single seed, is it an efficient approach? The figure below addresses that question. The arrows indicate various recall levels.
The first graph above shows SPL with the optimal amount of random training to reach 75% recall with the least total document review. The first eight batches are the random training documents–you can see that those batches contain very few relevant documents. After the eight training batches, documents that are predicted to be relevant are pulled in batches. For SPL, the system is not allowed to learn beyond the initial random training documents. The second graph shows CAL with the same set of random training documents. You can see that it reached 75% recall more quickly, and it reached 88% recall with the amount of document review that SPL took to reach 75% recall. The final graph shows CAL with a single seed. You can see in the figure above that two small batches of documents predicted to be relevant were reviewed before moving to full-sized batches.
The figure shows that the single seed CAL result usually hit a recall level two or three batches later than the more heavily trained CAL result, but it also had nearly eight fewer batches of “training” data (I’m putting quotes around training here because with CAL all reviewed documents are really training documents–the system learns from all of them). The improvement from the random training data (two or three fewer batches of “review”) wasn’t sufficient to cover its cost (eight more batches of “training”). The relative benefit of training with randomly selected documents may vary depending on the situation (e.g., reducing the “review” phase at the expense of more “training” may be more worthwhile for a larger document collection), but at least in the example above random sampling for training isn’t worthwhile beyond finding the first relevant seed document, which could probably be found more efficiently with keyword search. Judgmental sampling may be worthwhile if it is good at finding a diverse set of relevant documents while avoiding non-relevant ones.
The table below shows the proportion of the document set that must be reviewed, including training documents, to reach 75% recall for several different categorization tasks with varying prevalence and difficulty. In each case SPL was trained with a set of random documents with size optimized to achieve 75% recall with minimal review. The result called simply “CAL” uses the same random training set as the SPL result but allows learning to continue when batches of relevant documents are pulled. It would be unusual to use a large amount of random training documents with CAL, rather than using judgmental sampling, but I wanted to be able to show how much CAL improves on SPL with the same seed set and then show the additional benefit of reducing the seed set down to a single relevant document.
| Task | Prevalence | SPL | CAL | CAL rand SS | CAL weak SS |
|---|---|---|---|---|---|
| 1 | 6.9% | 10.9% | 8.3% | 7.7% | 7.6% |
| 2 | 4.1% | 4.3% | 3.7% | 3.5% | 3.5% |
| 3 | 2.9% | 16.6% | 12.5% | 8.6% | 8.8% |
| 4 | 1.1% | 10.9% | 6.7% | 3.2% | 3.2% |
| 5 | 0.68% | 1.8% | 1.8% | 0.8% | 0.9% |
| 6 | 0.52% | 29.6% | 8.4% | 5.2% | 7.1% |
| 7 | 0.32% | 25.7% | 17.1% | 2.6% | 2.6% |
In every case CAL beat SPL, and a single seed (whether random or weak) was always better than using CAL with the full training set that was used for SPL. Of course, it is possible that CAL with a smaller random seed set or a judgmental sample would be better than CAL with a single seed.
Short Documents
Since the algorithm hops from one relevant document to another by probing words that the documents have in common, will it get stuck if the documents are short because there are fewer words to explore? To test that I truncated each document at 350 characters, being careful not to cut any words in half. With an average of 95% of the document text removed, there will surely be some documents where all of the text that made them relevant is gone, so they’ll be tagged as relevant but there is virtually nothing in the text to justify considering them to be relevant, which will make performance metrics look bad. This table gives the percentage of the document population that must be reviewed, including any training, to reach 75% recall compared to SPL (trained with optimal number of random docs to reach 75% recall):
| Task | Prevalence | SPL | CAL rand SS | CAL weak SS |
|---|---|---|---|---|
| Full Docs | 2.6% | 5.0% | 3.0% | 3.1% |
| Truncated Docs | 2.6% | 19.4% | 7.1% | 6.9% |
The table shows that CAL with a single seed requires much less document review than SPL regardless of whether the documents are long or short, and even a weak seed works with the short documents.
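For reference, the truncation described above (cutting at 350 characters without splitting a word) can be done with a helper like this one; it is my own sketch, and the exact boundary handling used in the test may have differed slightly:

```python
def truncate(text, limit=350):
    """Cut text at `limit` characters without splitting a word in half."""
    if len(text) <= limit:
        return text
    # Find the last space at or before the limit so no word is cut.
    cut = text.rfind(" ", 0, limit + 1)
    if cut == -1:  # one giant token with no spaces; fall back to a hard cut
        return text[:limit]
    return text[:cut]
```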
How Seed Sets Influence Which Documents are Found
I’ve shown that you can find most of the relevant documents with CAL using any single seed, but does the seed impact which relevant documents you find? The answer is that it has very little impact. Whatever seed you start with, the algorithm is going to want to move toward the relevant documents that are easiest for it to identify. It may take a few batches before it gets exposed to the features that lead it to the easy documents (e.g., words that are strong indicators of relevance but are also fairly common, so it is easy to encounter them and they lead to a lot of relevant documents), but once it encounters the relevant documents that are easy to identify they will quickly overwhelm the few oddball relevant documents that may have come from a weak seed. The predictions that generate later batches are heavily influenced by the relevant documents that are easy for the algorithm to find, and each additional batch of documents and associated learning erases more and more of the impact of the starting seed. To illustrate this point, I ran calculations for a difficult categorization task (relevant documents were scattered across many small concept clusters) and achieved exactly 75% recall using various approaches. I then compared the results to see how many documents the different approaches had in common. Each approach found 848 relevant documents. Here is the overlap between the results:
| Row | Comparison | Num Relevant Docs in Common |
|---|---|---|
| 1 | Algorithm 1: CAL Rand SS v. CAL Weak SS | 822 |
| 2 | Algorithm 2: CAL Rand SS v. CAL Weak SS | 821 |
| 3 | Algorithm 1 CAL Rand SS v. Algorithm 2 CAL Rand SS | 724 |
| 4 | Algorithm 1: SPL Rand1 v. SPL Rand2 | 724 |
| 5 | Algorithm 1: SPL Rand1 v. CAL Rand SS | 725 |
| 6 | Algorithm 1: SPL Rand1 v. SPL Biased Seed | 708 |
| 7 | Algorithm 1: CAL Rand SS v. CAL Biased Seed | 824 |
The maximum possible number of documents that two calculations can have in common is 848 (i.e., all of them), and the absolute minimum is 565 because if one approach gives 75% recall a second approach can find at most the 25% of the full set of relevant documents the first didn’t find and then it must resort to finding documents that the first approach found. If two approaches were completely independent (think of one approach as picking relevant documents randomly), you would expect them to have 636 documents in common. So, it is reasonable to expect the numbers in the right column of the table to lie between 636 and 848, and they probably shouldn’t get very close to 636 since no two predictive coding approaches should be completely independent because they will detect many of the same patterns that make a document relevant.
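Those 565 and 636 figures follow from simple arithmetic, which can be sketched as:

```python
def overlap_bounds(found, recall):
    """Given that each of two approaches finds `found` relevant docs at
    the stated recall, return (minimum possible overlap, expected
    overlap if the approaches were independent, maximum overlap)."""
    total_relevant = round(found / recall)   # 848 / 0.75 -> 1131
    missed = total_relevant - found          # relevant docs the first approach missed
    # The second approach can absorb at most `missed` docs the first
    # didn't find; the rest of its finds must overlap.
    min_overlap = found - missed
    # If the second approach picked its `found` docs at random from the
    # relevant population, the expected overlap is found * (found / total).
    expected_independent = round(found * found / total_relevant)
    return min_overlap, expected_independent, found

print(overlap_bounds(848, 0.75))  # -> (565, 636, 848)
```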
Row 1 of the table shows that CAL with the same algorithm and two different single seeds, one random and one weak, give nearly the same set of relevant documents, with 822 of the 848 relevant documents found being the same. Row 2 shows that the result from Row 1 also holds for a different classification algorithm. Row 3 shows that if we use two different classification algorithms but use the same seed, the agreement between the results is much lower at just 724 documents. In other words, Rows 1-3 combined show that the specific relevant documents found with CAL depends much more on the classification algorithm used than on the seed document(s).
Row 4 shows that SPL with two different training sets of 4,000 random documents generates results with modest agreement, and Row 5 shows that the agreement is comparable between SPL trained with 4,000 random documents and CAL with a single random seed, so the CAL result is not particularly abnormal compared to SPL.
Row 6 compares SPL with 4,000 random training documents to SPL with a training set from a very biased search query that yielded 272 documents with 51 of them being relevant (prevalence for the full document population is 1.1%). The biased training set with SPL gives a result that is more different from SPL with random training than anything else tested. In other words, with SPL any bias in the training set is reflected in the final result. Row 7 shows that when that same biased training set is fed to CAL it has virtually no impact on which relevant documents are returned–the result is almost the same as the single seed with CAL.
Other Classification Algorithms
Will other classification algorithms work with a single relevant seed document? I tried a different (not terribly good) classification algorithm and it did work with a single seed, although it took somewhat more document review to reach the same level of recall. Keep in mind that all predictive coding software is different (there are many layers of algorithms and many different algorithms at each layer), so your mileage may vary. You should consult with your vendor when considering a different workflow, and always test the results for each case to ensure there are no surprises. The algorithm’s hyperparameters (e.g., amount of regularization) may need to be adjusted for optimal performance with single seed CAL.
Can It Fail?
A study by Grossman and Cormack shows a case where the single seed hypothesis failed for Topic 203 in Table 6 on page 159. They used a random sample containing two relevant documents and applied CAL, but the system had to go through 88% of the document population to find 75% of the relevant documents. It didn’t merely do a bad job of finding relevant documents–it actively avoided them (worse than reviewing documents randomly)! Gordon Cormack was kind enough to reply to my emails about this odd occurrence. I’m not sure this one is fully understood (machine learning can get a little complicated under the hood), but I think it’s fair to say that there was a strange confluence between some odd data and the way the algorithm interpreted it that allowed the algorithm to get stuck and not learn.
Here are some things that I could see (pure conjecture without any testing) potentially causing problems. If the document population contains multiple languages, I would not expect a single seed to be enough–one relevant seed document per language would be required, because I don’t think documents in different languages would have enough words in common for the algorithm to wander across a language boundary. A classification algorithm that is too stiff (e.g., too much regularization) may fail to find significant pockets of documents–you really want the system to be easily pushed in a different direction when a new relevant document is discovered. Finally, suppose a particular type of document in the population contains the same common chunk of text in every document, while relevance is determined by some other part of the document. You may need a relevant seed document from that document type; otherwise, the system may conclude that the common chunk of text is such a strong non-relevance indicator that those documents are never probed enough to learn that some of them are relevant.
Conclusions
My tests were performed on fairly clean documents with no near-dupes. Actual e-discovery data can be uglier, and it can be hard to determine how a specific algorithm will react to it. There is no guarantee that a single relevant seed document will be enough, but the experiments I’ve described should at least suggest that with CAL the seed set can be quite minimal, which allows relevant documents to be reviewed earlier in the process. Avoiding large training sets also means that predictive coding with CAL can be worthwhile for smaller document collections. Finally, with CAL, unlike SPL, the specific relevant documents that are found depend almost entirely on the algorithm used, not the seed set, so there is little point in arguing about seed set quality if CAL is used.
Can You Really Compete in TREC Retroactively?
I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football). I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.
One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years. Still, how could you do the comparison if you wanted to? The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails. Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.
Stratified sampling means that the documents are separated into mutually-exclusive buckets called “strata.” To the degree that stratification manages to put similar things into the same stratum, it can produce better statistical estimates (smaller uncertainty for a given amount of document review). The TREC Legal Track for 2009 created a stratum containing documents that all participants agreed were relevant. It also created four strata containing documents that all but one participant predicted were relevant (there were four participants, so one stratum for each dissenting participant). There were six strata where two participants agreed on relevance, and four strata where only one of the four participants predicted the documents were relevant. Finally, there was one stratum containing documents that all participants predicted were non-relevant, which was called the “All-N” stratum. So, for each stratum a particular participant either predicted that all of the documents were relevant or they predicted that all of the documents were non-relevant. You can view details about the strata in table 21 on page 39 here. Here is an example of what a stratification might look like for just two participants (the number of documents shown and percentage that are relevant may differ from the actual data):
A random subset of documents from each stratum was chosen and reviewed so that the percentage of the documents in the stratum that were relevant could be estimated. Multiplying that percentage by the number of documents in the stratum gives an estimate for the number of relevant documents in the stratum. Combining the results for the various strata allows precision and recall estimates to be computed for each participant. How could this be done for a team that didn’t participate? Before presenting some ideas, it will be useful to have some notation:
N[i] = number of documents in stratum i
n[i] = num docs in i that were assessed by TREC
n+[i] = num docs in i that TREC assessed as relevant
V[i] = num docs in i that vendor predicted were relevant
v[i] = num docs in i that vendor predicted were relevant and were assessed by TREC
v+[i] = num docs in i that vendor predicted were relevant and assessed as relevant by TREC
To make some of the discussion below more concrete, I’ll provide formulas for computing the number of true positives (TP), false positives (FP), and false negatives (FN). The recall and precision can then be computed from:
R = TP / (TP + FN)
P = TP / (TP + FP)
Here are some ideas I came up with:
1) They could have checked to see which strata the documents they predicted to be relevant fell into and applied the percentages TREC computed to their data. The problem is that since they probably didn’t identify all of the documents in a stratum as being relevant the percentage of documents that were estimated to be relevant for the stratum by TREC wouldn’t really be applicable. If their system worked really well, they may have only predicted that the truly relevant documents from the stratum were relevant. If their system worked badly, their system may have predicted that only the truly non-relevant documents from the stratum were relevant. This approach could give estimates that are systematically too low or too high. Here are the relevant formulas (summing over strata, i):
TP = Sum{ V[i] * n+[i] / n[i] }
FP = Sum{ V[i] * (1 - n+[i]/n[i]) }
FN = Sum{ (N[i] - V[i]) * n+[i] / n[i] }
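Approach (1) is straightforward to implement from the formulas above. A sketch, with the strata passed as parallel lists (the function name and data layout are mine):

```python
def approach1(N, n, n_plus, V):
    """Estimate TP/FP/FN by applying TREC's per-stratum relevance rate
    (n_plus[i]/n[i]) to the vendor's predicted-relevant counts V[i].
    N[i] is the stratum size, n[i] the number TREC assessed."""
    TP = FP = FN = 0.0
    for Ni, ni, npi, Vi in zip(N, n, n_plus, V):
        rate = npi / ni                  # TREC's estimated fraction relevant
        TP += Vi * rate                  # predicted relevant and truly relevant
        FP += Vi * (1 - rate)            # predicted relevant but not relevant
        FN += (Ni - Vi) * rate           # relevant docs the vendor missed
    recall = TP / (TP + FN)
    precision = TP / (TP + FP)
    return TP, FP, FN, recall, precision
```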
2) Instead of using the percentages computed by TREC, they could have computed their own percentages by looking at only the documents in the stratum that they predicted were relevant and were reviewed by TREC to give a relevance determination. This would eliminate the possible bias from approach (1), but it also means that the percentages would be computed from a smaller sample, so the uncertainty in the percentage that are relevant would be bigger. The vendor didn’t provide confidence intervals for their results. Here is how the computation would go:
TP = Sum{ V[i] * v+[i] / v[i] }
FP = Sum{ V[i] * (1 – v+[i]/v[i]) }
FN = Sum{ (N[i] – V[i]) * (n+[i] – v+[i]) / (n[i] – v[i]) }
It’s possible that for some strata there would be no overlap between the documents TREC assessed and the documents the vendor predicted to be relevant since TREC typically assessed only about 4% of each stratum for Topic 207 (except the All-N stratum, where they assessed only 0.46%). This approach wouldn’t work for those strata since v[i] would be 0. For strata where v[i] is 0, one might use approach (1) and hope it isn’t too wrong.
3) A more sophisticated tweak on (2) would be to use the ratio n+[i]/n[i] from (1) to generate a Bayesian prior probability distribution for the proportion of documents predicted by the vendor to be relevant that actually are relevant, and then use v+[i] and v[i] to compute a posterior distribution for that proportion and use the mean of that distribution instead of v+[i]/v[i] in the computation. The idea is to have a smooth interpolation between using n+[i]/n[i] and using v+[i]/v[i] as the proportion of documents estimated to be relevant, where the interpolation would be closer to v+[i]/v[i] if v[i] is large (i.e., if there is enough data for v+[i]/v[i] to be reasonably accurate). The result would be sensitive to choices made in creating the Bayesian prior (i.e., how much variance to give the probability distribution), however.
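As a sketch of how that interpolation might work (the function name and the pseudo-count parameter m are my own choices, nothing from TREC): use a Beta prior whose mean is n+[i]/n[i] and whose strength is m, then take the posterior mean after observing v+[i] relevant documents out of v[i].

```python
def smoothed_proportion(n_plus, n, v_plus, v, m=10):
    """Posterior mean of the proportion relevant, under a Beta prior.

    The prior Beta(a, b) is centered on the stratum-wide proportion
    n_plus/n, with total pseudo-count m (a hypothetical tuning choice)
    controlling how strongly it resists the overlap sample.  Observing
    v_plus relevant documents out of v then updates the prior.
    """
    a = m * n_plus / n  # prior pseudo-count of relevant docs
    b = m - a           # prior pseudo-count of non-relevant docs
    return (a + v_plus) / (a + b + v)
```

With v[i] = 0 this reduces to n+[i]/n[i]; as v[i] grows it approaches v+[i]/v[i]. The result is sensitive to the choice of m, which is exactly the prior-variance sensitivity noted above.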
4) They could have ignored all of the documents that weren’t reviewed in TREC (over 500,000 of them) and just performed their predictions and analysis on the 3,709 documents that had relevance assessments (training documents should come from the set TREC didn’t assess and should be reviewed by the vendor to simulate actual training at TREC being done by the participants). It would be very important to weight the results to compensate for the fact that those 3,709 documents didn’t all have the same probability of being selected for review. TREC oversampled the documents that were predicted to be relevant compared to the remainder (i.e., the number of documents sampled from a stratum was not simply proportional to the number of documents in the stratum), which allowed their stratification scheme to do a good job of comparing the participating teams to each other at the expense of having large uncertainty for some quantities like the total number of relevant documents. The prevalence of relevant documents in the full population was 1.5%, but 9.0% of the documents having relevance assessments were relevant. Without weighting the results to compensate for the uneven sampling, you would be throwing away over half a million non-relevant documents without giving the system being tested the opportunity to incorrectly predict that some of them are relevant, which would lead to an inflated precision estimate. The expression “shooting fish in a barrel” comes to mind. Weighting would be accomplished by dividing by the probability of the document having been chosen (after this article was published I learned that this is called the Horvitz-Thompson estimator, and it is what the TREC evaluation toolkit uses), which is just n[i]/N[i], so the computation would be:
TP = Sum{ (N[i]/n[i]) * v+[i] }
FP = Sum{ (N[i]/n[i]) * (v[i] – v+[i]) }
FN = Sum{ (N[i]/n[i]) * (n+[i] – v+[i]) }
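To make the bookkeeping concrete, here is one way approaches (1), (2), and (4) might be coded as stratum-by-stratum sums (the data layout and function name are my own; this is a sketch, not the TREC evaluation toolkit), with approach (2) falling back to approach (1) when v[i] is zero:

```python
def estimates(strata, approach):
    """Return (TP, FP, FN) estimates summed over strata.

    Each stratum is a dict with keys N, n, n_plus, V, v, v_plus,
    matching the notation in the text.  approach is 1, 2, or 4.
    """
    TP = FP = FN = 0.0
    for s in strata:
        N, n, np_ = s["N"], s["n"], s["n_plus"]
        V, v, vp = s["V"], s["v"], s["v_plus"]
        if approach == 1 or (approach == 2 and v == 0):
            # stratum-wide percentages applied to the vendor's picks
            TP += V * np_ / n
            FP += V * (1 - np_ / n)
            FN += (N - V) * np_ / n
        elif approach == 2:
            # percentages from the vendor/TREC overlap only
            TP += V * vp / v
            FP += V * (1 - vp / v)
            FN += (N - V) * (np_ - vp) / (n - v)
        elif approach == 4:
            # Horvitz-Thompson weighting of assessed documents
            TP += (N / n) * vp
            FP += (N / n) * (v - vp)
            FN += (N / n) * (np_ - vp)
    return TP, FP, FN
```

Recall and precision then follow from R = TP/(TP+FN) and P = TP/(TP+FP) as above.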
Note that if N[i]/n[i] is equal to V[i]/v[i], which is expected to be approximately true since the subset of a stratum chosen for assessment by TREC is random, the result would be equal to that from (2). If N[i]/n[i] is not equal to V[i]/v[i] for a stratum, we would have the disturbing result that the estimate for TP+FP for that stratum would not equal the number of documents the vendor predicted to be relevant for that stratum, V[i].
5) The vendor could have ignored the TREC relevance determinations, simply doing their own. That would be highly biased in the vendor’s favor because there would be a level of consistency between relevance determinations for the training data and testing data that did not exist for TREC participants. At TREC the participants made their own relevance determinations to train their systems and a separate set of Topic Authorities made the final relevance judgments that determined the performance numbers. To the degree that participants came to different conclusions about relevance compared to the Topic Authorities, their performance numbers would suffer. A more subtle problem with this approach is that the vendor’s interpretation of the relevance criteria would inevitably be somewhat different from that of TREC assessors (studies have shown poor agreement between different review teams), which could make the classification task either easier or harder for a computer. As an extreme example, if the vendor took all documents containing the word “football” to be relevant and all other documents to be non-relevant, it would be very easy for a predictive coding system to identify that pattern and achieve good performance numbers.
Approaches (1)-(4) would all give the same results for the original TREC participants because for each stratum they would either have V[i]=0 (so v[i]=0 and v+[i]=0) or they would have V[i]=N[i] (so v[i]=n[i] and v+[i]=n+[i]). The approaches differ in how they account for the vendor predicting that only a subset of a stratum is relevant. None of the approaches described are great. Is there a better approach that I missed? TREC designed their strata to make the best possible comparisons between the participants. It’s hard to imagine how an analysis could be as accurate for a system that was not taken into account in the stratification process. If a vendor is tempted to make such comparisons, they should at least disclose their methodology and provide confidence intervals on their results so prospective clients can determine whether the performance numbers are actually meaningful.
Comments on Rio Tinto v. Vale and Sample Size
Judge Peck recently issued an opinion in Rio Tinto PLC v. Vale SA, et al, Case 1:14-cv-03042-RMB-AJP where he spent some time reflecting on the state of court acceptance of technology-assisted review (a.k.a. predictive coding). The quote that will surely grab headlines is on page 2: “In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.” He lists the relevant cases and talks a bit about transparency and disclosing seed sets. It is certainly worth reading.
Both parties in Rio Tinto v. Vale have agreed to disclose all non-privileged documents, including non-responsive documents, from their control sets, seed sets, and training sets. Judge Peck accepts their protocol because they both agree to it, but hints that disclosing seed sets may not really be necessary (p. 6, “…requesting parties can insure that training and review was done appropriately by other means…”).
I find one other aspect of the protocol the litigants proposed to be worthy of comment. They make a point of defining a “Statistically Valid Sample” on p. 11 to be one that gives +/- 2% margin of error at 95% confidence, and even provide an equation to compute the sample size in footnote 2. Their equation gives a sample size of at most 2,395 documents, depending on prevalence. They then use the “Statistically Valid Sample” term in contexts where it isn’t (as they’ve defined it) directly appropriate. I don’t know if this is just sloppiness (missing details about what they actually plan to do) or a misunderstanding of statistics.
For example, section 4.a.ii on p. 13 contemplates culling before application of predictive coding, and says they will “Review a Statistically Valid Sample from the Excluded Documents.” Kudos to them for actually measuring how many relevant documents they are culling instead of just assuming that keyword search results should be good enough without any analysis, but 2,395 documents is not the right sample size. The more documents you are culling, the more precisely you need to know what proportion of them were relevant in order to have a reasonably precise value for the number of relevant documents culled, which is what matters for computing recall. In other words, a +/- 2% measurement on the culled set does not mean +/- 2% for recall. I described a similar situation in more detail in my Predictive Coding Confusion article under the heading “Beware small percentages of large numbers.” My eRecall: No Free Lunch article also discusses similar issues.
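To see how much the culled-set margin of error can be amplified, here is a small sketch with hypothetical numbers (not from the actual protocol): 10,000 relevant documents found, 900,000 documents culled, and an elusion estimate of 0.5% with the protocol's +/- 2% margin of error.

```python
def recall_interval(found, culled, elusion, moe):
    """Recall bounds implied by an elusion estimate of elusion +/- moe
    on a culled set of size `culled`, given `found` relevant documents
    in the retained set."""
    missed_hi = culled * (elusion + moe)
    missed_lo = culled * max(elusion - moe, 0.0)
    return (found / (found + missed_hi),
            found / (found + missed_lo))

# Hypothetical numbers for illustration only
lo, hi = recall_interval(10_000, 900_000, 0.005, 0.02)
```

Even though the elusion measurement satisfies the +/- 2% requirement, the implied recall here ranges from roughly 31% to 100%, which is close to knowing nothing.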
Section 4.b on p. 13 says that the control set will be a Statistically Valid Sample that will be used to measure prevalence. They explain in a separate letter to Judge Peck on p. 9 that the control set will be used to track progress by estimating precision and recall. Do they intend to use 2,395 (or fewer) documents for the control set? Suppose only one of the 2,395 documents is actually relevant. That would give a prevalence estimate of 0.0011% to 0.2321% with 95% confidence (via this calculator), which is certainly better than the required +/- 2%, but it is useless for tracking progress because the uncertainty is huge compared to the value itself. If they had a million documents, the estimate would tell them that somewhere between 11 and 2,321 of them are relevant. So, if they found 11 relevant documents with their predictive coding software they would estimate that they achieved somewhere between 0.5% and 100% recall. To look at it a little differently, if they looked at their system’s prediction for the control set they would find that it either correctly predicted that the one relevant document was relevant (100% recall) or they would find that it was predicted incorrectly (0% recall), with dumb luck being a big factor in which result they got.
Maybe they intended that the control set contain 2,395 relevant documents, which would give a recall estimate accurate to +/- 2% with 95% confidence (more precise than really seems worthwhile for a control set) by measuring the percentage of relevant documents in the control set that are predicted correctly. If prevalence is 10%, the control set would need to contain about 23,950 documents to have 2,395 that are relevant. If prevalence is 1%, the control set would require about 239,500 documents. That sure seems like a lot of documents to review just to create a control set. The point is that it is the number of relevant documents in the control set, not the number of documents, that determines how precisely the control set can measure recall. Their protocol does say that the requesting party will have ten business days to check the control set if it is more than 4,000 documents, so it does seem that they’ve contemplated the possibility of using more than 2,395 documents in the control set, but the details of what they are really planning to do are missing. Of course, the control set is there to help the producing party optimize their process, so it is their loss if they get it wrong (assuming there is separate testing that would detect the problem, as described in section 4.f).
Finally, section 4.f on p. 16 talks about taking a Statistically Valid Sample from the documents that are predicted to be non-relevant to estimate the number of relevant documents that were missed by predictive coding, leading to a recall estimate. This has the same problem as the culling in section 4.a.ii — the size of the sample that is required to achieve a desired level of uncertainty in the recall depends on the size of the set of documents being culled, whether the culling is due to keyword searching before applying predictive coding or whether the culling is due to discarding documents that the predictive coding system predicts are non-relevant.
If the goal is to arrive at a reasonably precise estimate of recall (and, I’m certainly not arguing that +/- 2% should be required), it is important to keep track of how the uncertainty from each sample propagates through to the final recall result (e.g. it may be multiplied by some large number of culled documents) when choosing an appropriate sample size. I may be nitpicking, but it strikes me as odd to lay out a specific formula for calculating sample size and then not mention that it cannot be applied directly for the sampling that is actually being contemplated.
Highlights from the East Coast eDiscovery & IG Retreat 2014
The East Coast eDiscovery & IG Retreat is a new addition to the series of retreats held by Chris La Cour’s company, Ing3nious. It was held at the Chatham Bars Inn in Chatham, Massachusetts. This is the fourth year Chris has been organizing retreats, with the number of retreats and diversity of themes increasing in recent years, but this is the first one held outside of California. As always, the venue was beautiful (more photos here), and the conference was informative and well-organized. My notes below only capture a small amount of the information presented. There were often two simultaneous sessions, so I couldn’t attend everything.
Keynote: Information Governance in a Predictive World
Big data isn’t just about applying the same old analysis to more data. It’s about real-time or near-real-time, action-oriented analysis. It allows asking new questions. Technology now exists to allow police to scan license plates while driving through a parking lot to check for stolen cars. Amazon may implement predictive shipping, where it ships a product to a customer before the customer orders it, which requires predictions with high confidence. Face recognition technology is already in use at airports to track who is entering and whether they belong there. When the speaker tweeted a complaint about an airline, he got a personalized reply from the airline on Twitter within 30 seconds, thanks to technology.
Proactive information governance will allow problems (sexual harassment, fraud) to be detected immediately so they can be corrected, instead of finding out later when there is a lawsuit. It is possible to predict when someone will leave a company — they may become short with people, or spend more time on LinkedIn.
Change will come to the Federal Rules of Civil Procedure in 2015. Rule 37(e) on sanctions for spoliation will change from “willful or bad faith” to “willful and bad faith” to encourage more deletion. Have a policy to delete as much as possible, follow it, and be prepared to prove that you follow it.
Case Study: The Swiss Army Knife Approach to eDiscovery
Predictive coding failed for a case where the documents were very homogeneous (keywords didn’t work either; the document set was large, but not too large for eyes-on review). It also failed when there were too many issue codes. Predictive coding had problems with Spanish documents where there were many different dialects. Predictive coding works well for prioritizing documents when there is a quick deposition schedule.
Email domain name filtering can be used to remove junk or to detect things like gmail accounts that may be relevant. Also, look for gaps in dates for emails — it could be that the person was on vacation, or maybe something was removed.
Clustering or near-dupe is useful to ensure consistency of redactions.
There is a benefit to reviewing the documents yourself to understand the case. Not a fan of producing documents you haven’t seen.
May need to do predictive coding or clustering in a foreign country to minimize the amount of data that needs to be brought to the U.S.
Talk to custodians about acronyms. Look at word lists — watch for unusual words, or how word usage corresponds to timing of events.
Real-World Impact of Information Governance on the eDiscovery Process
I couldn’t attend this one.
Future-think: How Will eDiscovery Be Different in 5-10 Years?
Cost cutting: cull in-house and use targeted collections. The big obstacle is getting people to learn technology. Some people still print email to read it.
Analysis of cases is accelerating. Even small cases are impacted by technology. For example, information about a car accident is recorded. In the future, ESI will be collected and a conclusion will be reached without a trial. Human memory is obsolete — everything will be ESI.
Personal devices may be subject to discovery. Employment agreements should make that clear in advance.
Privacy will be a big legal field. Information is even collected about children — what they eat at school and when they are on the bus.
Will schools be held liable for student loans if the school fails to predict that the student will fail?
There are concerns about security of the cloud.
Businesses should demand change to make things more efficient, like arbitration.
Information Governance – Teams, Litigation Holds and Best Practices
I couldn’t attend this one.
Recent Developments in Technology Assisted Review — Is TAR Gaining Traction?
Three panelists said predictive coding didn’t work for them for identifying privileged documents. One panelist (me) said he had a case where it worked well for priv docs. Although I didn’t mention it at the time since I couldn’t recall the reference off the top of my head, there is a paper with a nice discussion about the issues around finding priv docs that also claims success at using predictive coding.
What level of recall is necessary? There seemed to be consensus that 75% was acceptable for most purposes, but people sometimes aim higher, depending on the circumstances (e.g., to ward off objections from the other side).
Is it OK to use keyword search to cull down the document population before applying predictive coding? Must be careful not to cull away too many of the relevant documents (e.g., the Biomet case).
There is a lot of concern about being required to turn over training documents (especially non-responsive ones) to the other side. I pointed out that it is not like turning over search terms. It is very clear whether or not a document matches a search query, but disclosing training documents does not tell what predictions a particular piece of predictive coding software will give. In fact, some software will (hopefully, rarely) fail to produce documents that are near-dupes of relevant training documents, so one should not assume that the disclosure of training documents guarantees anything about what will be produced. There was concern that disclosure of non-relevant training documents by some parties will set a bad precedent.
Top 5 Trends in Discovery for 2014
I couldn’t attend this one.
Recruiting the best eDiscovery Team
Cybersecurity is a concern. It is important to vet service providers. Many law firms are not as well protected as one would like.
When required to give depositions about e-discovery process, paralegals can do well. IT people tend to get stressed. Lawyers can be too argumentative.
Need a champion to encourage everyone to get things done.
Legal hold is often drafted by outside counsel but enforced by in-house counsel.
Don’t have custodians do their own self-collection (e.g., based on search terms), but may have IT do collection (less expensive than using outside consultant, but must be able to explain what they did).
Information governance and changes to FRCP will reduce costs over the next five years.
eRecall: No Free Lunch
There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.
Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant
Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant
It requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR involves using only random sampling for training, you may have a very good estimate of the prevalence of relevant documents in the full population by simply measuring it on your (potentially large) training set, so you multiply the prevalence by the total number of documents to get TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents to measure the percentage of them that were relevant, i.e. FractionMissed (commonly known as the false omission rate or elusion). How many?
To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities. Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents. For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents. With this approximation, we have:
eRecall = 1 – FractionMissed / Prevalence
It is the very small prevalence value in the denominator that is the killer–it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.
Let’s try some specific numbers. Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%. Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result. If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):
FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)
Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence. Not a very tight error bar!
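The exact interval can be computed with nothing more than the binomial distribution. Here is a sketch (my own implementation, not any particular online calculator) that reproduces the numbers above by bisecting on the binomial CDF to get a Clopper-Pearson interval:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_ci(k, n, conf=0.95):
    """Clopper-Pearson exact confidence interval for a proportion,
    found by bisection on the binomial CDF (decreasing in p)."""
    alpha = 1 - conf

    def solve(target, cdf_k):
        lo, hi = 0.0, 1.0
        for _ in range(100):  # bisection; plenty of precision
            mid = (lo + hi) / 2
            if binom_cdf(cdf_k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(1 - alpha / 2, k - 1)
    upper = 1.0 if k == n else solve(alpha / 2, k)
    return lower, upper

# 10 relevant documents in a 4,000-document sample of the culled set
lo, hi = exact_ci(10, 4000)   # roughly 0.12% to 0.46%
prevalence = 0.01
print(1 - hi / prevalence, 1 - lo / prevalence)  # eRecall bounds, roughly 54% to 88%
```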
If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:
FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)
Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.
For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall). Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.
So, the number of samples required for eRecall is about the same as the Direct Recall method if you require a comparable amount of certainty in the result. There’s no free lunch to be found here.
Predictive Coding Confusion
This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.
Confidence intervals vs. training set size: You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property. The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population. If you sample 400 documents, the 95% confidence interval is +/- 5%, meaning that 95% of the time the range from -5% to +5% around your estimate will contain the actual value for the full population. For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will enclose the actual prevalence for the full document set. To cut the size of the confidence interval in half you need four times as many documents, so a sample of 1,600 documents gives a 95% confidence interval of +/- 2.5%. The sample size needed to achieve a confidence interval of a certain size does not depend on the number of documents in the full population (unless the sample is a substantial proportion of the entire document population, which would be strange), so sample sizes like 400 or 1,600 documents can be committed to memory and applied to any document set.
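Those memorizable sample sizes follow from the worst-case (p = 0.5) normal approximation for the margin of error, which is easy to check:

```python
from math import sqrt

def worst_case_moe(n, z=1.96):
    """95% margin of error for a proportion at the worst case p = 0.5,
    using the normal approximation: z * sqrt(p * (1 - p) / n)."""
    return z * sqrt(0.25 / n)

print(worst_case_moe(400))   # about 0.049, the familiar +/- 5%
print(worst_case_moe(1600))  # about 0.0245, i.e. +/- 2.5%
```

Note that n does not appear alongside the population size anywhere in the formula, which is why the same sample sizes apply to any document set.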
Sample sizes related to confidence intervals have nothing to do with sample sizes needed to train a predictive coding algorithm. Confidence intervals are for estimating the number of relevant documents, but training is about teaching the system to identify which documents are relevant. A pollster could survey 1,600 voters and estimate the number that would vote for a particular candidate to within +/- 2.5%, but that would not enable him/her to predict who some arbitrary person would vote for — that’s just a completely different problem from estimating the number of votes. For a predictive coding system to make good predictions it needs enough training documents for it to identify the patterns that indicate relevance. The number of documents required for training depends on the algorithm used and the difficulty of the categorization task. To illustrate that point, consider the two categorization tasks mentioned in my previous article. Both involve the same set of 100,000 documents and have nearly the same prevalence of relevant documents (0.986% and 1.131%), but the difficulty is very different. I measured the optimal number of random training documents to achieve 75% recall while reviewing the smallest possible total number of documents (training + review) and found:
| | Training Docs | Review Docs | Total Docs Reviewed |
|---|---|---|---|
| Task 1 | 300 | 800 | 1,100 |
| Task 2 | 4,500 | 6,500 | 11,000 |
If the number of training documents is below the optimal level, the predictions won’t be very good (low precision) and you’ll have to review an excessive number of non-relevant documents to find 75% of the relevant documents. If the number of training documents is above the optimal level, the benefit from higher precision achieved because of the extra training won’t be sufficient to offset the cost of reviewing the additional training documents. You can see from the table that there is a factor of 15 difference in the optimal number of training documents for the two tasks. Unlike the number of documents needed to achieve a certain confidence interval, there is no simple answer when it comes to the number of documents needed for training.
Sampling counts documents not importance: Sampling allows you to estimate the number of relevant documents that were missed by the predictive coding algorithm. That doesn’t tell you anything about the importance of the documents that were missed. As discussed in my article on relevance score, predictive coding algorithms put the documents that they are most confident are relevant at the top of the list. The documents at the top are not necessarily the most important documents for the case. To the extent that a “smoking gun” is very different from any of the documents in the training set, the algorithm may have little confidence that it is relevant, so it may get a modest or even low relevance score. Claims that nothing critical is lost when documents below some relevance score cutoff are culled because the number of relevant documents that are discarded is small are simply unfounded.
Beware small percentages of large numbers: Suppose that a predictive coding system identifies 30,000 documents out of a population of a million as likely to be relevant, and the vendor claims 95% precision was achieved, meaning that 95% of the documents predicted to be relevant actually are relevant. The vendor also claims, based on sampling, that only 1% of the predictions were false negatives, meaning that the system predicted that the documents were non-relevant when they were actually relevant. How many relevant documents were found, and how many were missed? The number found is:
95% * 30,000 = 28,500
The number missed is (I’m over-counting a little by not subtracting out documents used for training [i.e. no prediction was made for them], which were presumably a small fraction of the full population):
1% * 1,000,000 = 10,000
The recall, the percentage of relevant documents that were actually found by the predictive coding system, is (using point estimates in a non-linear equation like this is somewhat wrong, but in this case the error is less than 1%):
28,500 / (28,500 + 10,000) = 74%
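The arithmetic above can be wrapped in a tiny helper (the naming is mine) to experiment with other values:

```python
def recall_estimate(precision, num_predicted, fn_rate, population):
    """Point estimate of recall from the precision on the
    predicted-relevant set and the false negative rate applied to the
    full population (over-counting slightly, as noted in the text)."""
    found = precision * num_predicted  # true positives
    missed = fn_rate * population      # relevant docs not found
    return found / (found + missed)

r = recall_estimate(0.95, 30_000, 0.01, 1_000_000)  # about 0.74
```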
That recall might be acceptable, but it isn’t great. The seemingly small 1% has a big impact because it applies to the entire document population, not just the relatively small number of documents that were predicted to be relevant. That 1% was estimated using sampling, so there is uncertainty in the value (there may also be uncertainty in the 95% precision value, but I’m going to ignore that for simplicity). If 1,600 documents were sampled and 16 were found to be false negatives, the 95% confidence interval would seem to go from -1.5% to +3.5%. How can the percentage be negative? It can’t. The +/- 2.5% interval is actually designed to accommodate the worst-case scenario, which occurs when 50% of the documents have the property being measured. When the percentage of documents having the property is very far from 50%, as is often the case in e-discovery, the 95% confidence interval is smaller and is not centered on the estimated value. Equations for the confidence interval in such situations involve approximations that won’t always be appropriate for the examples in this article, so I’m going to recommend that you use an exact confidence interval calculator instead of dealing with equations that will give wrong results if used when they’re not appropriate. With 95% confidence the interval is 0.57% to 1.62%, so the number of relevant documents missed is between 5,700 and 16,200 with 95% confidence. That means the recall is between 64% and 83% with 95% confidence.
What if the vendor based the 1% false negative number on a sample of 400 documents instead of 1,600? The confidence interval would be 0.27% to 2.54%, so the recall would be between 53% and 91% with 95% confidence (if you prefer the one-tail confidence interval, the upper bound is 2.27% giving a minimum recall of 56%). If all we have to go on is a 1% false negative number based on a 400 document sample, we cannot assume that we’ve found much more than half of the relevant documents! Not only does the 1% false negative number have a big impact, but it has a big error bar compared to the value itself (1% might really be 2.54%), so the worst case scenario is pretty ugly.
Beware recall values lacking confidence intervals: Suppose the vendor claims that 98% recall was achieved in the example above. Is that plausible in light of the 1% false negative number? Before diving into the math, it should be said that 98% recall is very high. The closer you get to 100% recall, the harder it is to find additional relevant documents without having to wade through a lot of non-relevant documents because all the relevant documents that are easy to identify have already been found. Achieving 95% precision at 98% recall, as claimed, would be close to perfection. So, how big is the error bar for that 98% recall?
Without knowing the details of how the vendor calculated the recall, let’s try to come up with something plausible. Suppose the vendor set aside a control set (a random sample of documents that were reviewed but not used for training) of 1,600 documents to monitor the system’s ability to make good predictions as training progressed, 50 of those documents were relevant, and the system correctly predicted that 49 were relevant (so it missed just one); the recall estimate would then be 49/50 = 98%. Turning to the confidence interval calculator, and keeping in mind that our sample size is 50, not 1,600, because we’re estimating the proportion of the sampled relevant documents (there are only 50) that were correctly predicted to be relevant, we find that with 95% confidence the recall lies between 89.3% and 99.9%. So, the claimed 98% recall might really be only 89%.

The recall estimate from the control set barely overlaps with the 53% to 91% recall range implied by a 1% false negative number based on 400 sample documents, and is definitely not consistent with a 1% false negative number measured on 1,600 sample documents. Looking at it from a different angle, if 1% of predictions are false negatives and the control set contains 1,600 documents, you would expect the control set to contain about sixteen relevant documents that the system predicted were non-relevant, but a claim of 98% recall implies there was only one such document, not sixteen. It’s hard to see the 1% false negative number as being consistent with 98% recall.

We’ve been using a 95% confidence level, which means that 5% of the time the confidence interval we compute from the data will fail to capture the real value, so with recall estimates coming from different samples (the control set and the sample used to measure false negatives), inconsistent results could mean that one value is simply wrong. Which one, though? The bottom line is that there are several ways to estimate recall, and all available numbers should be tested for consistency. Without a confidence interval, nobody knows how meaningful the recall estimate really is.
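The consistency check above is simple arithmetic, and it’s worth writing down because it’s such an easy sanity test to apply to vendor numbers. A quick sketch using the figures from the example (a 1,600-document control set, a claimed 1% false negative rate, and 49 of 50 relevant control documents correctly predicted):

```python
# Does a 1% false negative rate square with a 98% recall claim?
control_size = 1600        # documents in the control set
fn_rate = 0.01             # claimed rate of relevant docs predicted non-relevant
relevant_in_control = 50   # relevant documents in the control set
predicted_relevant = 49    # of those, correctly predicted relevant

# Expected false negatives in the control set if the 1% figure is right:
expected_fn = fn_rate * control_size                    # 16 documents
# Observed false negatives implied by the 98% recall claim:
observed_fn = relevant_in_control - predicted_relevant  # 1 document

print(expected_fn, observed_fn)  # 16.0 vs 1 -- hard to reconcile
```

When two estimates of the same quantity disagree this badly, at least one of the underlying numbers (or the sampling behind it) is suspect.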
Fair Comparison of Predictive Coding Performance
Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well. It is important for users of such software to avoid drawing wrong conclusions from performance numbers.
Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):
The one on the left is incredibly good, with 97% precision at 90% recall. The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).
Why are the two curves so different? They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample of documents used for training. The only difference is the categorization task being attempted, i.e. what type of document we consider to be relevant. Both tasks have nearly the same prevalence of relevant documents (0.986% for the left and 1.131% for the right), but the task on the left is very easy and the one on the right is a lot harder. So, when a vendor quotes performance numbers, you need to keep in mind that they are only meaningful for the specific document set and task that they came from. Performance for a different task or document set may be very different. Comparing a vendor’s performance numbers to those from another source computed for a different categorization task on a different document set would be comparing apples to oranges.
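The 4.7% review-depth figure for the harder task can be checked from the quoted prevalence, recall, and precision alone. A small sketch (nothing assumed here beyond the numbers quoted above):

```python
# Fraction of the population that must be reviewed to reach a target
# recall, given the prevalence of relevant documents and the precision
# achieved at that recall: reviewed = (prevalence * recall) / precision.
def review_depth(prevalence, recall, precision):
    return prevalence * recall / precision

# Harder task from the example: 1.131% prevalence, 70% recall at 17% precision.
depth = review_depth(0.01131, 0.70, 0.17)
print(f"{depth:.1%}")  # about 4.7% of the population
```

The numerator is the fraction of the population that is both relevant and found (prevalence times recall); dividing by precision converts that into the total fraction reviewed, since only that proportion of reviewed documents is relevant.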
Fair comparison of different predictive coding approaches is difficult, and one must be careful not to extrapolate results from any study too far. As an analogy, consider performing experiments to determine whether fertilizer X works better than fertilizer Y. You might plant marigolds in each fertilizer, apply the same amount of water and sunlight, and measure plant growth. In other words, keep everything the same except the fertilizer. That would give a result that applies to marigolds with the specific amount of sunlight and water used. Would the same result occur for carrots? You might take several different types of plants and apply the same experiment to each to see if there is a consistent winner. What if more water was used? Maybe fertilizer X works better for modest watering (it absorbs and retains water better) and fertilizer Y works better for heavy watering. You might want to present results for different amounts of water so people could choose the optimal fertilizer for the amount of rainfall in their locations. Or, you might determine the optimal amount of water for each, and declare the fertilizer that gives the most growth with its optimal amount of water the winner, which is useful only if gardeners/farmers can adjust water delivery. The number of experiments required to cover every possibility grows exponentially with the number of parameters that can be adjusted.
Predictive coding is more complicated because there are more interdependent parts that can be varied. Comparing classification algorithms on one document set may give a result that doesn’t apply to others, so you might test on several document sets (some with long documents, some with short, some with high prevalence, some with low, etc.), much like testing fertilizer on several types of plants, but that still doesn’t guarantee that a consistent winner will perform best on some untested set of documents. Does a different algorithm win if the amount of training data is higher/lower, similar to a different fertilizer winning if the amount of water is changed? What if the nature of the training data (e.g., random sample vs. active learning) is changed? The training approach can impact different classification algorithms differently (e.g., an active learning algorithm can be optimized for a specific classification algorithm), making the results from a study on one classification algorithm inapplicable to a different algorithm. When comparing two classification algorithms where one is known to perform poorly for high-dimensional data, should you use feature selection techniques to reduce the dimensionality of the data for that algorithm under the theory that that is how it would be used in practice, but knowing that any poor performance may come from removing an important feature rather than from a failure of the classification algorithm itself?
What you definitely should not do is plant a cactus in fertilizer X and a sunflower in fertilizer Y and compare the growth rates to draw a conclusion about which fertilizer is better. Likewise, you should not compare predictive coding performance numbers that came from different document sets or categorization tasks.
Highlights from the ACEDS 2014 E-Discovery Conference
The ACEDS E-Discovery Conference was a well-organized conference at a nice venue with two full days of informative sessions. Copies of all slides were provided to attendees, so my note-taking was mostly limited to things that weren’t on the slides, and reflects only a tiny fraction of the information presented. Also, there were sometimes two simultaneous sessions, so I couldn’t attend everything. If you attended, please let me know if you notice any errors in my notes below.
- There is some reluctance to use TAR (technology-assisted review) due to a fear that disclosure of the seed set, including privileged and non-responsive documents, will be required.
- Royce Cohen said there was a recent case where documents couldn’t be clawed back in spite of having a clawback agreement because attorneys’ eyes had not been put on the produced documents.
- A few years ago, use of TAR was typically disclosed, but today very few are disclosing it.
- Regarding the requirement for specificity rather than boilerplate in e-discovery objections, Judge Waxse recommended that everyone read the Mancia v. Mayflower Textile Services opinion by Judge Grimm.
- Judge Waxse said he resolves e-discovery disputes by putting the parties in a room with a video camera. “Like particles in physics, when lawyers are observed their behavior changes.”
- The tension between e-discovery cooperation and zealous advocacy of clients was discussed. It was pointed out that the ABA removed “zealous” from the Model Rules of Professional Conduct (aside from the preamble) in 1983 (sidenote: related article). John Barkett noted that the Federal Rules of Civil Procedure (FRCP) trump ethics rules, anyway.
- The preservation trigger is unchanged in the proposed changes to the FRCP.
- Stephen Burbank said that if the proposed changes to the FRCP reach Congress, blocking them would require legislation, and it is unlikely the divided Congress would come together to pass it.
- Judge Waxse said he didn’t think the proposed changes to the FRCP would have a significant impact on proportionality. The problem with proportionality is not where it is located in the rules, but the difficulty courts have in deciding the importance of a case before trial. He also mentioned a case where one side wanted a protective order, claiming e-discovery would cost $30 million, but dropped that estimate to $3 million when questioned, and the actual cost ended up being only thousands of dollars. He said judges talk to each other, so be careful about providing bad cost estimates.
- On the other hand, Judge Hopkins expects a sea change on proportionality from the new rules.
- Judges Hopkins and Otazo-Reyes both said that phasing (e.g., give one out of fifteen custodians to start) is an important tool for proportionality.
- Judge Waxse said it is important to establish what is disputed before doing discovery since there is no point in doing discovery on things that aren’t disputed.
- Judge Waxse said he thinks it is malpractice to not have a 502(d) order (clawback agreement) in place.
- Judge Hopkins said that when documents are clawed back they cannot be “used,” but that is ambiguous. They can’t be used directly in trial, but can the info they contain be used indirectly when questioning for a deposition? Prohibiting indirect use could require changing out the litigation team.
- Bill Speros expressed concern that the “marketing view” of TAR (that courts have said clearly that it is OK, and that past studies have proven that it is better than linear review), which is inaccurate, may feed back into the court and distort reality.
- Bill Speros predicted that random sampling will fail because prevalence is too low, making it hard to find things that way. He warned that the producing party may be happy to bring in additional custodians to dilute the richness of the document set and reduce the chances of finding anything really damning.
- Mary Mack said that predictive coding has been successfully used by receiving parties.
- Bill Speros said we should look at concepts/words rather than counting documents to determine whether predictive coding worked. He pointed out that a small number of documents typically contain a large amount of text, so weighting on a document basis tends to undercount the information in the long documents.
- When trying to control e-discovery costs, some red flags are: lack of responsiveness, no clarity in billing, and lots of linear review.
- Seth Eichenholtz warned that when dealing with international data you have to be careful about just stuffing it all onto a server in the U.S.
- When storing e-discovery data in the cloud, be aware of HIPAA requirements if there are any medical records involved.
- Law firms using cloud e-discovery services risk ceding the client relationship to the cloud service provider.
- Be careful about your right to your data in the cloud, especially upon termination of the contract.
- In one case a cloud provider had borrowed money from a bank to purchase hard drives and the bank repossessed the drives (with client data) when the cloud provider had financial trouble.
- Be careful about what insurance companies will cover when it comes to data in the cloud.
- With TAR, 75% recall is becoming a standard acceptable level.
- It’s easier to get agreement on using TAR when both sides of the dispute have a lot of documents, so both benefit from cost savings.
- Data should have an expiration date, like milk. If no action is taken to keep it, and there is no litigation hold, it should be deleted automatically.
- Predictive coding allows review of the documents that are most likely to be relevant earlier, before the reviewer becomes fatigued and more likely to make mistakes.
- Jon Talotta said some law firms internalize e-discovery (rather than outsourcing to a vendor) at no profit to keep the relationship with the client. Some law firms make good money on e-discovery, but only because they are able to fully utilize their capacity and they have clients that don’t have their own relationships with e-discovery service providers.
- A survey of the audience found that most law firms represented were just passing the e-discovery cost through to the client without trying to make a profit.
- Bill Speros said there may be ethical issues (ancillary services) around law firms trying to make a profit on e-discovery.
I want to thank Marshall Sklar for suggesting a correction to my notes.