Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well. It is important for users of such software to avoid drawing the wrong conclusions from those numbers.
Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):
The one on the left is incredibly good, with 97% precision at 90% recall. The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).
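To make the definitions concrete, here is a minimal sketch (in Python, not taken from any vendor's software) of how a precision-recall curve like the ones above is traced out from a ranked document list: sort the documents by the model's score, then at each review cutoff compute precision (the fraction of documents reviewed so far that are relevant) and recall (the fraction of all relevant documents found so far).

```python
# Minimal sketch: trace a precision-recall curve from a ranked document list.
def precision_recall_points(labels_by_rank):
    """labels_by_rank: relevance labels (1 or 0), ordered from highest to lowest score."""
    total_relevant = sum(labels_by_rank)
    points = []
    relevant_so_far = 0
    for reviewed, label in enumerate(labels_by_rank, start=1):
        relevant_so_far += label
        precision = relevant_so_far / reviewed        # relevant docs / docs reviewed
        recall = relevant_so_far / total_relevant     # relevant docs / all relevant docs
        points.append((recall, precision))
    return points
```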
Why are the two curves so different? They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample of documents used for training. The only difference is the categorization task being attempted, i.e., what type of document we consider to be relevant. Both tasks have nearly the same prevalence of relevant documents (0.986% for the left and 1.131% for the right), but the task on the left is very easy and the one on the right is a lot harder. So, when a vendor quotes performance numbers, you need to keep in mind that they are only meaningful for the specific document set and task that they came from. Performance for a different task or document set may be very different. Comparing a vendor’s performance numbers to those from another source computed for a different categorization task on a different document set would be comparing apples to oranges.
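As a quick sanity check on the 4.7% figure quoted above, the fraction of the population that must be reviewed to reach a target recall follows directly from the recall, the prevalence, and the precision at that recall (setting aside the documents reviewed for training and testing):

```python
# Rough check of the "review only ~4.7% of the population" claim for the harder task,
# using the numbers quoted above: docs reviewed = (relevant docs found) / precision.
prevalence = 0.01131   # fraction of the population that is relevant (right-hand curve)
recall = 0.70          # fraction of the relevant documents we want to find
precision = 0.17       # precision achieved at that recall
relevant_found = recall * prevalence            # as a fraction of the whole population
fraction_reviewed = relevant_found / precision
print(f"fraction of population reviewed: {fraction_reviewed:.2%}")  # about 4.7%
```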
Fair comparison of different predictive coding approaches is difficult, and one must be careful not to extrapolate results from any study too far. As an analogy, consider performing experiments to determine whether fertilizer X works better than fertilizer Y. You might plant marigolds in each fertilizer, apply the same amount of water and sunlight, and measure plant growth. In other words, keep everything the same except the fertilizer. That would give a result that applies to marigolds with the specific amount of sunlight and water used. Would the same result occur for carrots? You might take several different types of plants and apply the same experiment to each to see if there is a consistent winner. What if more water was used? Maybe fertilizer X works better for modest watering (it absorbs and retains water better) and fertilizer Y works better for heavy watering. You might want to present results for different amounts of water so people could choose the optimal fertilizer for the amount of rainfall in their locations. Or, you might determine the optimal amount of water for each, and declare the fertilizer that gives the most growth with its optimal amount of water the winner, which is useful only if gardeners/farmers can adjust water delivery. The number of experiments required to cover every possibility grows exponentially with the number of parameters that can be adjusted.
Predictive coding is more complicated because there are more interdependent parts that can be varied. Comparing classification algorithms on one document set may give a result that doesn’t apply to others, so you might test on several document sets (some with long documents, some with short, some with high prevalence, some with low, etc.), much like testing fertilizer on several types of plants, but even a consistent winner across those sets isn’t guaranteed to perform best on some untested set of documents. Does a different algorithm win if the amount of training data is higher or lower, similar to a different fertilizer winning if the amount of water is changed? What if the nature of the training data (e.g., random sample vs. active learning) is changed? The training approach can impact different classification algorithms differently (e.g., an active learning algorithm can be optimized for a specific classification algorithm), making the results from a study of one classification algorithm inapplicable to a different algorithm. When comparing two classification algorithms where one is known to perform poorly on high-dimensional data, should you use feature selection techniques to reduce the dimensionality of the data for that algorithm, on the theory that that is how it would be used in practice, knowing that any poor performance may then come from removing an important feature rather than from a failure of the classification algorithm itself?
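To give a feel for how many knobs such an experiment has, here is a minimal sketch using scikit-learn on synthetic data (the particular classifiers, training-set sizes, and feature counts are arbitrary choices for illustration, not anyone's recommended setup). It compares a classifier that handles high-dimensional data well against one that tends to struggle there, with and without feature selection, across several amounts of training data. The point is how many choices shape the outcome, not which algorithm wins on this toy data.

```python
# Sketch of a classifier comparison with several adjustable parameters:
# training-set size and whether feature selection is applied. Results on
# this synthetic data say nothing about real document collections.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# High-dimensional, low-prevalence data loosely mimicking a categorization task.
X, y = make_classification(n_samples=10000, n_features=500, n_informative=30,
                           weights=[0.99, 0.01], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.5,
                                                  stratify=y, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-NN (all features)": KNeighborsClassifier(),
    # Feature selection as a crutch for the algorithm that struggles in high
    # dimensions; it may also discard features the task actually needs.
    "k-NN + top-50 features": make_pipeline(SelectKBest(f_classif, k=50),
                                            KNeighborsClassifier()),
}

for n_train in (500, 2000, 4000):   # vary the amount of training data
    for name, model in candidates.items():
        model.fit(X_pool[:n_train], y_pool[:n_train])
        scores = model.predict_proba(X_test)[:, 1]
        # Average precision = area under the precision-recall curve.
        ap = average_precision_score(y_test, scores)
        print(f"train={n_train:4d}  {name:25s}  average precision={ap:.2f}")
```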
What you definitely should not do is plant a cactus in fertilizer X and a sunflower in fertilizer Y and compare the growth rates to draw a conclusion about which fertilizer is better. Likewise, you should not compare predictive coding performance numbers that came from different document sets or categorization tasks.