DESI (Discovery of Electronically Stored Information) is a one-day workshop within ICAIL (International Conference on Artificial Intelligence and Law), which is held every other year. The conference was held in London last month. Rumor has it that the next ICAIL will be in North America, perhaps Montreal.
I’m not going to recap the DESI talks here, since the papers and slides are posted on the DESI VII website and you can read that content directly. The workshop opened with a keynote by Maura Grossman and Gordon Cormack. They covered the history of the TREC tracks relevant to e-discovery (Spam, Legal, and Total Recall), the limit on achievable recall imposed by ambiguous relevance (reviewer disagreement) for some documents, and the need for high recall when identifying privileged documents or documents where privacy must be protected. When looking for privileged documents, it is worth noting that many tools don’t make use of metadata. Documents that are missed may be technically relevant but not really important, so you should look at a sample of the misses to see whether they actually matter.
Between presentations based on submitted papers, there was a lunch where attendees split into four groups to discuss specific topics. The first group focused on e-discovery users. Visualizations were deemed “nice to look at” but not always useful; the real question is whether a visualization helps you answer a question faster. A second group talked about how to improve e-discovery, including attorneys’ aversion to algorithms and whether a substantial number of documents could be missed by CAL after the gain curve has plateaued. A third group discussed dreams for future technologies, like better case assessment and redacting video. The fourth group talked about the GDPR and speculated that the UK would comply with it.
DESI ended with a panel discussion on future directions for e-discovery. It was suggested that a government body or consumer group should evaluate TAR systems; apparently, NIST doesn’t want to do it because it is too political. One person pointed out that consumers aren’t really demanding such evaluations. It’s not just a matter of optimizing recall and precision: process (quality control and workflow) matters too, which makes comparisons hard. It was claimed that defense attorneys were motivated to lobby against federal rules encouraging the use of TAR because they don’t want incriminating documents to be found. People working in archiving are more enthusiastic about TAR.
Following DESI (and the other workshops conducted in parallel on the first day), ICAIL had three more days of paper presentations followed by another day of workshops. You can find the schedule here. I only attended the first day of non-DESI presentations, and there are two papers from that day that I want to point out. The first is Effectiveness Results for Popular e-Discovery Algorithms by Yang, David Grossman, Frieder, and Yurchak. They compared the performance of the CAL (relevance feedback) approach to TAR across several classification algorithms, feature types, and feature weightings, with and without LSI. They used several performance metrics, though they missed the one I consider most relevant for e-discovery: the review effort required to achieve an acceptable level of recall. Still, it is interesting to see such an exhaustive comparison of the algorithms used in TAR / predictive coding. They’ve made their code available here. The second paper is Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes by Conrad and Al-Kofahi. The authors analyze a large database of jury verdicts to determine the feasibility of building a system that gives strategic litigation advice (e.g., potential award size, trial duration, and suggested claims) based on a data-driven analysis of the case.
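To make the metric I have in mind concrete, here is a minimal sketch of "review effort to reach a target recall": given a classifier's ranking of documents (best first) and ground-truth relevance labels, count how many documents must be reviewed before a target fraction of the relevant ones has been found. The function name, data, and 75% target are my own illustration, not anything from the Yang et al. paper.

```python
def effort_to_recall(ranking, labels, target_recall=0.75):
    """Return the number of documents that must be reviewed, in ranked
    order, to find target_recall of the relevant documents.

    ranking: document ids ordered best-first by the classifier.
    labels:  dict mapping doc id -> True if the document is relevant.
    Returns None if the target recall is never reached.
    """
    total_relevant = sum(labels.values())
    needed = target_recall * total_relevant
    found = 0
    for reviewed, doc in enumerate(ranking, start=1):
        found += labels[doc]          # count relevant docs seen so far
        if found >= needed:
            return reviewed
    return None

# Toy example: 7 documents, 4 of them relevant (ids 1, 3, 5, 7).
labels = {1: True, 2: False, 3: True, 4: False, 5: True, 6: False, 7: True}
ranking = [1, 3, 2, 5, 4, 7, 6]       # hypothetical classifier output

print(effort_to_recall(ranking, labels))       # → 4 (3 of 4 relevant found)
print(effort_to_recall(ranking, labels, 1.0))  # → 6 (all 4 relevant found)
```

A better ranking shows up directly as a smaller number here, which is why this measure tracks review cost more closely than precision or recall at a fixed cutoff.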