Monthly Archives: May 2019

Highlights from EDRM Workshop 2019

The annual EDRM Workshop was held at Duke Law School starting on the evening of May 15th and ending at lunchtime on the 17th. It consisted of a mixture of panels, presentations, working group reports, and working sessions focused on various aspects of e-discovery. I’ve provided some highlights below. You can find my full set of photos here.

Herb Roitblat presented a paper on fear of missing out (FOMO). If 80% recall is achieved, is it legitimate for the requesting party to be concerned about what may have been missed in the 20% of the responsive documents that weren’t produced, or are the facts in that 20% duplicative of the facts found in the 80% that was produced?

A panel discussed the issues faced by in-house counsel. Employees want to use the latest tools, but then you have to worry about how to collect the data (e.g., Skype video recordings). How to preserve an iPhone? What if the phone gets lost or stolen? When doing TAR, can the classifier/model be moved between cases/clients? New vendors need to be able to explain how they are unique, they need to get established (nobody wants to be on the cutting edge, and it’s hard to get a pilot going), and they should realize that it can take a year to get approval. There are security/privacy problems with how law firms handle email. ROI tracking is important. Analytics is used heavily in investigations, and often in litigation, but they currently only use TAR for prioritization and QC, not to cull the population before review. Some law firms are averse to putting data in the cloud, but cloud providers may have better security than law firms.

The GDPR team is working on educating U.S. judges about GDPR and developing a code of conduct. The EDRM reference will be made easier to update. The AI group is focused on AI in legal (e.g., estimating recidivism, billing, etc.), not implications of AI for the law. The TAR group’s paper is out. The Privilege Logs group wants to avoid duplicating Sedona’s effort (sidenote: lawyers need to learn that an email is not priv just because a lawyer was CC’ed on it). The Stop Words team is trying to educate people about things such as regular expressions, and warned about cases where you want to search for a single letter or a term such as “AN” (for ammonium nitrate). The Proportionality group talked about the possibility of having a standard set of documents that should be produced for certain types of cases and providing guidelines for making proportionality arguments to the court.
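The Stop Words team's warning about short terms such as “AN” can be illustrated with a regular expression. What follows is only a minimal sketch using Python's standard `re` module; the sample sentence and the variable names are invented for illustration and do not come from the workshop.

```python
import re

# Hypothetical sample text, invented for illustration only.
text = "An order for AN was placed; an invoice for the AN shipment followed."

# Case-sensitive, whole-word pattern: matches the term "AN" but not the
# ordinary words "An" or "an", and not letters inside longer words.
strict = re.compile(r"\bAN\b")

# Case-insensitive version, similar to what a search engine that lowercases
# everything would effectively do (if it did not drop "an" as a stop word).
loose = re.compile(r"\ban\b", re.IGNORECASE)

strict_hits = [m.start() for m in strict.finditer(text)]
loose_hits = [m.group(0) for m in loose.finditer(text)]

print("strict match positions:", strict_hits)
print("loose matches:", loose_hits)
```

The strict pattern finds only the two uppercase occurrences, while the loose pattern also picks up the article “an”, which is the kind of noise the group warned about.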

A panel of judges said that cybersecurity is currently a big issue. Each court has its own approach to security. Rule 16 conferences need to be taken seriously. Judges don’t hire e-discovery vendors, so they don’t know costs. How do you collect a proprietary database? Lawyers can usually work it out without the judge. There is good cooperation when the situations of the parties aren’t too asymmetric. Attorneys need to be more specific in document requests and objections (no boilerplate). Attorneys should know the case better than the judge, and educate the judge in a way that makes the judge look good. Know the client’s IT systems and be aware of any data migration efforts. Stay up on technology (e.g., Slack and text messages). Have a 502(d) order (some people object because they fear the judge will assume priv review is not needed, but the judges didn’t believe that would happen). Protect confidential information that is exchanged (what if there is a breach?). When filing under seal, “attorneys’ eyes only” should be used very sparingly, and “confidential” is overused.

TAR vs. Keyword Search Challenge, Round 6 (Instant Feedback)

This was by far the most significant iteration of the ongoing exercise where I challenge an audience to produce a keyword search that works better than technology-assisted review (also known as predictive coding or supervised machine learning). There were far more participants than in previous rounds, and a structural change in the challenge allowed participants to get immediate feedback on the performance of their queries so they could iteratively improve them. A total of 1,924 queries were submitted by 42 participants (an average of 45.8 queries per person) and higher recall levels were achieved than in any prior version of the challenge, but the audience still couldn’t beat TAR.

In previous versions of the experiment, the audience submitted search queries on paper or through a web form using their phones, and I evaluated a few of them live on stage to see whether the audience was able to achieve higher recall than TAR. Because the number of live evaluations was so small, the audience had very little opportunity to use the results to improve their queries. In the latest iteration, participants each had their own computer in the lab at the 2019 Ipro Tech Show, and the web form immediately evaluated the query and gave the user feedback on the recall achieved. Furthermore, it displayed the relevance and important keywords for each of the top 100 documents matching the query, so participants could quickly discover useful new search terms to tweak their queries. This gave participants a significant advantage over a normal e-discovery scenario, since they could try an unlimited number of queries without incurring any cost to make relevance determinations on the retrieved documents in order to decide which keywords would improve the queries. The number of participants was significantly larger than in any of the previous iterations, and they had a full 20 minutes to try as many queries as they wanted. It was the best chance an audience has ever had of beating TAR. They failed.

To do a fair comparison between TAR and the keyword search results, recall values were compared for equal amounts of document review effort. In other words, for a specified amount of human labor, which approach gave the best production? For the search queries, the top 3,000 documents matching the query were evaluated to determine the number that were relevant so recall could be computed (the full population was reviewed in advance, so the relevance of all documents was known). That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed. If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.” If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.” The process was repeated with review of 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort. Participants could choose to submit queries for any of three topics: biology, medical industry, or law.
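The equal-effort comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the challenge: the function name `recall_at_effort` and all of the toy data are invented, and the sketch assumes that relevant documents found among the TAR training documents count toward recall.

```python
from typing import Sequence


def recall_at_effort(reviewed_relevance: Sequence[bool],
                     total_relevant: int,
                     effort: int) -> float:
    """Recall (%) after reviewing the first `effort` documents in order."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    found = sum(reviewed_relevance[:effort])  # True counts as 1
    return 100.0 * found / total_relevant


# Toy data, invented for illustration (the real runs used 3,000 or 6,000
# documents of review effort, with 200 cluster centers for TAR training).
total_relevant = 6   # known because the full population was reviewed
effort = 6           # same review budget for both approaches

# Keyword search: relevance of the top documents matching the query.
keyword_hits = [True, False, True, False, False, True, False, False]

# TAR: training documents are reviewed first and use up part of the budget,
# then the top-scoring documents are reviewed (assumption: both count).
tar_training = [True, False]
tar_top_scoring = [True, True, True, False, True, False]

keyword_recall = recall_at_effort(keyword_hits, total_relevant, effort)
tar_recall = recall_at_effort(tar_training + tar_top_scoring,
                              total_relevant, effort)

print(f"keyword recall: {keyword_recall:.1f}%")
print(f"TAR recall:     {tar_recall:.1f}%")
```

The point of fixing `effort` for both approaches is that neither side can improve its recall simply by reviewing more documents.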

The results below labeled “Avg Participant” are computed by finding the highest recall achieved by each participant and averaging those values together. These values are surely somewhat inflated, since in practice one would probably not go through so many iterations of honing the queries (especially since evaluating the efficacy of a query would normally involve considerable labor instead of being free and instantaneous). However, I wanted to give the participants as much advantage as I could, and including all of the queries instead of just the best ones would have biased the results too low, because people made mistakes or experimented with bad queries just to explore the documents. The results labeled “Best Participant” show the highest recall achieved by any participant (computed separately for Top 3,000 and Top 6,000, so they may be different queries).
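As a rough sketch of how the “Avg Participant” and “Best Participant” figures are defined, the following Python computes both from a list of per-query records. The record layout, the `summarize` helper, and the numbers are all hypothetical; they are not the challenge's actual data or code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (participant_id, recall_top_3000, recall_top_6000).
queries = [
    (1, 40.0, 55.0),
    (1, 52.0, 50.0),
    (2, 30.0, 61.0),
    (2, 25.0, 45.0),
    (3, 10.0, 20.0),
]


def summarize(records, column):
    """Return (avg participant, best participant) for one recall column.

    Each participant is represented by their single best query in that
    column; the average and the maximum are taken over those best values.
    """
    best_per_participant = defaultdict(float)
    for record in records:
        pid = record[0]
        best_per_participant[pid] = max(best_per_participant[pid],
                                        record[column])
    bests = list(best_per_participant.values())
    return mean(bests), max(bests)


avg_3000, best_3000 = summarize(queries, 1)
avg_6000, best_6000 = summarize(queries, 2)

print(f"Top 3,000: avg {avg_3000:.1f}, best {best_3000:.1f}")
print(f"Top 6,000: avg {avg_6000:.1f}, best {best_6000:.1f}")
```

In this toy data the best Top 3,000 value comes from participant 1 and the best Top 6,000 value from participant 2, which mirrors the note that the two “Best Participant” numbers may come from different queries.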

Biology	Recall (%)
	Top 3,000	Top 6,000
Avg Participant	54.5	69.5
Best Participant	66.0	83.2
TAR 3.0 SAL	72.5	91.0
TAR 3.0 CAL	75.5	93.0

Medical	Recall (%)
	Top 3,000	Top 6,000
Avg Participant	38.5	51.8
Best Participant	46.8	64.0
TAR 3.0 SAL	67.3	83.7
TAR 3.0 CAL	80.7	88.5

Law	Recall (%)
	Top 3,000	Top 6,000
Avg Participant	43.1	59.3
Best Participant	60.5	77.8
TAR 3.0 SAL	63.5	82.3
TAR 3.0 CAL	77.8	87.8

As you can see from the tables above, the best result for any participant never beat TAR (SAL or CAL) when there was an equal amount of document review performed. Furthermore, the average participant result for Top 6,000 never beat the TAR results for Top 3,000, though the best participant result sometimes did, so TAR typically gives a better result even with half as much review effort expended. The graphs below show the best results for each participant, with TAR shown in blue for comparison. The numbers in the legend are the ID numbers of the participants (the color for a particular participant is not consistent across topics).

The large number of people attempting the biology topic was probably because it was the default and was the topic I used to illustrate how to use the software.
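The comparisons drawn from the three recall tables can be checked mechanically. The sketch below copies the recall values from those tables into a dictionary; the helper `beats_any_tar` and the variable names are invented for this illustration, and “beats” is taken to mean strictly higher recall than at least one of the two TAR variants.

```python
# Recall (%) copied from the tables: topic -> row -> (Top 3,000, Top 6,000).
results = {
    "Biology": {
        "Avg Participant": (54.5, 69.5),
        "Best Participant": (66.0, 83.2),
        "TAR 3.0 SAL": (72.5, 91.0),
        "TAR 3.0 CAL": (75.5, 93.0),
    },
    "Medical": {
        "Avg Participant": (38.5, 51.8),
        "Best Participant": (46.8, 64.0),
        "TAR 3.0 SAL": (67.3, 83.7),
        "TAR 3.0 CAL": (80.7, 88.5),
    },
    "Law": {
        "Avg Participant": (43.1, 59.3),
        "Best Participant": (60.5, 77.8),
        "TAR 3.0 SAL": (63.5, 82.3),
        "TAR 3.0 CAL": (77.8, 87.8),
    },
}
TAR_ROWS = ("TAR 3.0 SAL", "TAR 3.0 CAL")
TOP_3000, TOP_6000 = 0, 1


def beats_any_tar(topic, row, participant_col, tar_col):
    """True if the participant value strictly exceeds SAL or CAL."""
    value = results[topic][row][participant_col]
    return any(value > results[topic][tar][tar_col] for tar in TAR_ROWS)


# Equal review effort: best participant vs. TAR at the same cutoff.
equal_effort_wins = [
    (topic, col)
    for topic in results
    for col in (TOP_3000, TOP_6000)
    if beats_any_tar(topic, "Best Participant", col, col)
]

# Double review effort: participant at Top 6,000 vs. TAR at Top 3,000.
avg_double_effort_wins = [
    t for t in results
    if beats_any_tar(t, "Avg Participant", TOP_6000, TOP_3000)
]
best_double_effort_wins = [
    t for t in results
    if beats_any_tar(t, "Best Participant", TOP_6000, TOP_3000)
]

print("equal effort wins:", equal_effort_wins)
print("avg participant wins with double effort:", avg_double_effort_wins)
print("best participant wins with double effort:", best_double_effort_wins)
```

Note that for the law topic the best participant's Top 6,000 recall (77.8) exceeds TAR 3.0 SAL at Top 3,000 but only ties TAR 3.0 CAL.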

One might wonder whether the participants could have done better if they had more than 20 minutes to work on their queries. The graphs below show the highest recall achieved by any participant as a function of time. You can see that results improved rapidly during the first 10 minutes, but it became hard to make much additional progress beyond that point. Also, over half of the audience continued to submit queries after the 20-minute contest, while I was giving the remainder of the presentation. 40% of the queries were submitted during the first 10 minutes, 40% were submitted during the second 10 minutes, and 20% were submitted while I was talking. Since roughly the same number of queries were submitted in the second 10 minutes as in the first 10 minutes, but much less progress was made, I think it is safe to say that time was not a big factor in the results.

In summary, even with a large pool of participants, ample time, and the ability to hone search queries based on instant feedback, nobody was able to generate a better production than TAR when the same amount of review effort was expended. It seems fair to say that keyword search often requires twice as much document review to achieve a production that is as good as what you would get with TAR.

Highlights from Ipro Tech Show 2019

Ipro renamed their conference from Ipro Innovations to the Ipro Tech Show this year. As always, it was held at the Talking Stick Resort in Arizona and it was very well organized. It started with a reception on April 29th that was followed by two days of talks. There were also training days bookending the conference on April 29th and May 2nd. After the keynote on Tuesday morning, there were five simultaneous tracks for the remainder of the conference, including a lot of hands-on work in computer labs. I was only able to attend a few of the talks, but I’ve included my notes below. You can find my full set of photos here. Videos and slides from the presentations are available here.

Dean Brown, who has been Ipro’s CEO for eight months, opened the conference with some information about himself and where the company is headed. He mentioned that the largest case in a single Ipro database so far was 11 petabytes from 400 million documents. Q1 2019 was the best quarter in the company’s history, and they had a 98% retention rate. They’ve doubled spending on development and other departments.

Next, there was a panel where three industry experts discussed artificial intelligence. AI can be used to analyze legal bills to determine which charges are reasonable. Google uses AI to monitor and prohibit behaviors within the company, such as stopping your account from being used to do things when you are supposed to be away. Only about 5% of the audience said they were using TAR. It was hypothesized that this is due to FRCP 26(g)’s requirement to certify the production as complete and correct. Many people use Slack instead of e-mail, and dealing with that is an issue for e-discovery. CLOC was mentioned as an organization helping corporations get a handle on legal spending.

The keynote was given by Kevin Surace, and mostly focused on AI. You need good data and have to be careful about spurious correlations in the data (he showed various examples that were similar to what you find here). An AI can watch a video and supplement it with text explaining what the person in the video is doing. One must be careful about fast-changing patterns and black swan events where there is no data available to model. Doctors are being replaced by software that is better informed about the most recent medical research. AI can review an NDA faster and more accurately than an attorney. There is now a news channel in China using an AI news anchor instead of a human to deliver the news. With autonomous vehicles, transportation will become free (supported by ads in the vehicle). AI will have an impact 100 times larger than the Internet.

I gave a talk titled “Technology: The Cutting Edge and Where We’re Headed” that focused on AI. I started by showing the audience five pairs of images from WhichFaceIsReal.com and challenged them to determine which face was real and which was generated by an AI. When I asked if anyone got all five right, I only saw one person raise their hand. When I asked if anyone got all five wrong, I saw three hands go up. Admittedly, I picked image pairs that I thought were particularly difficult, but the result is still a little scary.

I also gave a talk titled “TAR Versus Keyword Challenge” where I challenged the audience to construct a keyword search that worked better than technology-assisted review. The format of this exercise was very different from previous iterations, making it easy for participants to test and hone their queries. We had 1,924 queries submitted by 42 participants. They achieved the highest recall levels seen so far, but still couldn’t beat TAR. A detailed analysis is available here.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.
