Bill Dimm will be speaking with John Tredennick and Tom Gricks on the TAR Talk podcast about his recent article TAR, Proportionality, and Bad Algorithms (1-NN). The podcast will be on Tuesday, November 20, 2018. You can register here or download it later on iTunes or Google Play.
This iteration of the challenge was performed during the Digging into TAR session at the 2018 Northeast eDiscovery & IG Retreat. The structure was similar to round 3, but the audience was bigger. As before, the goal was to see whether the audience could construct a keyword search query that performed better than technology-assisted review.
There are two sensible ways to compare performance: see which approach reaches a fixed level of recall with the least review effort, or see which approach reaches the highest recall with a fixed amount of review effort. An approach that compares results having different recall and different review effort cannot give a definitive conclusion about which result is best without making arbitrary assumptions about the trade-off between recall and effort (this is why performance measures that mix recall and precision together, such as the F1 score, are not sensible for e-discovery).
For the challenge we fixed the amount of review effort and measured the recall achieved, because that was an easier process to carry out under the circumstances. Specifically, we took the top 3,000 documents matching the search query, reviewed them (this was instantaneous because the whole population was reviewed in advance), and measured the recall achieved. That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed. If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.” If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.” The process was repeated with 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.
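For readers who want the bookkeeping spelled out, here is a minimal sketch in Python of the fixed-effort comparison just described; the variable names are hypothetical, and it assumes every document has a ground-truth label (as the fully reviewed challenge population did):

```python
# Hypothetical helper: compute recall after reviewing the top `effort`
# documents of a ranking (query matches sorted by match score, or TAR's
# relevance-score ranking).
def recall_at_effort(ranked_doc_ids, labels, effort):
    total_relevant = sum(labels.values())          # labels: doc id -> True/False
    found = sum(labels[d] for d in ranked_doc_ids[:effort])
    return found / total_relevant

# Fixed-effort comparison for a 3,000-document review budget:
#   recall_at_effort(query_ranking, labels, 3000)   vs.
#   recall_at_effort(tar_ranking,   labels, 3000)
```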
Individuals in the audience submitted queries through a web form using smartphones or laptops, and (due to limited time) I executed some of the queries in front of the audience. They could learn useful keywords from the documents matching the queries, then tweak their queries and resubmit them. Unlike a real e-discovery project, they had very limited time and no familiarity with the documents. The audience could choose to work on any of three topics: biology, medical industry, or law. In the results below, the queries are labeled with the submitter’s initials (some people gave only a first name, so there is only one initial) followed by a number if they submitted more than one query. Two queries were omitted because they had less than 1% recall (the participants apparently misunderstood the task). The queries that were evaluated in front of the audience were E-1, U, AC-1, and JM-1. Discussion of the results follows the tables, graphs, and queries.
| Biology | Top 3,000 | Top 6,000 |
|---|---|---|
| TAR 3.0 SAL | 72.5% | 91.0% |
| TAR 3.0 CAL | 75.5% | 93.0% |

| Medical Industry | Top 3,000 | Top 6,000 |
|---|---|---|
| TAR 3.0 SAL | 67.3% | 83.7% |
| TAR 3.0 CAL | 80.7% | 88.5% |

| Law | Top 3,000 | Top 6,000 |
|---|---|---|
| TAR 3.0 SAL | 63.5% | 82.3% |
| TAR 3.0 CAL | 77.8% | 87.8% |
E-1) biology OR microbiology OR chemical OR pharmacodynamic OR pharmacokinetic
E-2) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence
E-3) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis
E-4) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study
E-5) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table
E-6) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table OR research
U) Transplant OR organ OR cancer OR hypothesis
AC-2) legal OR attorney OR (defendant AND plaintiff) OR precedent OR verdict OR deliberate OR motion OR dismissed OR granted
JM-1) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge
JM-2) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge OR defendant OR plaintiff OR court OR plaintiffs OR attorneys OR lawyers OR defense
K-1) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena
K-2) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena OR justice
C) (law OR legal OR criminal OR civil OR litigation) AND NOT (politics OR proposed OR pending)
R) Court OR courtroom OR judge OR judicial OR judiciary OR law OR lawyer OR legal OR plaintiff OR plaintiffs OR defendant OR defendants OR subpoena OR sued OR suing OR sue OR lawsuit OR injunction OR justice
None of the keyword searches achieved higher recall than TAR for an equal amount of review effort. All six of the biology queries were submitted by one person. The first query was evaluated in front of the audience, and his first revision to the query did help, but subsequent (blind) revisions tended to hurt more than they helped. For biology, review of 3,000 documents with TAR gave better recall than review of 6,000 documents with any of the queries. Only a single query was submitted for the medical industry, and it underperformed TAR substantially. Five people submitted a total of eight queries for the law category, and the audience had its best results on that topic, which isn’t surprising: an audience full of lawyers and litigation support people would be expected to be especially good at identifying keywords related to the law. Even the best queries achieved lower recall with review of 6,000 documents than TAR 3.0 CAL achieved with review of only 3,000, though a few queries did beat TAR 3.0 SAL’s top-3,000 recall when given twice as much document review.
This iteration of the challenge, held at the Education Hub at ILTACON 2018, was structured somewhat differently from round 1 and round 2 to give the audience a better chance of beating TAR. Instead of submitting search queries on paper, participants submitted them through a web form using their phones, which allowed them to repeatedly tweak their queries and resubmit them. I executed the queries in front of the participants, so they could see the exact recall achieved (since all documents were marked as relevant or non-relevant by a human reviewer in advance) almost instantaneously, and they could use the performance information from their own queries and those of other participants to guide improvements. This actually gave the participants an advantage over a real e-discovery project, where performance measurements would normally require human evaluation of a random sample from the search output, making several performance-guided iterations of a query very expensive in terms of review labor. The audience got those performance evaluations for free even though the goal was to compare recall achieved for equal amounts of document review effort. On the other hand, the audience still had the disadvantages of limited time and no familiarity with the documents.
As before, recall was evaluated for the top 3,000 and top 6,000 documents, which was enough to achieve high recall with TAR (even with the training documents included, so total review effort for TAR and the search queries was the same). Audience members were free to work on any of the three topics used in previous versions of the challenge: law, medical industry, or biology. Unfortunately, the audience was much smaller than in previous versions of the challenge, and nobody chose to submit a query for the biology topic.
Previously, the TAR results were achieved by using the TAR 3.0 workflow to train with 200 cluster centers, sorting the documents by the resulting relevance scores, and reviewing top-scoring documents until the desired amount of review effort was expended, without allowing predictions to be updated during that review (e.g., review of 200 training docs plus 2,800 top-scoring docs to get the “Top 3,000” result). I’ll call this TAR 3.0 SAL (SAL = Simple Active Learning, meaning the system is not allowed to learn during the review of top-scoring documents). In practice you wouldn’t do that. If you were reviewing top-scoring documents, you would allow the system to continue learning (CAL); you would use SAL only if you were producing top-scoring documents without reviewing them, since allowing learning to continue during the review reduces the amount of review needed to achieve a desired level of recall. I used TAR 3.0 SAL in previous iterations because I wanted to simulate the full review in front of the audience in a few seconds, and TAR 3.0 CAL would have been slower. This time, I did the TAR calculations in advance and present both the SAL and CAL results so you can see how much difference the additional learning from CAL made.
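To make the SAL/CAL distinction concrete, here is a simplified sketch in Python; the function names (train, score, review) and the batch size are hypothetical stand-ins, not any product’s API:

```python
def sal_review(training_docs, unreviewed, budget, train, score, review):
    """SAL: train once on the reviewed cluster centers, then review the
    top-scoring documents with no further learning."""
    model = train(training_docs)
    ranked = sorted(unreviewed, key=lambda d: score(model, d), reverse=True)
    return [(d, review(d)) for d in ranked[:budget]]

def cal_review(training_docs, unreviewed, budget, train, score, review,
               batch_size=100):
    """CAL: re-rank after every batch so later batches benefit from each
    judgment made so far."""
    reviewed = []
    while budget > 0 and unreviewed:
        model = train(training_docs + reviewed)        # learning continues
        ranked = sorted(unreviewed, key=lambda d: score(model, d), reverse=True)
        for doc in ranked[:min(batch_size, budget)]:
            reviewed.append((doc, review(doc)))        # judgment fed back in
            unreviewed.remove(doc)
        budget -= batch_size
    return reviewed
```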
One other difference compared to previous versions of the challenge is how I’ve labeled the queries below. This time, the number indicates which participant submitted the query and the letter indicates which of his/her queries is being analyzed (if the person submitted more than one), rather than a tweak that I made to try to improve the result. In other words, all variations were tweaks made by the audience instead of by me. Discussion of the results follows the tables, graphs, and queries below.
| Medical Industry | Top 3,000 | Top 6,000 |
|---|---|---|
| TAR 3.0 SAL | 67.3% | 83.7% |
| TAR 3.0 CAL | 80.7% | 88.5% |

| Law | Top 3,000 | Top 6,000 |
|---|---|---|
| TAR 3.0 SAL | 63.5% | 82.3% |
| TAR 3.0 CAL | 77.8% | 87.8% |
1a) Hospital AND New AND therapies
1b) Hospital AND New AND (physicians OR doctors)
2) Copyright AND mickey AND mouse
3a) Schedule OR Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement
3b) Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal
3c) Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal OR Plaintiff OR Defendant
4) Privacy OR (Personally AND Identifiable AND Information) OR PII OR (Protected AND Speech)
TAR won across the board, as in previous iterations of the challenge. Only one person submitted queries for the medical industry topic. His/her revised query did a better job of finding relevant documents, but still returned fewer than 3,000 documents and fared far worse than TAR — the query was just not broad enough to achieve high recall. Three people submitted queries on the law topic. One of those people revised the query a few times and got decent results (shown in green), but still fell far short of the TAR result, with review of 6,000 documents from the best query finding fewer relevant documents than review of half as many documents with TAR 3.0 SAL (TAR 3.0 CAL did even better). It is unfortunate that the audience was so small, since a larger audience might have done better by learning from each other’s submissions. Hopefully I’ll be able to do this with a bigger audience in the future.
Should proportionality arguments allow producing parties to get away with poor productions simply because they wasted a lot of effort due to an extremely bad algorithm? This article examines one such bad algorithm that has been used in major review platforms, and shows that it could be made vastly more effective with a very minor tweak. Are lawyers who use platforms lacking the tweak committing malpractice by doing so?
Last year I was moderating a panel on TAR (predictive coding) and I asked the audience what recall level they normally aim for when using TAR. An attendee responded that it was a bad question because proportionality only required a reasonable effort. Much of the audience expressed agreement. This should concern everyone. If quality of result (e.g., achieving a certain level of recall) is the goal, the requesting party really has no business asking how the result was achieved–any effort wasted by choosing a bad algorithm is borne by the producing party. On the other hand, if the target is expenditure of a certain amount of effort, doesn’t the requesting party have the right to know and object if the producing party has chosen a methodology that is extremely inefficient?
The algorithm I’ll be picking on today is a classifier called 1-nearest neighbor, or 1-NN. You may be using it without ever having heard that name, so pay attention to my description of it and see if it sounds familiar. To predict whether a document is relevant, 1-NN finds the single most similar training document and predicts the relevance of the unreviewed document to be the same. If a relevance score is desired instead of a yes/no relevance prediction, the relevance score can be taken to be the similarity value if the most similar training document is relevant, and it can be taken to be the negative of the similarity value if the most similar training document is non-relevant. Here is a precision-recall curve for the 1-NN algorithm used in a TAR 1.0 workflow trained with randomly-selected documents:
The precision falls off a cliff above 60% recall. This is not due to inadequate training–the cliff shown above will not go away no matter how much training data you add. To understand the implications, realize that if you sort the documents by relevance score and review from the top down until you reach the desired level of recall, 1/P at that recall tells you the average number of documents you’ll review for each relevant document you find. At 60% recall, precision is 67%, so you’ll review 1.5 documents (1/0.67 ≈ 1.5) for each relevant document you find. There is some effort wasted in reviewing those 0.5 non-relevant documents for each relevant document you find, but it’s not too bad. If you keep reviewing documents until you reach 70% recall, things get much worse. Precision drops to about 8%, so you’ll encounter so many non-relevant documents after you get past 60% recall that you’ll end up reviewing 12.5 documents for each relevant document you find. You would surely be tempted to argue that proportionality says you should be able to stop at 60% recall because the small gain in result quality of going from 60% recall to 70% recall would cost nearly ten times as much review effort. But does it really have to be so hard to get to 70% recall?
It’s very easy to come up with an algorithm that can reach higher recall without so much review effort once you understand why the performance cliff occurs. When you sort the documents by relevance score with 1-NN, the documents where the most similar training document is relevant will be at the top of the list. The performance cliff occurs when you start digging into the documents where the most similar training document is non-relevant. The 1-NN classifier does a terrible job of determining which of those documents has the best chance of being relevant because it ignores valuable information that is available. Consider two documents, X and Y, that both have a non-relevant training document as the most similar training document, but document X has a relevant training document as the second most similar training document and document Y has a non-relevant training document as the second most similar. We would expect X to have a better chance of being relevant than Y, all else being equal, but 1-NN cannot distinguish between the two because it pays no attention to the second most similar training document. Here is the result for 2-NN, which takes the two most similar training documents into account:
Notice that 2-NN easily reaches 70% recall (1/P is 1.6 instead of 12.5), but it does have a performance cliff of its own at a higher level of recall because it fails to make use of information about the third most similar training document. If we utilize information about the 40 most similar training documents we get much better performance as shown by the solid lines here:
It was the presence of non-relevant training documents that tripped up the 1-NN algorithm: a non-relevant nearest neighbor effectively hides evidence (similar relevant training documents) that a document might be relevant. You might therefore think the performance cliff could be avoided by omitting non-relevant documents from the training. The result of doing that is shown with dashed lines in the figure above. Omitting non-relevant training documents does help 1-NN at high recall, though it is still far worse than 40-NN with the non-relevant training documents included (omitting the non-relevant training documents actually harms 40-NN, as shown by the red dashed line). A workflow that focuses on reviewing documents that are likely to be relevant, such as TAR 2.0, rather than training with random documents, will be less affected by 1-NN’s shortcomings, but why would you ever suffer the poor performance of 1-NN when 40-NN requires such a minimal modification of the algorithm?
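To see in code how small the change is, here is a sketch of the scoring just described. The 1-NN score (k=1) comes straight from the description above; how a k-NN classifier aggregates multiple neighbors varies, so the signed-similarity sum used for k > 1 is an assumption for illustration, not necessarily what the tested 40-NN did:

```python
import numpy as np

def knn_relevance_score(doc_vec, train_vecs, train_labels, k=1):
    """k=1 reproduces the 1-NN score described above: +similarity if the
    nearest training document is relevant, -similarity if not.  Raising k
    (e.g., k=40) is the "minimal tweak": evidence from the next most
    similar training documents is no longer ignored."""
    # cosine similarity of the document against every training document
    sims = (train_vecs @ doc_vec) / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec))
    top = np.argsort(-sims)[:k]                 # k nearest training documents
    signs = np.where(train_labels[top], 1.0, -1.0)
    return float(np.sum(signs * sims[top]))    # relevant neighbors raise the
                                               # score, non-relevant lower it
```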
You might wonder whether the performance cliff shown above is just an anomaly. Here are precision-recall curves for several additional categorization tasks with 1-NN on the left and 40-NN on the right.
Sometimes the 1-NN performance cliff occurs at high enough recall to allow a decent production, but sometimes it keeps you from finding even half of the relevant documents. Should a court accept less than 50% recall when the most trivial tweak to the algorithm could have achieved much higher recall with roughly the same amount of document review?
Of course, there are many factors beyond the quality of the classifier, such as the choice of TAR 1.0 (SPL and SAL), TAR 2.0 (CAL), or TAR 3.0 workflows, that impact the efficiency of the process. The research by Grossman and Cormack that courts have relied upon to justify the use of TAR because it reaches recall that is comparable to or better than an exhaustive human review is based on CAL (TAR 2.0) with good classifiers, whereas some popular software uses TAR 1.0 (less efficient if documents will be reviewed before production) and poor classifiers such as 1-NN. If the producing party vows to reach high recall and bears the cost of choosing bad software and/or processes to achieve that, there isn’t much for the requesting party to complain about (though the producing party could have a bone to pick with an attorney or service provider who recommended an inefficient approach). On the other hand, if the producing party argues that low recall should be tolerated because decent recall would require too much effort, it seems that asking whether the algorithms used are unnecessarily inefficient would be appropriate.
During my presentation at the South Central eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding). This is similar to the experiment done a few months earlier. See this article for more details. The audience again worked in groups to construct keyword searches for two topics. One topic, articles on law, was the same as last time. The other topic, the medical industry, was new (it replaced biology).
Performance was evaluated by comparing the recall achieved for equal amounts of document review effort (the population was fully categorized in advance, so measurements are exact, not estimates). Recall for the top 3,000 keyword search matches was compared to recall from reviewing 202 training documents (2 seed documents plus 200 cluster centers using the TAR 3.0 method) and 2,798 documents having the highest relevance scores from TAR. Similarly, recall from the top 6,000 keyword search matches was compared to recall from review of 6,000 documents with TAR. Recall from all documents matching a search query was also measured to find the maximum recall that could be achieved with the query.
The search queries are shown after the performance tables and graphs. When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to remove restrictions that were limiting the number of relevant documents that could be found. The results are discussed at the end of the article.
| Medical Industry Query | Total Matches | Top 3,000 | Top 6,000 | All |
|---|---|---|---|---|

| Law Query | Total Matches | Top 3,000 | Top 6,000 | All |
|---|---|---|---|---|
1a) medical AND (industry OR business) AND NOT (scientific OR research)
1b) medical AND (industry OR business)
2) (revenue OR finance OR market OR brand OR sales) AND (hospital OR health OR medical OR clinical)
3a) (medical OR hospital OR doctor) AND (HIPPA OR insurance)
3b) medical OR hospital OR doctor OR HIPPA OR insurance
4a) (earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine) AND NOT (study OR research OR academic)
4b) earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine
5) FRCP OR Fed OR litigation OR appeal OR immigration OR ordinance OR legal OR law OR enact OR code OR statute OR subsection OR regulation OR rules OR precedent OR (applicable AND law) OR ruling
6) judge OR (supreme AND court) OR court OR legislation OR legal OR lawyer OR judicial OR law OR attorney
As before, TAR won across the board, but there were some surprises this time.
For the medical industry topic, review of 3,000 documents with TAR achieved higher recall than any keyword search achieved with review of 6,000 documents, very similar to results from a few months ago. When all documents matching the medical industry search queries were analyzed, two queries did achieve high recall (3b and 4b, which are queries I tweaked to achieve higher recall), but they did so by retrieving a substantial percentage of the 100,000-document population (16,756 and 58,510 documents respectively). TAR can reach any level of recall by simply taking enough documents from the sorted list—TAR doesn’t run out of matches like a keyword search does. TAR matches the 94.6% recall that query 4b achieved (requiring review of 58,510 documents) with review of only 15,500 documents.
Results for the law topic were more interesting. The two queries submitted for the law topic both performed better than any of the queries submitted for that topic a few months ago. Query 6 gave the best results, with TAR beating it by only a modest amount. If all 25,370 documents matching query 6 were reviewed, 95.7% recall would be achieved, which TAR could accomplish with review of 24,000 documents. It is worth noting that TAR 2.0 would be more efficient, especially at very high recall. TAR 3.0 gives the option to produce documents without review (not utilized for this exercise), plus computations are much faster due to there being vastly fewer training documents, which is handy for simulating a full review live in front of an audience in a few seconds.
During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics. Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics). I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints. This article presents the results for all the queries submitted by the audience.
The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project. Still, it was entertaining. The topics are fairly simple, so a large percentage of the relevant documents can be found with a short search using broad terms. For example, a search for “biology” would find 37% of the biology documents. A search for “law” would find 71% of the law articles. The trick is to find the relevant documents without pulling in too many of the non-relevant ones.
To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively). I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs. For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0). To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations. If the audience’s search query included phrases, they were converted to AND searches enclosed in parentheses. If the audience’s query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard). I noticed that there were a lot of irrelevant words that matched the wildcards. For example, “cell*” in a biology search would match cellphone, cellular, cellar, cellist, etc., but I excluded such words. I would highly recommend that people using keyword search check what their wildcards are actually matching–you may be pulling in a lot of irrelevant words. I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact). When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to retrieve more documents.
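As an illustration of the kind of wildcard check recommended above, here is a small sketch that expands a trailing-* wildcard against the index vocabulary so the matches can be inspected and pruned; index_vocabulary is a hypothetical stand-in:

```python
import re

def expand_wildcard(pattern, vocabulary):
    """Expand a trailing-* wildcard into the indexed terms it would match,
    so the list can be inspected and pruned before running the search."""
    stem = re.compile(re.escape(pattern[:-1]) + r"\w*$")
    return sorted(term for term in vocabulary if stem.match(term))

# expand_wildcard("cell*", index_vocabulary) might return ["cell", "cellar",
# "cellist", "cellphone", "cells", "cellular", ...]; only the biology-related
# terms would be kept.
```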
The tables below show the results. The actual queries are displayed below the tables. Discussion of the results is at the end.
| Biology Query | Total Matches | Top 3,000 | Top 6,000 | All Matches |
|---|---|---|---|---|

| Law Query | Total Matches | Top 3,000 | Top 6,000 | All Matches |
|---|---|---|---|---|
1) organism OR microorganism OR species OR DNA
2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)
3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria
4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
4b) (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))
5b) ruling OR judge OR justice OR (appellate OR appellant)
6) amendments OR FRE OR whistleblower
7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending
TAR beat keyword search across the board for both tasks. The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search. In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches. The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.
DESI (Discovery of Electronically Stored Information) is a one-day workshop within ICAIL (International Conference on Artificial Intelligence and Law), which is held every other year. The conference was held in London last month. Rumor has it that the next ICAIL will be in North America, perhaps Montreal.
I’m not going to go into the DESI talks based on papers and slides that are posted on the DESI VII website since you can read that content directly. The workshop opened with a keynote by Maura Grossman and Gordon Cormack where they talked about the history of TREC tracks that are relevant to e-discovery (Spam, Legal, and Total Recall), the limitation on the recall that can be achieved due to ambiguous relevance (reviewer disagreement) for some documents, and the need for high recall when it comes to identifying privileged documents or documents where privacy must be protected. When looking for privileged documents it is important to note that many tools don’t make use of metadata. Documents that are missed may be technically relevant but not really important — you should look at a sample to see whether they are important.
Between presentations based on submitted papers there was a lunch where people separated into four groups to discuss specific topics. The first group focused on e-discovery users. Visualizations were deemed “nice to look at” but not always useful — does the visualization help you to answer a question faster? Another group talked about how to improve e-discovery, including attorney aversion to algorithms and whether a substantial number of documents could be missed by CAL after the gain curve had plateaued. Another group discussed dreams about future technologies, like better case assessment and redacting video. The fourth group talked about GDPR and speculated that the UK would obey GDPR.
DESI ended with a panel discussion about future directions for e-discovery. It was suggested that a government or consumer group should evaluate TAR systems. Apparently, NIST doesn’t want to do it because it is too political. One person pointed out that consumers aren’t really demanding it. It’s not just a matter of optimizing recall and precision — process (quality control and workflow) matters, which makes comparisons hard. It was claimed that defense attorneys were motivated to lobby against the federal rules encouraging the use of TAR because they don’t want incriminating things to be found. People working in archiving are more enthusiastic about TAR.
Following DESI (and other workshops conducted in parallel on the first day), ICAIL had three more days of paper presentations followed by another day of workshops. You can find the schedule here. I only attended the first day of non-DESI presentations. There are two papers from that day that I want to point out. The first is Effectiveness Results for Popular e-Discovery Algorithms by Yang, David Grossman, Frieder, and Yurchak. They compared performance of the CAL (relevance feedback) approach to TAR for several different classification algorithms, feature types, feature weightings, and with/without LSI. They used several different performance metrics, though they missed the one I think is most relevant for e-discovery (review effort required to achieve an acceptable level of recall). Still, it is interesting to see such an exhaustive comparison of algorithms used in TAR / predictive coding. They’ve made their code available here. The second paper is Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes by Conrad and Al-Kofahi. The authors analyze a large database of jury verdicts in an effort to determine the feasibility of building a system to give strategic litigation advice (e.g., potential award size, trial duration, and suggested claims) based on a data-driven analysis of the case.
A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding. This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.
The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review. It recaps the Da Silva Moore v. Publicis Groupe case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.
Paragraph 33 enumerates ten reasons for approving predictive coding. The second reason on the list is:
> There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.
The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results. If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:
If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents having relevance scores above an appropriately calculated cutoff. Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the large number of non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes). If it is possible (through enough effort) to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results? TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery). Effectively, participants were told to choose their relevance score cutoffs in a way that tried to balance the desire for high recall with other concerns (high precision). If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.
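To make the role of the cutoff concrete, here is a sketch for the evaluation setting where true labels are known (as at TREC); it contrasts the review depth needed to reach a recall target with the depth an F1-maximizing cutoff would choose. The function names are mine, not from any TREC tooling:

```python
def review_effort_for_recall(scores, labels, target_recall):
    """Documents that must be reviewed/produced (top-down by relevance
    score) to reach the target recall.  Works even for a poorly trained
    system; the effort just grows.  Assumes at least one relevant doc."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_relevant = sum(labels)                 # labels: 1 relevant, 0 not
    found = 0
    for depth, i in enumerate(order, start=1):
        found += labels[i]
        if found >= target_recall * total_relevant:
            return depth
    return len(scores)                           # target unreachable

def f1_maximizing_cutoff_depth(scores, labels):
    """Ranking depth that maximizes F1 (the TREC 2009 instruction), which
    may stop far short of high recall on a poorly trained system."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_relevant = sum(labels)
    best_f1, best_depth, found = 0.0, 0, 0
    for depth, i in enumerate(order, start=1):
        found += labels[i]
        precision, recall = found / depth, found / total_relevant
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_f1, best_depth = f1, depth
    return best_depth
```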
The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen). The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article). Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around. You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall. The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.
The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor. Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search. If we assume the high end of the price range applies to 3 million documents, that works out to $0.22 per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.
This article reviews TAR 1.0, 2.0, and the new TAR 3.0 workflow. It then compares performance on seven categorization tasks of varying prevalence and difficulty. You may find it useful to read my article on gain curves before reading this one.
In some circumstances it may be acceptable to produce documents without reviewing all of them. Perhaps it is expected that there are no privileged documents among the custodians involved, or maybe it is believed that potentially privileged documents will be easy to find via some mechanism like analyzing email senders and recipients. Maybe there is little concern that trade secrets or evidence of bad acts unrelated to the litigation will be revealed if some non-relevant documents are produced. In such situations you are faced with a dilemma when choosing a predictive coding workflow. The TAR 1.0 workflow allows documents to be produced without review, so there is potential for substantial savings if TAR 1.0 works well for the case in question, but TAR 1.0 sometimes doesn’t work well, especially when prevalence is low. TAR 2.0 doesn’t really support producing documents without reviewing them, but it is usually much more efficient than TAR 1.0 if all documents that are predicted to be relevant will be reviewed, especially if the task is difficult or prevalence is low.
TAR 1.0 involves a fair amount of up-front investment in reviewing control set documents and training documents before you can tell whether it is going to work well enough to produce a substantial number of documents without reviewing them. If you find that TAR 1.0 isn’t working well enough to avoid reviewing documents that will be produced (too many non-relevant documents would slip into the production) and you resign yourself to reviewing everything that is predicted to be relevant, you’ll end up reviewing more documents with TAR 1.0 than you would have with TAR 2.0. Switching from TAR 1.0 to TAR 2.0 midstream is less efficient than starting with TAR 2.0. Whether you choose TAR 1.0 or TAR 2.0, it is possible that you could have done less document review if you had made the opposite choice (if you know up front that you will have to review all documents that will be produced due to the circumstances of the case, TAR 2.0 is almost certainly the better choice as far as efficiency is concerned).
TAR 3.0 solves the dilemma by providing high efficiency regardless of whether or not you end up reviewing all of the documents that will be produced. You don’t have to guess which workflow to use and suffer poor efficiency if you are wrong about whether or not producing documents without reviewing them will be feasible. Before jumping into the performance numbers, here is a summary of the workflows (you can find some related animations and discussion in the recording of my recent webinar):
TAR 1.0 involves a training phase followed by a review phase with a control set being used to determine the optimal point when you should switch from training to review. The system no longer learns once the training phase is completed. The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant. The control set documents are not used to train the system. They are used to assess the system’s predictions so training can be terminated when the benefits of additional training no longer outweigh the cost of additional training. Training can be with randomly selected documents, known as Simple Passive Learning (SPL), or it can involve documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).
TAR 2.0 uses an approach called Continuous Active Learning (CAL), meaning that there is no separation between training and review–the system continues to learn throughout. While many approaches may be used to select documents for review, a significant component of CAL is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions. Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low. Since there is no separation between training and review, TAR 2.0 does not require a control set. Generating a control set can involve reviewing a large (especially when prevalence is low) number of non-relevant documents, so avoiding control sets is desirable.
TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space. It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents are reviewed. Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population. There is no need for a control set–the system is well-trained when no additional relevant cluster centers can be found. Analysis of the cluster centers that were reviewed provides an estimate of the prevalence and the number of non-relevant documents that would be produced if documents were produced based purely on the predictions without human review. The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or he/she can decide to review documents that have too much risk of being non-relevant (which can be used as additional training for the system, i.e., CAL). The key point is that the user has the info he/she needs to make a decision about how to proceed after completing review of the cluster centers that are likely to be relevant, and nothing done before that point becomes invalidated by the decision (compare to starting with TAR 1.0, reviewing a control set, finding that the predictions aren’t good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless).
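Here is a high-level sketch of the training phase just described; next_center, review, and train are hypothetical stand-ins, and the consecutive-miss stopping rule is a simplification of “no additional relevant cluster centers can be found”:

```python
def tar3_training_phase(next_center, review, train, stop_after=20):
    """Apply CAL to cluster centers only: repeatedly review the center
    predicted most likely to be relevant, feed the judgment back in, and
    stop once a run of consecutive centers turns up nothing relevant."""
    judged, misses = [], 0
    while misses < stop_after:
        center = next_center(judged)      # most promising unreviewed center
        if center is None:                # ran out of cluster centers
            break
        label = review(center)
        judged.append((center, label))
        misses = 0 if label else misses + 1
    model = train(judged)                 # reviewed centers are the seed set
    return model, judged                  # judged centers also inform the
                                          # prevalence / risk estimate
```

At this point the user has what is needed to choose between producing top-scoring documents without review (SAL-style) or continuing review with learning enabled (CAL), as described above.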
The table below shows the amount of document review required to reach 75% recall for seven categorization tasks with widely varying prevalence and difficulty. Performance differences between CAL and non-CAL approaches tend to be larger if a higher recall target is chosen. The document population is 100,000 news articles without dupes or near-dupes. “Min Total Review” is the number of documents requiring review (training documents and control set if applicable) if all documents predicted to be relevant will be produced without review. “Max Total Review” is the number of documents requiring review if all documents predicted to be relevant will be reviewed before production. None of the results include review of statistical samples used to measure recall, which would be the same for all workflows.
| | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 |
|---|---|---|---|---|---|---|---|
| Prevalence | | 4.1% | | | | 0.52% | |
| **TAR 1.0 SPL**: Control Set | 300 | 500 | 700 | 1,800 | 3,000 | 3,900 | 6,200 |
| Min Total Review | 1,300 | 800 | 6,700 | 4,800 | 4,000 | 7,900 | 18,200 |
| Max Total Review | 10,800 | 5,200 | 15,800 | 9,200 | 4,900 | 17,700 | 21,100 |
| **TAR 3.0 SAL**: Training (Cluster Centers) | 400 | 500 | 600 | 300 | 200 | 500 | 300 |
| Min Total Review | 400 | 500 | 600 | 300 | 200 | 500 | 300 |
| Max Total Review | 8,400 | 3,500 | 12,600 | 4,500 | 1,100 | 8,500 | 7,600 |
| **TAR 3.0 CAL**: Training (Cluster Centers) | 400 | 500 | 600 | 300 | 200 | 500 | 300 |
| Review | 7,000 | 3,000 | 6,700 | 2,400 | 900 | 3,300 | 1,400 |
| Total Review | 7,400 | 3,500 | 7,300 | 2,700 | 1,100 | 3,800 | 1,700 |
The size of the control set for TAR 1.0 was chosen so that it would contain approximately 20 relevant documents, so low prevalence requires a large control set (for example, task 6’s prevalence of 0.52% requires roughly 20/0.0052 ≈ 3,900 documents). Note that the control set size was chosen based on the assumption that it would be used only to measure changes in prediction quality. If the control set will be used for other things, such as recall estimation, it needs to be larger.
The number of random training documents used in TAR 1.0 was chosen to minimize the Max Total Review result (see my article on gain curves for related discussion). This minimizes total review cost if all documents predicted to be relevant will be reviewed and if the cost of reviewing documents in the training phase and review phase are the same. If training documents will be reviewed by an expensive subject matter expert and the review phase will be performed by less expensive reviewers, the optimal amount of training will be different. If documents predicted to be relevant won’t be reviewed before production, the optimal amount of training will also be different (and more subjective), but I kept the training the same when computing Min Total Review values.
The optimal number of training documents for TAR 1.0 varied greatly for different tasks, ranging from 300 to 12,000. This should make it clear that there is no magic number of training documents that is appropriate for all projects. This is also why TAR 1.0 requires a control set–the optimal amount of training must be measured.
The results labeled TAR 3.0 SAL come from terminating learning once the review of cluster centers is complete, which is appropriate if documents will be produced without review (Min Total Review). The Max Total Review value for TAR 3.0 SAL tells you how much review would be required if you reviewed all documents predicted to be relevant but did not allow the system to learn from that review, which is useful to compare to the TAR 3.0 CAL result where learning is allowed to continue throughout. In some cases where the categorization task is relatively easy (tasks 2 and 5) the extra learning from CAL has no benefit unless the target recall is very high. In other cases CAL reduces review significantly.
I have not included TAR 2.0 in the table because the efficiency of TAR 2.0 with a small seed set (a single relevant document is enough) is virtually indistinguishable from the TAR 3.0 CAL results that are shown. Once you start turning the CAL crank the system will quickly head toward the relevant documents that are easiest for the classification algorithm to identify, and feeding those documents back in for training quickly floods out the influence of the seed set you started with. The only way to change the efficiency of CAL, aside from changing the software’s algorithms, is to waste time reviewing a large seed set that is less effective for learning than the documents that the algorithm would have chosen itself. The training done by TAR 3.0 with cluster centers is highly effective for learning, so there is no wasted effort in reviewing those documents.
To illustrate the dilemma I pointed out at the beginning of the article, consider task 2. The table shows that prevalence is 4.1%, so there are 4,100 relevant documents in the population of 100,000 documents. To achieve 75% recall, we would need to find 3,075 relevant documents. Some of the relevant documents will be found in the control set and the training set, but most will be found in the review phase. The review phase involves 4,400 documents. If we produce all of them without review, most of the produced documents will be relevant (3,075 out of a little more than 4,400). TAR 1.0 would require review of only 800 documents for the training and control sets. By contrast, TAR 2.0 (I’ll use the Total Review value for TAR 3.0 CAL as the TAR 2.0 result) would produce 3,075 relevant documents with no non-relevant ones (assuming no mistakes by the reviewer), but it would involve reviewing 3,500 documents. TAR 1.0 was better than TAR 2.0 in this case (if producing over a thousand non-relevant documents is acceptable). TAR 3.0 would have been an even better choice because it required review of only 500 documents (cluster centers) and it would have produced fewer non-relevant documents since the review phase would involve only 3,000 documents.
Next, consider task 6. If all 9,800 documents in the review phase of TAR 1.0 were produced without review, most of the production would be non-relevant documents since there are only 520 relevant documents (prevalence is 0.52%) in the entire population! That shameful production would occur after reviewing 7,900 documents for training and the control set, assuming you didn’t recognize the impending disaster and abort before getting that far. Had you started with TAR 2.0, you could have had a clean (no non-relevant documents) production after reviewing just 3,800 documents. With TAR 3.0 you would realize that producing documents without review wasn’t feasible after reviewing 500 cluster center documents and you would proceed with CAL, reviewing a total of 3,800 documents to get a clean production.
Task 5 is interesting because production without review is feasible (but not great) with respect to the number of non-relevant documents that would be produced, but TAR 1.0 is so inefficient when prevalence is low that you would be better off using TAR 2.0. TAR 2.0 would require reviewing 1,100 documents for a clean production, whereas TAR 1.0 would require reviewing 3,000 documents for just the control set! TAR 3.0 beats them both, requiring review of just 200 cluster centers for a somewhat dirty production.
It is worth considering how the results might change with a larger document population. If everything else remained the same (prevalence and difficulty of the categorization task), the size of the control set required would not change, and the number of training documents required would probably not change very much, but the number of documents involved in the review phase would increase in proportion to the size of the population, so the cost savings from being able to produce documents without reviewing them would be much larger.
In summary, TAR 1.0 gives the user the option to produce documents without reviewing them, but its efficiency is poor, especially when prevalence is low. Although the number of training documents required for TAR 1.0 when prevalence is low can be reduced by using active learning (not examined in this article) instead of documents chosen randomly for training, TAR 1.0 is still stuck with the albatross of the control set dragging down efficiency. In some cases (tasks 5, 6, and 7) the control set by itself requires more review labor than the entire document review using CAL. TAR 2.0 is vastly more efficient than TAR 1.0 if you plan to review all of the documents that are predicted to be relevant, but it doesn’t provide the option to produce documents without reviewing them. TAR 3.0 borrows some of the best aspects of both TAR 1.0 and 2.0. When all documents that are candidates for production will be reviewed, TAR 3.0 with CAL is just as efficient as TAR 2.0 and has the added benefits of providing a prevalence estimate and a diverse early view of relevant documents. When it is permissible to produce some documents without reviewing them, TAR 3.0 provides that capability with much better efficiency than TAR 1.0 due to its efficient training and elimination of the control set.
If you like graphs, the gain curves for all seven tasks are shown below. Documents used for training are represented by solid lines, and documents not used for training are shown as dashed lines. Dashed lines represent documents that could be produced without review if that is appropriate for the case. A green dot is placed at the end of the review of cluster centers–this is the point where the TAR 3.0 SAL and TAR 3.0 CAL curves diverge, but sometimes they are so close together that it is hard to distinguish them without the dot. Note that review of documents for control sets is not reflected in the gain curves, so the TAR 1.0 results require more document review than is implied by the curves.
There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding. This article explains why disclosing seed sets may provide far less transparency than people think.
The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.
Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling). Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated. If the requesting party is a competitor, the non-relevant training documents may reveal information that helps them compete. Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.
Other people are far more qualified to debate the legal and strategic issues than I am. Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think. Some reasons why it does not:
- If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies. A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).” Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”? I doubt it.
- You cannot tell whether the seed set is sufficient if you don’t have access to the full document population. There could be substantial pockets of important documents that are not represented in the seed set–how would you know? The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on number of relevant documents, not their importance) of the predictions the training set will produce. The requesting party cannot do that–they have no way of assessing adequacy of the training set other than wild guessing.
- You cannot tell whether the seed set is biased just by looking at it. Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under or over represented? If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance. Would you notice that all the relevant training documents happen to contain the word “Friday”? I doubt it.
- Suppose you see an important document in the seed set that was correctly tagged as being relevant. Can you rest assured that similar documents will be produced? Maybe not. Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document. I described how that could happen in this article. How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
- Poor training doesn’t necessarily mean that relevant documents will be missed. If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important. Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query. Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced. A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough. If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall. Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
- One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents. Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population? Is the requesting party going to complain that there is no smoking gun in the training set? The producing party can simply respond that there are no smoking guns in the full population.
- The seed set may have virtually no impact on the final result. To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion). If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training. In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article). On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used. With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions. This is iterated many times. So all documents that are reviewed are used as training documents. The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way. Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand. It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly. Recall isn’t the whole story–it measures the number of relevant documents found, not their importance. It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found. The point is that you should judge the production by analyzing the system’s output, not the training data that was input.