During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics. Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics). I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints. This article presents the results for all the queries submitted by the audience.
The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project. Still, it was entertaining. The topics are fairly simple, so a large percentage of the relevant documents can be found with a simple search using a few broad terms. For example, a search for “biology” would find 37% of the biology documents, and a search for “law” would find 71% of the law articles. The trick is to find the relevant documents without pulling in too many of the non-relevant ones.
To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively). I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs. For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0). To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
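To be concrete about the arithmetic, here is a minimal sketch of the recall-at-cutoff calculation (illustrative Python only; the function name and data layout are mine, not Clustify’s):

```python
# Minimal sketch of recall at a review cutoff, for illustration only.
# ranked_ids: document IDs ordered by search rank or TAR relevance score.
# relevant_ids: the set of documents that human reviewers marked relevant.

def recall_at_cutoff(ranked_ids, relevant_ids, cutoff=None):
    """Fraction of all relevant documents found in the top `cutoff` documents."""
    reviewed = ranked_ids if cutoff is None else ranked_ids[:cutoff]
    found = sum(1 for doc_id in reviewed if doc_id in relevant_ids)
    return found / len(relevant_ids)

# recall_at_cutoff(hits, relevant, 3000)   # top 3,000 (3% of the population)
# recall_at_cutoff(hits, relevant, 6000)   # top 6,000 (6% of the population)
# recall_at_cutoff(hits, relevant)         # all documents matching the query
```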
The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations. If an audience query included a phrase, the phrase was converted to an AND search enclosed in parentheses. If a query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard). I noticed that a lot of irrelevant words matched the wildcards. For example, “cell*” in a biology search would match cellphone, cellular, cellar, cellist, etc., but I excluded such words. I highly recommend that people using keyword search check what their wildcards are actually matching: you may be pulling in a lot of irrelevant words. I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact). When there is an “a” and “b” version of a query, the “a” version is the audience’s query as-is, and the “b” version was tweaked by me to retrieve more documents.
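To illustrate the wildcard handling, here is a rough sketch of expanding a wildcard into a parenthesized OR over the indexed vocabulary and dropping the off-topic expansions; the vocabulary and exclusion list below are hypothetical, and the real conversions were done by hand:

```python
import fnmatch

# Rough sketch of expanding a wildcard such as "cell*" into an OR search over the
# indexed vocabulary, excluding expansions that are clearly off-topic. The vocabulary
# and exclusion list are hypothetical examples, not the actual index.

def expand_wildcard(pattern, vocabulary, exclude=()):
    matches = [word for word in vocabulary if fnmatch.fnmatch(word, pattern)]
    kept = [word for word in matches if word not in exclude]
    return "(" + " OR ".join(kept) + ")"

vocab = ["cell", "cells", "cellphone", "cellar", "cellist", "species"]
print(expand_wildcard("cell*", vocab, exclude={"cellphone", "cellar", "cellist"}))
# -> (cell OR cells)
```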
The tables below show the results. The actual queries are displayed below the tables. Discussion of the results is at the end.
| Biology Query | Total Matches | Recall (Top 3,000) | Recall (Top 6,000) | Recall (All Matches) |
| --- | --- | --- | --- | --- |
| 1 | 4,407 | 34.0% | 47.2% | 47.2% |
| 2 | 13,799 | 37.3% | 46.0% | 80.9% |
| 3 | 25,168 | 44.3% | 60.9% | 87.8% |
| 4a | 42 | 0.5% | 0.5% | 0.5% |
| 4b | 2,283 | 20.9% | 20.9% | 20.9% |
| TAR | | 72.1% | 91.0% | |
| Law Query | Total Matches | Recall (Top 3,000) | Recall (Top 6,000) | Recall (All Matches) |
| --- | --- | --- | --- | --- |
| 5a | 2,914 | 35.8% | 35.8% | 35.8% |
| 5b | 9,035 | 37.2% | 49.3% | 60.6% |
| 6 | 534 | 2.9% | 2.9% | 2.9% |
| 7 | 27,288 | 32.3% | 47.1% | 79.1% |
| TAR | | 62.3% | 80.4% | |
1) organism OR microorganism OR species OR DNA
2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)
3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria
4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
4b) (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))
5b) ruling OR judge OR justice OR (appellate OR appellant)
6) amendments OR FRE OR whistleblower
7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending
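To make the Boolean matching semantics concrete, here is a toy sketch of testing a query like 4b against a document’s term set; it is an illustration only, not how the actual search engine evaluates queries:

```python
# Toy illustration of the Boolean matching semantics, not the actual search engine.
# A document is represented here as the set of terms it contains.

def matches_query_4b(terms):
    """Query 4b: (species OR (genetic AND marker) OR enzyme)
                 AND NOT (diagnosis OR treatment OR prognosis)"""
    positive = ("species" in terms
                or ("genetic" in terms and "marker" in terms)
                or "enzyme" in terms)
    negative = any(t in terms for t in ("diagnosis", "treatment", "prognosis"))
    return positive and not negative

print(matches_query_4b({"enzyme", "assay"}))      # True: positive term, no exclusions
print(matches_query_4b({"enzyme", "treatment"}))  # False: excluded by AND NOT
```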
TAR beat keyword search across the board for both tasks. The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search. In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches. The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.
This experiment was repeated several times later, with some notable differences in the structure of the challenge and in the results. You can read about the follow-ups here: round 2, round 3, round 4, round 5, and round 6.
Fascinating results, Bill. With TAR 3.0, when you say that there were a total of 202 training documents, were you coding these, or did you ‘cherry pick’ the top 10 x 20 + 2?
We use Ringtail, and at times I like to do something similar (I think), which is to split the corpus by cluster (Ringtail calls these Mines), then sample within each cluster, and then use CAL from the samples.
My concern is that most of our docs for review are emails, and the emails are always ‘polluted’ with extraneous data (deep email threads where the recipients and disclaimer footers typically have more text than the actual message), so I’ve always been sceptical of the accuracy of CAL and clustering, because I feel the system is being fooled into thinking things are similar when they’re not.
Of course, this has nothing to do with your experiment; it’s my long-winded way of saying that I like the idea of your TAR 3.0 model.
With the keyword searches, as you say, I think they would be more accurate if you had visibility into a word wheel where you could ‘road test’ the search terms to see what variants would be returned.
Matthew
Hi Matthew,
I wouldn’t say the training documents were cherry picked. Other than the 2 seed documents, which were one random document that matches a keyword search for “biology” or “law” plus one random non-relevant document, the training documents were chosen by the software (not by me). The TAR 3.0 process is like CAL but applied to cluster centers only—you review the 10 unreviewed cluster centers having the highest scores, update predictions, then do it again a total of 20 times.
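In rough code terms, the loop looks something like the sketch below. The feature extraction and model are generic scikit-learn stand-ins for illustration, not Clustify’s actual algorithm, and the function and variable names are made up:

```python
# Rough sketch of the TAR 3.0 loop described above: CAL applied to cluster centers
# only. Generic scikit-learn stand-ins, not Clustify's actual algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def tar3_train(center_texts, seed_texts, seed_labels, review_fn,
               iterations=20, batch_size=10):
    """Review the highest-scoring unreviewed cluster centers in batches of 10,
    retraining after each batch: 2 seeds + 20 * 10 reviews = 202 training docs."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(list(seed_texts) + list(center_texts))
    X_centers = X[len(seed_texts):]

    train_rows = list(range(len(seed_texts)))   # rows of X in the training set
    labels = list(seed_labels)                  # 1 = relevant, 0 = non-relevant
    reviewed = set()                            # indices into center_texts

    for _ in range(iterations):
        model = LogisticRegression(max_iter=1000).fit(X[train_rows], labels)
        scores = model.predict_proba(X_centers)[:, 1]         # update predictions
        unreviewed = [i for i in range(len(center_texts)) if i not in reviewed]
        batch = sorted(unreviewed, key=lambda i: scores[i], reverse=True)[:batch_size]
        for i in batch:
            reviewed.add(i)
            train_rows.append(len(seed_texts) + i)
            labels.append(review_fn(center_texts[i]))          # human coding decision
    # retrain once more so the last batch of coding decisions is reflected
    return LogisticRegression(max_iter=1000).fit(X[train_rows], labels), vectorizer
```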
Clustify has the ability to ignore email header data (even if embedded in the middle of the email due to replies) and footers, so we don’t really run into that problem. I could certainly see that being a problem for clustering. It should be less of a problem for TAR (depending on the algorithm) since the algorithm should (roughly speaking) figure out what parts the relevant documents have in common that the non-relevant docs don’t contain and use those parts as indicators of relevance. Of course, if you are using the TAR 3.0 approach you are relying on the clusters, so you need them to be good.
Yes, humans could probably do better if they could test their queries, but they would probably still lose. Some classification algorithms can generate a keyword search that sorts the documents the same as the TAR relevance score. When you do that (which I did during the talk), you find the TAR query contains hundreds or thousands of terms that are precisely weighted, including both positive and negative weights, to optimally sort the results, even taking word correlations into account (for many algorithms). It seems unlikely that a human would produce a better query.
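To make that concrete, here is a toy sketch of reading the term weights off a generic linear classifier (scikit-learn, with made-up documents) as a weighted keyword “query”; it illustrates the idea, not the algorithm I used in the talk:

```python
# Toy illustration: the weights of a linear classifier read off as a weighted
# keyword "query". Generic scikit-learn model and made-up documents, not the
# algorithm or data from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["enzyme and species study", "court ruling by the judge",
        "genetic marker research", "appellate court opinion"]
labels = [1, 0, 1, 0]   # 1 = relevant to the biology topic, 0 = not

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(X, labels)

# Each indexed term gets a positive or negative weight; a document's relevance
# score is a monotonic function of the weighted sum of its term values, so the
# weighted term list sorts documents the same way the model does.
weights = dict(zip(vectorizer.get_feature_names_out(), model.coef_[0]))
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>10s}  {weight:+.3f}")
```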
Thanks Bill. That’s really interesting.