During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics. Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics). I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints. This article presents the results for all the queries submitted by the audience.
The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project. Still, it was entertaining. The topics are pretty simple, so a large percentage of the relevant documents can be found with a pretty simple search using some broad terms. For example, a search for “biology” would find 37% of the biology documents. A search for “law” would find 71% of the law articles. The trick is to find the relevant documents without pulling in too many of the non-relevant ones.
To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively). I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs. For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0). To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations. If the audience’s search query included phrases, they were converted an AND search enclosed in parenthesis. If the audience’s query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard). I noticed that there were a lot of irrelevant words that matched the wildcards. For example, “cell*” in a biology search should match cellphone, cellular, cellar, cellist, etc., but I excluded such words. I would highly recommend that people using keyword search check to see what their wildcards are actually matching–you may be pulling in a lot of irrelevant words. I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact). When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to retrieve more documents.
The tables below show the results. The actual queries are displayed below the tables. Discussion of the results is at the end.
Biology | Recall | |||
Query | Total Matches | Top 3,000 | Top 6,000 | All Matches |
1 | 4,407 | 34.0% | 47.2% | 47.2% |
2 | 13,799 | 37.3% | 46.0% | 80.9% |
3 | 25,168 | 44.3% | 60.9% | 87.8% |
4a | 42 | 0.5% | 0.5% | |
4b | 2,283 | 20.9% | 20.9% | |
TAR | 72.1% | 91.0% |
Law | Recall | |||
Query | Total Matches | Top 3,000 | Top 6,000 | All Matches |
5a | 2,914 | 35.8% | 35.8% | |
5b | 9,035 | 37.2% | 49.3% | 60.6% |
6 | 534 | 2.9% | 2.9% | |
7 | 27,288 | 32.3% | 47.1% | 79.1% |
TAR | 62.3% | 80.4% |
1) organism OR microorganism OR species OR DNA
2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)
3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria
4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
4b) (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)
5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))
5b) ruling OR judge OR justice OR (appellate OR appellant)
6) amendments OR FRE OR whistleblower
7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending
TAR beat keyword search across the board for both tasks. The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search. In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches. The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.
Similar experiments were performed later with many similarities but also some notable differences in the structure of the challenge and the results. You can read about them here: round 2, round 3, round 4, round 5, and round 6.