TAR vs. Keyword Search Challenge, Round 4

This iteration of the challenge was performed during the Digging into TAR session at the 2018 Northeast eDiscovery & IG Retreat.  The structure was similar to round 3, but the audience was bigger.  As before, the goal was to see whether the audience could construct a keyword search query that performed better than technology-assisted review.

There are two sensible ways to compare performance.  Either see which approach reaches a fixed level of recall with the least review effort, or see which approach reaches the highest level of recall with a fixed amount of review effort.  Any approach comparing results having different recall and different review effort cannot give a definitive conclusion on which result is best without making arbitrary assumptions about a trade off between recall and effort (this is why performance measures, such as the F1 score, that mix recall and precision together are not sensible for ediscovery).

For the challenge we fixed the amount of review effort and measured the recall achieved, because that was an easier process to carry out under the circumstances.  Specifically, we took the top 3,000 documents matching the search query, reviewed them (this was instantaneous because the whole population was reviewed in advance), and measured the recall achieved.  That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.

Individuals in the audience submitted queries through a web form using smart phones or laptops and I executed some (due to limited time) of the queries in front of the audience.  They could learn useful keywords from the documents matching the queries and tweak their queries and resubmit them.  Unlike a real ediscovery project, they had very limited time and no familiarity with the documents.  The audience could choose to work on any of three topics: biology, medical industry, or law.  In the results below, the queries are labeled with the submitters’ initials (some people gave only a first name, so there is only one initial) followed by a number if they submitted more than one query.  Two queries were omitted because they had less than 1% recall (the participants apparently misunderstood the task).  The queries that were evaluated in front of the audience were E-1, U, AC-1, and JM-1.  The discussion of the result follows the tables, graphs, and queries.

Biology Recall
Query Top 3,000 Top 6,000
E-1 32.0% 49.9%
E-2 51.7% 60.4%
E-3 48.4% 57.6%
E-4 45.8% 60.7%
E-5 43.3% 54.0%
E-6 42.7% 57.2%
TAR 3.0 SAL 72.5% 91.0%
TAR 3.0 CAL 75.5% 93.0%
Medical Recall
Query Top 3,000 Top 6,000
U 17.1% 27.9%
TAR 3.0 SAL 67.3% 83.7%
TAR 3.0 CAL 80.7% 88.5%
Law Recall
Query Top 3,000 Top 6,000
AC-1 16.4% 33.2%
AC-2 40.7% 54.4%
JM-1 49.4% 69.3%
JM-2 55.9% 76.4%
K-1 43.5% 60.6%
K-2 43.0% 62.6%
C 32.9% 47.2%
R 55.6% 76.6%
TAR 3.0 SAL 63.5% 82.3%
TAR 3.0 CAL 77.8% 87.8%

tar_vs_search4_biology

tar_vs_search4_medical

tar_vs_search4_law

E-1) biology OR microbiology OR chemical OR pharmacodynamic OR pharmacokinetic
E-2) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence
E-3) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis
E-4) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study
E-5) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table
E-6) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table OR research
U) Transplant OR organ OR cancer OR hypothesis
AC-1) law
AC-2) legal OR attorney OR (defendant AND plaintiff) OR precedent OR verdict OR deliberate OR motion OR dismissed OR granted
JM-1) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge
JM-2) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge OR defendant OR plaintiff OR court OR plaintiffs OR attorneys OR lawyers OR defense
K-1) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena
K-2) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena OR justice
C) (law OR legal OR criminal OR civil OR litigation) AND NOT (politics OR proposed OR pending)
R) Court OR courtroom OR judge OR judicial OR judiciary OR law OR lawyer OR legal OR plaintiff OR plaintiffs OR defendant OR defendants OR subpoena OR sued OR suing OR sue OR lawsuit OR injunction OR justice

None of the keyword searches achieved higher recall than TAR when the amount of review effort was equal.  All six of the biology queries were submitted by one person.  The first query was evaluated in front of the audience, and his first revision to the query did help, but subsequent (blind) revisions of the query tended to hurt more than they helped.  For biology, review of 3,000 documents with TAR gave better recall than review of 6,000 documents with any of the queries.  There was only a single query submitted for the medical industry, and it underperformed TAR substantially.  Five people submitted a total of eight queries for the law category, and the audience had the best results for that topic, which isn’t surprising since an audience full of lawyers and litigation support people would be expected to be especially good at identifying keywords related to the law.  Even the best queries had lower recall with review of 6,000 documents than TAR 3.0 CAL achieved with review of only 3,000 documents, but a few of the queries did achieve higher recall than TAR 3.0 SAL when twice as much document review was performed with the search query compared to TAR 3.0 SAL.

1 thought on “TAR vs. Keyword Search Challenge, Round 4

  1. Karen Brenchley

    I think this would be a more interesting challenge if you gave the audience five documents in each area to look at before they came up with keywords.

    Reply

Leave a Reply to Karen BrenchleyCancel reply