This iteration of the challenge, held at the Education Hub at ILTACON 2018, was structured somewhat differently from round 1 and round 2 to give the audience a better chance of beating TAR. Instead of submitting search queries on paper, participants submitted them through a web form on their phones, which allowed them to tweak their queries and resubmit them repeatedly. I executed the queries in front of the participants, so they could see the exact recall achieved almost instantaneously (possible because every document had been marked relevant or non-relevant by a human reviewer in advance), and they could use the performance of their own queries and of other participants' queries to guide improvements. This actually gave the participants an advantage over what they would experience in a real e-discovery project, where measuring performance would normally require human review of a random sample from the search output, making several iterations of a query guided by performance evaluations very expensive in terms of review labor. The audience got those performance evaluations for free, even though the goal was to compare recall achieved for equal amounts of document review effort. On the other hand, the audience still had the disadvantages of limited time and no familiarity with the documents.
As before, recall was evaluated for the top 3,000 and top 6,000 documents, which was enough to achieve high recall with TAR (even with the training documents included, so total review effort for TAR and the search queries was the same). Audience members were free to work on any of the three topics used in previous versions of the challenge: law, medical industry, or biology. Unfortunately, the audience was much smaller than in previous versions of the challenge, and nobody chose to submit a query for the biology topic.
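Since every document in the collection already has a ground-truth label, the recall numbers reported below come down to a simple calculation. Here is a minimal sketch; the document IDs and label set are hypothetical placeholders, not data from the challenge.

```python
# Minimal sketch: recall for a query's top-N results when every document has
# already been labeled relevant/non-relevant by a human reviewer.

def recall_at_n(ranked_doc_ids, relevant_ids, n):
    """Fraction of all relevant documents found in the first n results."""
    retrieved = set(ranked_doc_ids[:n])
    return len(retrieved & relevant_ids) / len(relevant_ids)

# Example with made-up data:
relevant_ids = {"doc7", "doc12", "doc30"}           # ground-truth relevant set
ranked_doc_ids = ["doc12", "doc5", "doc7", "doc9"]  # query results in rank order

print(recall_at_n(ranked_doc_ids, relevant_ids, 3000))  # 2 of 3 relevant docs found
```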
Previously, the TAR results were achieved with the TAR 3.0 workflow: the system was trained on 200 cluster centers, the documents were sorted by the resulting relevance scores, and top-scoring documents were reviewed until the desired amount of review effort was expended, without allowing the predictions to be updated during that review (e.g., review of 200 training docs plus 2,800 top-scoring docs to get the "Top 3,000" result). I'll call this TAR 3.0 SAL (SAL = Simple Active Learning, meaning the system is not allowed to learn during the review of top-scoring documents). In practice you wouldn't do that. If you were reviewing top-scoring documents, you would allow the system to continue learning (CAL); you would use SAL only if you were producing top-scoring documents without reviewing them, since allowing learning to continue during the review reduces the amount of review needed to achieve a desired level of recall. I used TAR 3.0 SAL in previous iterations because I wanted to simulate the full review in front of the audience in a few seconds, and TAR 3.0 CAL would have been slower. This time, I did the TAR calculations in advance, so I can present both the SAL and CAL results and you can see how much difference the additional learning from CAL made.
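To make the SAL/CAL distinction concrete, here is a rough sketch of the two review workflows using a generic classifier (scikit-learn logistic regression over TF-IDF features) in place of the actual TAR 3.0 system. The inputs `documents`, `labels`, `seed_indices` (e.g., the 200 cluster-center training documents), and the batch size are all assumed placeholders, not details of the tool used in the challenge.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simulate_review(documents, labels, seed_indices, budget, cal=True, batch=100):
    """Return the number of relevant documents found after reviewing `budget`
    documents, counting the seed/training documents toward the review effort."""
    X = TfidfVectorizer().fit_transform(documents)   # documents: list of strings
    y = np.asarray(labels)                           # labels: 1 = relevant, 0 = not
    reviewed = list(seed_indices)                    # e.g., 200 cluster centers
    while len(reviewed) < budget:
        # (Re)train on everything reviewed so far; the seed must contain both classes.
        model = LogisticRegression(max_iter=1000).fit(X[reviewed], y[reviewed])
        scores = model.predict_proba(X)[:, 1]
        scores[reviewed] = -1.0                      # never re-review a document
        remaining = budget - len(reviewed)
        take = min(batch, remaining) if cal else remaining
        reviewed.extend(np.argsort(-scores)[:take])
        if not cal:                                  # SAL: rank once, no further learning
            break
    return int(y[reviewed].sum())
```

With `cal=False` the ranking is frozen after the initial training, as in the SAL results below; with `cal=True` the model keeps retraining after each batch of reviewed documents, which is why CAL finds more relevant documents for the same review effort.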
One other difference from previous versions of the challenge is how I've labeled the queries below. This time, the number indicates which participant submitted the query, and the letter indicates which of his/her queries is being analyzed (if the person submitted more than one), rather than indicating a tweak that I added to try to improve the result. In other words, all variations were tweaks made by the audience, not by me. Discussion of the results follows the tables and queries below.
| Medical Industry | Recall (Top 3,000) | Recall (Top 6,000) |
|---|---|---|
| 1a | 3.0% | |
| 1b | 17.4% | |
| TAR 3.0 SAL | 67.3% | 83.7% |
| TAR 3.0 CAL | 80.7% | 88.5% |
| Law | Recall (Top 3,000) | Recall (Top 6,000) |
|---|---|---|
| 2 | 1.0% | |
| 3a | 36.1% | 42.3% |
| 3b | 45.3% | 60.1% |
| 3c | 47.2% | 62.6% |
| 4 | 11.6% | 13.8% |
| TAR 3.0 SAL | 63.5% | 82.3% |
| TAR 3.0 CAL | 77.8% | 87.8% |
1a) Hospital AND New AND therapies
1b) Hospital AND New AND (physicians OR doctors)
2) Copyright AND mickey AND mouse
3a) Schedule OR Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement
3b) Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal
3c) Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal OR Plaintiff OR Defendant
4) Privacy OR (Personally AND Identifiable AND Information) OR PII OR (Protected AND Speech)
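For anyone who wants to experiment, here is a hedged sketch of how keyword queries like those above could be scored against a labeled collection. The `documents` dictionary, the `relevant_ids` set, and the simple case-insensitive substring matching are all assumptions for illustration, not the actual search engine used at the session.

```python
def matches(text, all_terms=(), any_terms=()):
    """True if the text contains every term in all_terms and, when any_terms
    is given, at least one of those terms (case-insensitive substring match)."""
    t = text.lower()
    return (all(term.lower() in t for term in all_terms)
            and (not any_terms or any(term.lower() in t for term in any_terms)))

def query_recall(documents, relevant_ids, **query):
    """Recall of a boolean keyword query over a fully labeled collection."""
    hits = {doc_id for doc_id, text in documents.items() if matches(text, **query)}
    return len(hits & relevant_ids) / len(relevant_ids)

# e.g., query 1b: Hospital AND New AND (physicians OR doctors)
# query_recall(documents, relevant_ids,
#              all_terms=("hospital", "new"), any_terms=("physicians", "doctors"))
```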
TAR won across the board, as in previous iterations of the challenge. Only one person submitted queries for the medical industry topic. His/her revised query did a better job of finding relevant documents, but it still returned fewer than 3,000 documents and fared far worse than TAR: the query was simply not broad enough to achieve high recall. Three people submitted queries on the law topic. One of them revised the query a few times (queries 3a through 3c) and got decent results, but still fell far short of the TAR result; reviewing 6,000 documents from the best query found fewer relevant documents than reviewing half as many documents with TAR 3.0 SAL (TAR 3.0 CAL did even better). It is unfortunate that the audience was so small, since a larger audience might have done better by learning from each other's submissions. Hopefully I'll be able to do this with a bigger audience in the future.