TAR vs. Keyword Search Challenge, Round 2

During my presentation at the South Central eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding).  This is similar to the experiment done a few months earlier.  See this article for more details.  The audience again worked in groups to construct keyword searches for two topics.  One topic, articles on law, was the same as last time.  The other topic, the medical industry, was new (it replaced biology).

Performance was evaluated by comparing the recall achieved for equal amounts of document review effort (the population was fully categorized in advance, so measurements are exact, not estimates).  Recall for the top 3000 keyword search matches was compared to recall from reviewing 202 training documents (2 seed documents plus 200 cluster centers using the TAR 3.0 method) and 2798 documents having the highest relevance scores from TAR.  Similarly, recall from the top 6000 keyword search matches was compared to recall from review of 6000 documents with TAR.  Recall from all documents matching a search query was also measured to find the maximum recall that could be achieved with the query.
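The equal-effort comparison described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the experiment: the function names, the toy document IDs, and the assumption that each ranked list contains no duplicates are all hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found among the first k reviewed."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    # Slicing past the end is harmless: a keyword search simply runs out of matches.
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return found / len(relevant)


def tar_review_order(training_ids, scored_ids):
    """Training documents first, then the rest in descending relevance-score order."""
    seen = set(training_ids)
    return list(training_ids) + [d for d in scored_ids if d not in seen]


# Toy population (made up): documents 1-4 are relevant.
relevant = {1, 2, 3, 4}
keyword_ranked = [1, 7, 2, 9]                      # the query matched only 4 documents
tar_order = tar_review_order([7, 1], [2, 3, 1, 4, 5, 6])

keyword_recall = recall_at_k(keyword_ranked, relevant, 4)
tar_recall = recall_at_k(tar_order, relevant, 4)   # same review effort: 4 documents
```

In the toy data, both approaches review four documents, but the TAR ordering spends two of them on training documents and still finds more of the relevant set.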

The search queries are shown after the performance tables and graphs.  When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to remove restrictions that were limiting the number of relevant documents that could be found.  The results are discussed at the end of the article.

Medical Industry Recall
Query Total Matches Top 3,000 Top 6,000 All
1a 1,618 14.4% 14.4%
1b 3,882 32.4% 40.6% 40.6%
2 7,684 30.3% 42.2% 46.6%
3a 1,714 22.4% 22.4%
3b 16,756 32.7% 44.6% 71.1%
4a 33,925 15.3% 20.3% 35.2%
4b 58,510 27.9% 40.6% 94.5%
TAR 67.3% 83.7%

 

Law Recall
Query Total Matches Top 3,000 Top 6,000 All
5 36,245 38.8% 56.4% 92.3%
6 25,370 51.9% 72.4% 95.7%
TAR 63.5% 82.3%

[Figure: tar_vs_search2_medical (graph for the medical industry topic)]

[Figure: tar_vs_search2_law (graph for the law topic)]

1a) medical AND (industry OR business) AND NOT (scientific OR research)
1b) medical AND (industry OR business)
2) (revenue OR finance OR market OR brand OR sales) AND (hospital OR health OR medical OR clinical)
3a) (medical OR hospital OR doctor) AND (HIPPA OR insurance)
3b) medical OR hospital OR doctor OR HIPPA OR insurance
4a) (earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine) AND NOT (study OR research OR academic)
4b) earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine
5) FRCP OR Fed OR litigation OR appeal OR immigration OR ordinance OR legal OR law OR enact OR code OR statute OR subsection OR regulation OR rules OR precedent OR (applicable AND law) OR ruling
6) judge OR (supreme AND court) OR court OR legislation OR legal OR lawyer OR judicial OR law OR attorney
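To illustrate why a "b" query can only match more documents than its "a" counterpart, here is a minimal sketch of queries 1a and 1b as Python predicates over a document's set of words. The matching rules (case-insensitive whole words, no stemming) and the three sample documents are assumptions made for the example, not a description of the search engine actually used.

```python
import re


def words(text):
    """Lowercased set of alphanumeric tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_1a(w):
    # medical AND (industry OR business) AND NOT (scientific OR research)
    return ("medical" in w
            and ("industry" in w or "business" in w)
            and not ("scientific" in w or "research" in w))


def query_1b(w):
    # medical AND (industry OR business)
    return "medical" in w and ("industry" in w or "business" in w)


# Hypothetical documents, invented for illustration only.
docs = {
    "d1": "Medical device business posts record sales",
    "d2": "Medical industry funds new research into heart disease",
    "d3": "Scientific study of bird migration",
}

matches_1a = {d for d, text in docs.items() if query_1a(words(text))}
matches_1b = {d for d, text in docs.items() if query_1b(words(text))}
```

Here the AND NOT clause in 1a rejects d2 because it mentions research, even though it is about the medical industry; dropping the clause in 1b recovers it.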

As before, TAR won across the board, but there were some surprises this time.

For the medical industry topic, review of 3000 documents with TAR achieved higher recall than any keyword search achieved with review of 6000 documents, very similar to results from a few months ago.  When all documents matching the medical industry search queries were analyzed, two queries did achieve high recall (3b and 4b, which are queries I tweaked to achieve higher recall), but they did so by retrieving a substantial percentage of the 100,000 document population (16,756 and 58,510 documents respectively).  TAR can reach any level of recall by simply taking enough documents from the sorted list—TAR doesn’t run out of matches like a keyword search does.  TAR matches the 94.6% recall that query 4b achieved (requiring review of 58,510 documents) with review of only 15,500 documents.
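The idea of reading down a sorted list until a recall target is met can be sketched as follows. The function name and the toy data are hypothetical, and the sketch assumes a fully categorized population, as in this experiment.

```python
import math


def depth_for_recall(ranked_ids, relevant_ids, target_recall):
    """Number of documents to review, in ranked order, to reach target_recall.

    Returns None when the list runs out before the target is reached,
    which is what happens to a keyword search with too few matches.
    """
    relevant = set(relevant_ids)
    needed = math.ceil(target_recall * len(relevant))
    if needed == 0:
        return 0
    found = 0
    for depth, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            if found >= needed:
                return depth
    return None


# Toy data (made up): documents 1-4 are relevant.
relevant = {1, 2, 3, 4}
tar_order = [5, 1, 2, 8, 3, 9, 4, 6, 7, 10]   # every document in the population is ranked
keyword_matches = [1, 7, 2]                    # only documents matching the query

tar_depth = depth_for_recall(tar_order, relevant, 0.75)
keyword_depth = depth_for_recall(keyword_matches, relevant, 0.75)
```

Because the TAR ordering covers the whole population, a depth always exists for any target; the keyword list returns None as soon as the target exceeds the recall of its full match set.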

Results for the law topic were more interesting.  The two queries submitted for the law topic both performed better than any of the queries submitted for that topic a few months ago.  Query 6 gave the best results, with TAR beating it by only a modest amount.  If all 25,370 documents matching query 6 were reviewed, 95.7% recall would be achieved, which TAR could accomplish with review of 24,000 documents.  It is worth noting that TAR 2.0 would be more efficient, especially at very high recall.  TAR 3.0 gives the option to produce documents without review (not utilized for this exercise), plus computations are much faster due to there being vastly fewer training documents, which is handy for simulating a full review live in front of an audience in a few seconds.

3 thoughts on “TAR vs. Keyword Search Challenge, Round 2”

  1. Matthew

    Interesting as always, Bill.

    Out of curiosity about the workshopping of the keyword searches: I wonder how effective it would be if the audience had a word wheel of distinct words from the corpus or, even better, words from clusters.

    My point being that I think there is still merit in keyword filtering where you are using keywords from the corpus as opposed to being given instructions before you have access to the documents and ‘dreaming up’ keywords.

    Secondly, I’m surprised that proximity search operators weren’t used, i.e., A within N words of B.

    Similar to rounds of predictive coding, an approach that we take at times is to iteratively refine the keywords as you identify false positives.

    I take it also that the corpus consisted of documents as opposed to emails? Emails can be a bit difficult to work with, particularly long threads with lots and lots of recipients and lots of disclaimer footers.

    Matthew

    1. Bill Dimm Post author

      Hi Matthew,

      The audience didn’t have access to anything like cluster data. I’ll be doing the experiment at ILTACON (Education Hub, 12:40 on Tuesday, 8/21) and I’m thinking about how to give the audience more tools and the ability to iteratively improve their queries, but I don’t yet know what will be possible technology-wise and time will be limited.

      Proximity search wasn’t available. The search engine used is very limited because it is only intended to be used for finding a few seed documents to get training started (no point in bloating the index with proximity data for that). On the other hand, TAR is generating a keyword search that has the same limitations the audience has (it does a big OR query with keyword weights), so it is fair in a sense.

      The documents were magazine articles, not emails.
