
TAR vs. Keyword Search Challenge, Round 6 (Instant Feedback)

This was by far the most significant iteration of the ongoing exercise where I challenge an audience to produce a keyword search that works better than technology-assisted review (also known as predictive coding or supervised machine learning).  There were far more participants than in previous rounds, and a structural change in the challenge allowed participants to get immediate feedback on the performance of their queries so they could iteratively improve them.  A total of 1,924 queries were submitted by 42 participants (an average of 45.8 queries per person) and higher recall levels were achieved than in any prior version of the challenge, but the audience still couldn’t beat TAR.

In previous versions of the experiment, the audience submitted search queries on paper or through a web form using their phones, and I evaluated a few of them live on stage to see whether the audience was able to achieve higher recall than TAR.  Because the number of live evaluations was so small, the audience had very little opportunity to use the results to improve their queries.  In the latest iteration, participants each had their own computer in the lab at the 2019 Ipro Tech Show, and the web form evaluated each query and immediately gave the user feedback on the recall achieved.  Furthermore, it displayed the relevance and important keywords for each of the top 100 documents matching the query, so participants could quickly discover useful new search terms to tweak their queries.  This gave participants a significant advantage over a normal e-discovery scenario, since they could try an unlimited number of queries without incurring the cost of making relevance determinations on the retrieved documents in order to decide which keywords would improve the queries.  The number of participants was significantly larger than in any of the previous iterations, and they had a full 20 minutes to try as many queries as they wanted.  It was the best chance an audience has ever had of beating TAR.  They failed.

To do a fair comparison between TAR and the keyword search results, recall values were compared for equal amounts of document review effort.  In other words, for a specified amount of human labor, which approach gave the best production?  For the search queries, the top 3,000 documents matching the query were evaluated to determine the number that were relevant so recall could be computed (the full population was reviewed in advance, so the relevance of all documents was known). That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with review of 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.  Participants could choose to submit queries for any of three topics: biology, medical industry, or law.
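The equal-effort comparison can be sketched in a few lines of Python.  This is only an illustration under assumptions: `ranked_ids` stands for a query's matching documents in ranked order and `relevant_ids` for the set of documents marked relevant in the advance review.  Both names and the toy data are hypothetical, not taken from the software used in the challenge.

```python
def recall_at_effort(ranked_ids, relevant_ids, effort):
    """Recall achieved by reviewing only the first `effort` documents."""
    reviewed = ranked_ids[:effort]  # a query may match fewer than `effort` docs
    found = sum(1 for doc_id in reviewed if doc_id in relevant_ids)
    return found / len(relevant_ids)

# Toy data: 10 relevant documents in the population; the query matches 8 documents.
relevant_ids = set(range(10))
ranked_ids = [0, 1, 50, 2, 51, 3, 52, 4]

recall_small = recall_at_effort(ranked_ids, relevant_ids, effort=4)
recall_large = recall_at_effort(ranked_ids, relevant_ids, effort=100)
print(recall_small, recall_large)
```

Note that a query matching fewer documents than the review budget gets the same recall at both effort levels, since there is nothing more to review.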

The results below labeled “Avg Participant” are computed by finding the highest recall achieved by each participant and averaging those values together.  These values are surely somewhat inflated, since in practice one would probably not go through so many iterations of honing the queries (especially since evaluating the efficacy of a query would normally involve considerable labor instead of being free and instantaneous).  However, I wanted to give the participants as much advantage as I could, and including all of the queries instead of just the best ones would have biased the results too low, because people made mistakes or experimented with bad queries just to explore the documents.  The results labeled “Best Participant” show the highest recall achieved by any participant (computed separately for Top 3,000 and Top 6,000, so they may be different queries).
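As a minimal sketch of that aggregation, assuming each submission is recorded as a (participant, recall) pair (a hypothetical data layout with made-up numbers):

```python
from collections import defaultdict
from statistics import mean

def summarize(submissions):
    """submissions: iterable of (participant_id, recall) pairs, one per query."""
    best = defaultdict(float)
    for participant, recall in submissions:
        best[participant] = max(best[participant], recall)
    # "Avg Participant" = mean of per-participant bests; "Best Participant" = overall max.
    return mean(best.values()), max(best.values())

submissions = [(1, 20.0), (1, 60.0), (2, 40.0), (2, 5.0), (3, 50.0)]
avg_participant, best_participant = summarize(submissions)
mean_of_all_queries = mean(recall for _, recall in submissions)
print(avg_participant, best_participant, mean_of_all_queries)
```

In this toy data the average of the per-participant bests is well above the mean over all queries, which is the direction of the bias described above.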

Biology Recall
                   Top 3,000   Top 6,000
Avg Participant    54.5%       69.5%
Best Participant   66.0%       83.2%
TAR 3.0 SAL        72.5%       91.0%
TAR 3.0 CAL        75.5%       93.0%

Medical Recall
                   Top 3,000   Top 6,000
Avg Participant    38.5%       51.8%
Best Participant   46.8%       64.0%
TAR 3.0 SAL        67.3%       83.7%
TAR 3.0 CAL        80.7%       88.5%

Law Recall
                   Top 3,000   Top 6,000
Avg Participant    43.1%       59.3%
Best Participant   60.5%       77.8%
TAR 3.0 SAL        63.5%       82.3%
TAR 3.0 CAL        77.8%       87.8%

As you can see from the tables above, the best result for any participant never beat TAR (SAL or CAL) when there was an equal amount of document review performed.  Furthermore, the average participant result for Top 6,000 never beat the TAR results for Top 3,000, though the best participant result sometimes did, so TAR typically gives a better result even with half as much review effort expended.  The graphs below show the best result for each participant, with the TAR results shown in blue for comparison.  The numbers in the legend are the ID numbers of the participants (the color for a particular participant is not consistent across topics).
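The comparisons in that paragraph can be checked directly against the numbers in the tables.  The dictionary layout below is just a convenience for this check; the values are copied from the tables above.

```python
# Recall (%) from the Round 6 tables, as (Top 3,000, Top 6,000).
results = {
    "biology": {"avg": (54.5, 69.5), "best": (66.0, 83.2), "sal": (72.5, 91.0), "cal": (75.5, 93.0)},
    "medical": {"avg": (38.5, 51.8), "best": (46.8, 64.0), "sal": (67.3, 83.7), "cal": (80.7, 88.5)},
    "law":     {"avg": (43.1, 59.3), "best": (60.5, 77.8), "sal": (63.5, 82.3), "cal": (77.8, 87.8)},
}

best_beats_tar_at_half_effort = {}
for topic, r in results.items():
    for k in (0, 1):
        # Equal effort: the best participant is below both SAL and CAL.
        assert r["best"][k] < min(r["sal"][k], r["cal"][k])
    # Average participant at 6,000 is below TAR at 3,000.
    assert r["avg"][1] < min(r["sal"][0], r["cal"][0])
    # Best participant at 6,000 vs. TAR at 3,000: (beats SAL, beats CAL).
    best_beats_tar_at_half_effort[topic] = (r["best"][1] > r["sal"][0],
                                            r["best"][1] > r["cal"][0])

print(best_beats_tar_at_half_effort)
```

The best participant at 6,000 documents beats both TAR variants at 3,000 only for biology; for law it beats SAL and ties CAL, and for medical it beats neither.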

bar_graph_bio

bar_graph_medical

bar_graph_law

The large number of people attempting the biology topic was probably because it was the default topic and because I used it to illustrate how to use the software.

One might wonder whether the participants could have done better if they had more than 20 minutes to work on their queries.  The graphs below show the highest recall achieved by any participant as a function of time.  You can see that results improved rapidly during the first 10 minutes, but it became hard to make much additional progress beyond that point.  Also, over half of the audience continued to submit queries after the 20-minute contest, while I was giving the remainder of the presentation.  Of all the queries, 40% were submitted during the first 10 minutes, 40% during the second 10 minutes, and 20% while I was talking.  Since roughly the same number of queries were submitted in the second 10 minutes as in the first 10 minutes, but much less progress was made, I think it is safe to say that time was not a big factor in the results.
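A "highest recall so far" curve of this kind is just a running maximum over the time-ordered submissions.  A small sketch with made-up (minute, recall) pairs, not the actual contest data:

```python
from itertools import accumulate

# Hypothetical (minutes since start, recall %) pairs, already sorted by time.
submissions = [(1, 22.0), (3, 41.5), (4, 30.0), (9, 58.0), (14, 55.0), (19, 60.5)]

times = [t for t, _ in submissions]
# Running maximum: the best recall achieved by any query submitted up to each time.
best_so_far = list(accumulate((r for _, r in submissions), max))

for t, best in zip(times, best_so_far):
    print(f"{t:>2} min: {best}")
```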

time_bio

time_medical

time_law

In summary, even with a large pool of participants, ample time, and the ability to hone search queries based on instant feedback, nobody was able to generate a better production than TAR when the same amount of review effort was expended.  It seems fair to say that keyword search often requires twice as much document review to achieve a production that is as good as what you would get with TAR.

TAR vs. Keyword Search Challenge, Round 5

At IG3 West 2018, the audience was challenged to construct a keyword search query that is more effective than technology-assisted review (TAR).  The procedure was the same as round 4, so I won’t repeat the details here.  The audience was small this time and we only got one query submission for each topic.  The submission for the law topic used AND to join the keywords together and matched no articles, so I changed the ANDs to ORs before evaluating it.  The results and queries are below.  TAR beat the keyword searches by a huge margin this time.

Biology Recall
Query              Top 3,000   Top 6,000
Search             20.1%       20.1%
TAR 3.0 SAL        72.5%       91.0%
TAR 3.0 CAL        75.5%       93.0%

Medical Recall
Query              Top 3,000   Top 6,000
Search             28.5%       38.1%
TAR 3.0 SAL        67.3%       83.7%
TAR 3.0 CAL        80.7%       88.5%

Law Recall
Query              Top 3,000   Top 6,000
Search             5.5%        9.4%
TAR 3.0 SAL        63.5%       82.3%
TAR 3.0 CAL        77.8%       87.8%

tar_vs_search5_biology

tar_vs_search5_medical

tar_vs_search5_law

biology query: (Evolution OR develop) AND (Darwin OR bird OR cell)
medical query: Human OR body OR medicine OR insurance OR license OR doctor OR patient
law query: securities OR conspiracy OR RICO OR insider
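To see why joining the law keywords with AND can match nothing while OR matches plenty, here is a small sketch.  The three documents are invented for illustration, and the whole-word matching is a simplification of whatever the real search engine does.

```python
# Invented toy documents; only the keywords come from the law query above.
docs = {
    1: "the court dismissed the securities fraud claim",
    2: "insider trading conspiracy charges were filed",
    3: "the bird evolved a longer beak",
}

def matches(doc_text, keywords, mode):
    """True if the document contains all (AND) or any (OR) of the keywords."""
    words = set(doc_text.lower().split())
    hits = [kw.lower() in words for kw in keywords]
    return all(hits) if mode == "AND" else any(hits)

keywords = ["securities", "conspiracy", "RICO", "insider"]
and_hits = [d for d, text in docs.items() if matches(text, keywords, "AND")]
or_hits = [d for d, text in docs.items() if matches(text, keywords, "OR")]
print(and_hits, or_hits)
```

With AND, a document must contain every keyword at once, which no document here does; with OR, any single keyword is enough.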

Highlights from IG3 West 2018

The IG3 West conference was held by Ing3nious at the Paséa Hotel & Spa in Huntington Beach, California.  This conference differed from other recent Ing3nious events in several ways.  It was two days of presentations instead of one.  There were three simultaneous panels instead of two.  Between panels there were sometimes three simultaneous vendor technology demos.  There was an exhibit hall with over forty vendor tables.  Due to the different format, I was only able to attend about a third of the presentations.  My notes are below.  You can find my full set of photos here.

Stop Chasing Horses, Start Building Fences: How Real-Time Technologies Change the Game of Compliance and Governance
Chris Surdak, the author of Jerk: Twelve Steps to Rule the World, talked about changing technology and the value of information, claiming that information is the new wealth.  Facebook, Amazon, Apple, Netflix, and Google together are worth more than France [apparently he means the sum of their market capitalizations is greater than the GDP of France, though that is a rather apples-to-oranges comparison since GDP is an annualized number].  We are exposed to persistent ambient surveillance (Alexa, Siri, Progressive Snapshot, etc.).  It is possible to detect whether someone is lying by using video to detect blood flow to their face.  Car companies monetized data about passengers’ weight (measured due to air bags).  Sentiment analysis has a hard time with sarcasm.  You can’t find emails about fraud by searching for “fraud”; discussions about fraudulent activity may be disguised as weirdly specific conversations about lunch.  The problem with graph analysis is that a large volume of talk about something doesn’t mean that it’s important.  The most important thing may be what’s missing.  When RadioShack went bankrupt, its remaining value was in its customer data (remember them asking for your contact info when you bought batteries?).  A one-word change to FRCP 37(e) should have changed corporate retention policies, but nobody changed.  The EU’s right to be forgotten is virtually impossible to implement in reality (how to deal with backup tapes?) and almost nobody does it.  Campbell’s has people shipping it their DNA so it can make diet recommendations for them.  With the GDPR, consent nullifies the protections, so it doesn’t really protect your privacy.

AI and the Corporate Law Department of the Future
Gartner says AI is at the peak of inflated expectations and a trough of disillusionment will follow.  Expect to be able to buy autonomous vehicles by 2023.  The economic downturn of 2008 caused law firms to start using metrics.  Legal will take a long time to adopt AI — managing partners still have assistants print stuff out.  Embracing AI puts a firm ahead of its competitors.  Ethical obligations are also an impediment to adoption of technology, since lawyers are concerned about understanding the result.

Advanced TAR Considerations: A 500 Level Crash Course
Continuous Active Learning (CAL), also called TAR 2.0, can adapt to shifts in the concept of relevance that may occur during the review.  There doesn’t seem to be much difference in the efficiency of SVM vs logistic regression when they are applied to the same task.  There can be a big efficiency difference between different tasks.  TAR 1.0 requires a subject-matter expert for training, but senior attorneys are not always readily available.  With TAR 1.0 you may be concerned that you will be required to disclose the training set (including non-responsive documents), but with TAR 2.0 there is case law that supports that being unnecessary [I’ve seen the argument that the production itself is the training set, but that neglects the non-responsive documents that were reviewed (and used for training) but not produced.  On the other hand, if you are talking about disclosing just the seed set that was used to start the process, that can be a single document and it has very little impact on the result].  Case law can be found at predictivecoding.com, which is updated at the end of each year.  TAR needs text, not image data.  Sometimes keywords are good enough.  When it comes to government investigations, many agencies (FTC, DOJ) use/accept TAR.  It really depends on the individual investigator, though, and you can’t fight their decision (the investigator is the judge).  Don’t use TAR for government investigations without disclosing that you are doing so.  TAR can have trouble if there are documents having high conceptual similarity where some are relevant and some aren’t.  Should you tell opposing counsel that you’re using TAR?  Usually, but it depends on the situation.  When the situation is symmetrical, both sides tend to be reasonable.  When it is asymmetrical, the side with very little data may try to make things expensive for the other side, so say something like “both sides may use advanced technology to produce documents” and don’t give more detail than that (e.g., how TAR will be trained, who will do the training, etc.) or you may invite problems.  Disclosing the use of TAR up front and getting agreement may avoid problems later.  Be careful about “untrainable documents” (documents containing too little text): separate them out, and maybe use metadata or file type to help analyze them.  Elusion testing can be used to make sure too many relevant documents weren’t missed.  One panelist said 384 documents could be sampled from the elusion set, though that may sometimes not be enough.  [I have to eat some crow here.  I raised my hand and pointed out that the margin of error for the elusion has to be divided by the prevalence to get the margin of error for the recall, which is correct.  I went on to say that with a sample of 384 giving ±5% for the elusion you would have ±50% for the recall if prevalence was 10%, making the measurement worthless.  The mistake is that while a sample of 384 technically implies a worst case of ±5% for the margin of error for elusion, it’s not realistic for the margin of error to be that bad for elusion because ±5% would occur if elusion was near 50%, but elusion is typically very small (smaller than the prevalence), causing the margin of error for the elusion to be significantly less than ±5%.  The correct margin of error for the recall from an elusion sample of 384 documents would be ±13% if the prevalence is 10%, and ±40% if the prevalence is 1%.  So, if prevalence is around 10% an elusion sample of 384 isn’t completely worthless (though it is much worse than the ±5% we usually aim for), but if prevalence is much lower than that it would be].
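The arithmetic in that bracketed note can be sketched as follows.  This uses the simple normal-approximation margin of error and the "divide by prevalence" rule stated above; the 2% elusion value is my assumption for illustration, and the ±13% and ±40% figures quoted above may rest on different assumptions or a more exact interval method, so this sketch is not expected to reproduce them exactly.

```python
import math

def elusion_margin(sample_size, elusion, z=1.96):
    """Normal-approximation margin of error (95% when z=1.96) for a sampled proportion."""
    return z * math.sqrt(elusion * (1 - elusion) / sample_size)

def recall_margin(sample_size, elusion, prevalence, z=1.96):
    """Margin of error for recall: the elusion margin divided by the prevalence."""
    return elusion_margin(sample_size, elusion, z) / prevalence

# Worst case for the elusion margin: elusion near 50% (unrealistic in practice).
worst = recall_margin(384, 0.50, 0.10)
# Hypothetical, more realistic case: elusion of 2% with 10% prevalence.
typical = recall_margin(384, 0.02, 0.10)

print(f"worst case: ±{worst:.0%}, assumed 2% elusion: ±{typical:.0%}")
```

The worst case reproduces the ±50% from the mistaken argument, while a small assumed elusion brings the recall margin down to the low teens of percent, the same order as the corrected figure for 10% prevalence.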

40 Years in 30 Minutes: The Background to Some of the Interesting Issues we Face
Steven Brower talked about the early days of the Internet and the current state of technology.  Early on, a user ID was used to tell who you were, not to keep you out.  Technology was elitist, and user-friendly was not a goal.  Now, so much is locked down for security reasons that things become unusable.  Law firms that prohibit access to social media force lawyers onto “secret” computers when a client needs something taken down from YouTube.  Emails about laws against certain things can be blocked due to keyword hits for the illegal things being described.  We don’t have real AI yet.  The next generation beyond predictive coding will be able to identify the 50 key documents for the case.  During e-discovery, try searching for obscenities to find things like: “I don’t give a f*** what the contract says.”  Autonomous vehicles won’t come as soon as people are predicting.  Snow is a problem for them.  We may get vehicles that drive autonomously from one parking lot to another, so the route is well known.  When there are a bunch of inebriated people in the car, who should it take commands from?  GDPR is silly since email bounces from computer to computer around the world.  The Starwood breach does not mean you need to get a new passport; your passport number was already out there.  To improve your security, don’t try to educate everyone about cybersecurity; you can eliminate half the risk by getting payroll to stop responding to emails asking for W2 data that appear to come from the CEO.  Scammers use the W2 data to file tax returns to get the refunds.  This is so common the IRS won’t even accept reports on it anymore.  You will still get your refund if it happens to you, but it’s a hassle.

Digging Into TAR
I moderated this panel, so I didn’t take notes.  We did the TAR vs. Keyword Search Challenge again.  The results are available here.

After the Incident: Investigating and Responding to a Data Breach
Plan in advance, and remember that you may not have access to the laptop containing the plan when there is a breach. Get a PR firm that handles crises in advance.  You need to be ready for the negative comments on Twitter and Facebook.  Have the right SMEs for the incident on the team.  Assume that everything is discoverable — attorney-client privilege won’t save you if you ask the attorney for business (rather than legal) advice.  Notification laws vary from state to state.  An investigation by law enforcement may require not notifying the public for some period of time.  You should do an annual review of your cyber insurance since things are changing rapidly.  Such policies are industry specific.

Employing Technology/Next-Gen Tools to Reduce eDiscovery Spend
Have a process, but also think about what you are doing and the specifics of the case.  Restrict the date range if possible.  Reuse the results when you have overlapping cases (e.g., privilege review).  Don’t just look at docs/hour when monitoring the review.  Look at accuracy and get feedback about what they are finding.  CAL tends to result in doing too much document review (want to stop at 75% recall but end up hitting 89%).  Using a tool to do redactions will give false positives, so you need manual QC of the result.  When replacing a patient ID with a consistent anonymized identifier, you can’t just transform the ID because that could be inverted, resulting in a HIPAA violation.
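To illustrate the last point about patient IDs, here is one possible approach (my assumption, not something the panel described): assign each ID a random token and keep the lookup table private, so the token carries no information that could be inverted.  Whether any particular scheme satisfies HIPAA is a legal question this sketch does not address.

```python
import secrets

class Pseudonymizer:
    """Maps each real ID to a random token; the private table is the only link back."""

    def __init__(self):
        self._table = {}

    def token_for(self, patient_id):
        # Reuse the existing token so the same patient always gets the same identifier.
        if patient_id not in self._table:
            self._table[patient_id] = secrets.token_hex(8)
        return self._table[patient_id]

p = Pseudonymizer()
a = p.token_for("MRN-001234")  # hypothetical patient IDs
b = p.token_for("MRN-001234")
c = p.token_for("MRN-005678")
print(a == b, a == c)
```

By contrast, a fixed transformation of the ID (reversing the digits, adding a constant, or an unsalted hash over a small ID space) can be undone or brute-forced by anyone who guesses the method.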

eDiscovery for the Rest of us
What are ediscovery considerations for relatively small data sets?  During meet and confer, try to cooperate.  Judges hate ediscovery disputes.  Let the paralegals hash out the details — attorneys don’t really care about the details as long as it works.  Remote collection can avoid travel costs and hourly fees while keeping strangers out of the client’s office.  The biggest thing they look for from vendors is cost.  Need a certain volume of data for TAR to be practical.  Email threading can be used at any size.

Does Compliance Stifle or Spark Innovation?
Startups tend to be full of people fleeing big corporations to get away from compliance requirements.  If you do compliance well, that can be an advantage over competitors.  Look at it as protecting the longevity of the business (protecting reputation, etc.).  At the DoD, compliance stifles innovation, but it creates a barrier against bad guys.  They have thousands of attacks per day and are about 8 years behind normal innovation.  Gray crimes are an area for innovation; examples include manipulation (influencing elections) and tanking a stock IPO by faking a poisoning.  Hospitals and law firms tend to pay, so they are prime targets for ransomware.

Panels That I Couldn’t Attend:
California and EU Privacy Compliance
What it all Comes Down to – Enterprise Cybersecurity Governance
Selecting eDiscovery Platforms and Vendors
Defensible Disposition of Data
Biometrics and the Evolving Legal Landscape
Storytelling in the Age of eDiscovery
Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
The Internet of Things and Everything as a Service – the Convergence of Security, Privacy and Product Liability
Similarities and Differences Between the GDPR and the New California Consumer Privacy Act – Similar Enough?
The Impact of the Internet of Things on eDiscovery
Escalating Cyber Risk From the IT Department to the Boardroom
So you Weren’t Quite Ready for GDPR?
Security vs. Compliance and Why Legal Frameworks Fall Short to Improve Information Security
How to Clean up Files for Governance and GDPR
Deception, Active Defense and Offensive Security…How to Fight Back Without Breaking the Law?
Information Governance – Separating the “Junk” from the “Jewels”
What are Big Law Firms Saying About Their LegalTech Adoption Opportunities and Challenges?
Cyber and Data Security for the GC: How to Stay out of Headlines and Crosshairs

Podcast: Can You Do Good TAR with a Bad Algorithm?

Bill Dimm will be speaking with John Tredennick and Tom Gricks on the TAR Talk podcast about his recent article TAR, Proportionality, and Bad Algorithms (1-NN).  The podcast will be on Tuesday, November 20, 2018 (podcast description and registration page is here).  You can download the recording here:
RECORDED PODCAST

TAR vs. Keyword Search Challenge, Round 4

This iteration of the challenge was performed during the Digging into TAR session at the 2018 Northeast eDiscovery & IG Retreat.  The structure was similar to round 3, but the audience was bigger.  As before, the goal was to see whether the audience could construct a keyword search query that performed better than technology-assisted review.

There are two sensible ways to compare performance.  Either see which approach reaches a fixed level of recall with the least review effort, or see which approach reaches the highest level of recall with a fixed amount of review effort.  Any approach comparing results having different recall and different review effort cannot give a definitive conclusion on which result is best without making arbitrary assumptions about a trade-off between recall and effort (this is why performance measures, such as the F1 score, that mix recall and precision together are not sensible for ediscovery).
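A small numerical illustration of the F1 point, using two hypothetical results (the numbers are invented, and 1,000 relevant documents in the population is an assumption):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total_relevant = 1000  # assumed number of relevant documents in the population

# Two hypothetical results with different recall AND different review effort.
a = {"precision": 0.30, "recall": 0.75}
b = {"precision": 0.60, "recall": 0.50}

for name, r in (("A", a), ("B", b)):
    docs_reviewed = r["recall"] * total_relevant / r["precision"]
    print(name, f"F1={f1(**r):.3f}", f"recall={r['recall']:.0%}", f"reviewed={docs_reviewed:.0f}")
```

F1 ranks B above A, yet A finds 250 more relevant documents at the cost of roughly three times the review.  Which is better depends on how the case values recall against effort, which is exactly the arbitrary trade-off that a single blended score hides.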

For the challenge we fixed the amount of review effort and measured the recall achieved, because that was an easier process to carry out under the circumstances.  Specifically, we took the top 3,000 documents matching the search query, reviewed them (this was instantaneous because the whole population was reviewed in advance), and measured the recall achieved.  That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.

Individuals in the audience submitted queries through a web form using smart phones or laptops, and I executed some of the queries in front of the audience (not all of them, due to limited time).  They could learn useful keywords from the documents matching the queries and tweak their queries and resubmit them.  Unlike a real ediscovery project, they had very limited time and no familiarity with the documents.  The audience could choose to work on any of three topics: biology, medical industry, or law.  In the results below, the queries are labeled with the submitters’ initials (some people gave only a first name, so there is only one initial) followed by a number if they submitted more than one query.  Two queries were omitted because they had less than 1% recall (the participants apparently misunderstood the task).  The queries that were evaluated in front of the audience were E-1, U, AC-1, and JM-1.  Discussion of the results follows the tables, graphs, and queries.

Biology Recall
Query              Top 3,000   Top 6,000
E-1                32.0%       49.9%
E-2                51.7%       60.4%
E-3                48.4%       57.6%
E-4                45.8%       60.7%
E-5                43.3%       54.0%
E-6                42.7%       57.2%
TAR 3.0 SAL        72.5%       91.0%
TAR 3.0 CAL        75.5%       93.0%

Medical Recall
Query              Top 3,000   Top 6,000
U                  17.1%       27.9%
TAR 3.0 SAL        67.3%       83.7%
TAR 3.0 CAL        80.7%       88.5%

Law Recall
Query              Top 3,000   Top 6,000
AC-1               16.4%       33.2%
AC-2               40.7%       54.4%
JM-1               49.4%       69.3%
JM-2               55.9%       76.4%
K-1                43.5%       60.6%
K-2                43.0%       62.6%
C                  32.9%       47.2%
R                  55.6%       76.6%
TAR 3.0 SAL        63.5%       82.3%
TAR 3.0 CAL        77.8%       87.8%

tar_vs_search4_biology

tar_vs_search4_medical

tar_vs_search4_law

E-1) biology OR microbiology OR chemical OR pharmacodynamic OR pharmacokinetic
E-2) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence
E-3) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis
E-4) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study
E-5) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table
E-6) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table OR research
U) Transplant OR organ OR cancer OR hypothesis
AC-1) law
AC-2) legal OR attorney OR (defendant AND plaintiff) OR precedent OR verdict OR deliberate OR motion OR dismissed OR granted
JM-1) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge
JM-2) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge OR defendant OR plaintiff OR court OR plaintiffs OR attorneys OR lawyers OR defense
K-1) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena
K-2) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena OR justice
C) (law OR legal OR criminal OR civil OR litigation) AND NOT (politics OR proposed OR pending)
R) Court OR courtroom OR judge OR judicial OR judiciary OR law OR lawyer OR legal OR plaintiff OR plaintiffs OR defendant OR defendants OR subpoena OR sued OR suing OR sue OR lawsuit OR injunction OR justice

None of the keyword searches achieved higher recall than TAR when the amount of review effort was equal.  All six of the biology queries were submitted by one person.  The first query was evaluated in front of the audience, and his first revision to the query did help, but subsequent (blind) revisions of the query tended to hurt more than they helped.  For biology, review of 3,000 documents with TAR gave better recall than review of 6,000 documents with any of the queries.  There was only a single query submitted for the medical industry, and it underperformed TAR substantially.  Five people submitted a total of eight queries for the law category, and the audience had the best results for that topic, which isn’t surprising since an audience full of lawyers and litigation support people would be expected to be especially good at identifying keywords related to the law.  Even the best queries had lower recall with review of 6,000 documents than TAR 3.0 CAL achieved with review of only 3,000 documents, but a few of the queries did achieve higher recall than TAR 3.0 SAL when twice as much document review was performed with the search query compared to TAR 3.0 SAL.

TAR vs. Keyword Search Challenge, Round 3

This iteration of the challenge, held at the Education Hub at ILTACON 2018, was structured somewhat differently from round 1 and round 2 to give the audience a better chance of beating TAR.  Instead of submitting search queries on paper, participants submitted them through a web form using their phones, which allowed them to repeatedly tweak their queries and resubmit them.  I executed the queries in front of the participants, so they could see the exact recall achieved (since all documents were marked as relevant or non-relevant by a human reviewer in advance) almost instantaneously and they could utilize the performance information for their queries and the queries of other participants to guide improvements to their queries. This actually gave the participants an advantage over what they would experience in a real e-discovery project since performance measurements would normally require human evaluation of a random sample from the search output, which would make execution of several iterations of a query guided by performance evaluations very expensive in terms of review labor.  The audience got those performance evaluations for free even though the goal was to compare recall achieved for equal amounts of document review effort.  On the other hand, the audience did still have the disadvantages of having limited time and no familiarity with the documents.

As before, recall was evaluated for the top 3,000 and top 6,000 documents, which was enough to achieve high recall with TAR (even with the training documents included, so total review effort for TAR and the search queries was the same).  Audience members were free to work on any of the three topics that were used in previous versions of the challenge: law, medical industry, or biology.  Unfortunately, the audience was much smaller than in previous versions of the challenge, and nobody chose to submit a query for the biology topic.

Previously, the TAR results were achieved by using the TAR 3.0 workflow to train with 200 cluster centers, sorting the documents based on the resulting relevance scores, and reviewing top-scoring documents until the desired amount of review effort was expended, without allowing predictions to be updated during that review (e.g., review of 200 training docs plus 2,800 top-scoring docs to get the “Top 3,000” result).  I’ll call this TAR 3.0 SAL (SAL = Simple Active Learning, meaning the system is not allowed to learn during the review of top-scoring documents).  In practice you wouldn’t do that.  If you were reviewing top-scoring documents, you would allow the system to continue learning (CAL).  You would use SAL only if you were producing top-scoring documents without reviewing them, since allowing learning to continue during the review would reduce the amount of review needed to achieve a desired level of recall.  I used TAR 3.0 SAL in previous iterations because I wanted to simulate the full review in front of the audience in a few seconds and TAR 3.0 CAL would have been slower.  This time, I did the TAR calculations in advance and present both the SAL and CAL results so you can see how much difference the additional learning from CAL made.
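The difference between the two review loops can be sketched with a toy simulation.  Everything here is an assumption for illustration: the word-counting "model" is a stand-in and not the TAR 3.0 classifier, `training_ids` stands in for the reviewed cluster centers, and the six tiny documents were constructed so that the effect is visible.

```python
from collections import Counter

def train(labeled_docs):
    """Toy model: each word scores +1 per relevant training doc containing it, -1 per non-relevant."""
    weights = Counter()
    for words, is_relevant in labeled_docs:
        for word in set(words):
            weights[word] += 1 if is_relevant else -1
    return weights

def review(docs, truth, training_ids, budget, batch_size, continuous):
    """Review `budget` docs in total; retrain after each batch only if `continuous` (CAL)."""
    reviewed = list(training_ids)
    model = train([(docs[i], truth[i]) for i in reviewed])
    while len(reviewed) < budget:
        seen = set(reviewed)
        unreviewed = [i for i in docs if i not in seen]
        if not unreviewed:
            break
        # Highest-scoring unreviewed documents first (ties keep their original order).
        unreviewed.sort(key=lambda i: sum(model[w] for w in set(docs[i])), reverse=True)
        reviewed.extend(unreviewed[:min(batch_size, budget - len(reviewed))])
        if continuous:  # CAL: learn from everything reviewed so far
            model = train([(docs[i], truth[i]) for i in reviewed])
    return sum(truth[i] for i in reviewed) / sum(truth.values())  # recall

docs = {
    0: ["court", "judge"], 1: ["recipe", "soup"], 2: ["judge", "verdict"],
    3: ["weather", "rain"], 4: ["verdict", "appeal"], 5: ["appeal", "statute"],
}
truth = {0: True, 1: False, 2: True, 3: False, 4: True, 5: True}

sal = review(docs, truth, training_ids=[0, 1], budget=4, batch_size=1, continuous=False)
cal = review(docs, truth, training_ids=[0, 1], budget=4, batch_size=1, continuous=True)
print(sal, cal)
```

In this constructed example the CAL loop learns the word "verdict" from a document found during review and uses it to find another relevant document, while the SAL loop, frozen after training, does not.  Toy data can be arranged to show either outcome, so this illustrates the mechanism rather than proving CAL is always better.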

One other difference compared to previous versions of the challenge is how I’ve labeled the queries below.  This time, the number indicates which participant submitted the query and the letter indicates which one of his/her queries is being analyzed (if the person submitted more than one) rather than indicating a tweaking of the query that I added to try to improve the result.  In other words, all variations were tweaks done by the audience instead of by me.  Discussion of the results follows the tables, graphs, and queries below.

Recall
Medical Industry   Top 3,000   Top 6,000
1a                 3.0%
1b                 17.4%
TAR 3.0 SAL        67.3%       83.7%
TAR 3.0 CAL        80.7%       88.5%

Recall
Law                Top 3,000   Top 6,000
2                  1.0%
3a                 36.1%       42.3%
3b                 45.3%       60.1%
3c                 47.2%       62.6%
4                  11.6%       13.8%
TAR 3.0 SAL        63.5%       82.3%
TAR 3.0 CAL        77.8%       87.8%

tar_vs_search3_medical

tar_vs_search3_law

1a)  Hospital AND New AND therapies
1b)  Hospital AND New AND (physicians OR doctors)
2)   Copyright AND mickey AND mouse
3a)  Schedule OR Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement
3b)  Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal
3c)  Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal OR Plaintiff OR Defendant
4)  Privacy OR (Personally AND Identifiable AND Information) OR PII OR (Protected AND Speech)

TAR won across the board, as in previous iterations of the challenge.  Only one person submitted queries for the medical industry topic.  His/her revised query did a better job of finding relevant documents, but still returned fewer than 3,000 documents and fared far worse than TAR — the query was just not broad enough to achieve high recall.  Three people submitted queries on the law topic.  One of those people revised the query a few times and got decent results (shown in green), but still fell far short of the TAR result, with review of 6,000 documents from the best query finding fewer relevant documents than review of half as many documents with TAR 3.0 SAL (TAR 3.0 CAL did even better).  It is unfortunate that the audience was so small, since a larger audience might have done better by learning from each other’s submissions.  Hopefully I’ll be able to do this with a bigger audience in the future.

TAR, Proportionality, and Bad Algorithms (1-NN)

Should proportionality arguments allow producing parties to get away with poor productions simply because they wasted a lot of effort due to an extremely bad algorithm?  This article examines one such bad algorithm that has been used in major review platforms, and shows that it could be made vastly more effective with a very minor tweak.  Are lawyers who use platforms lacking the tweak committing malpractice by doing so?

Last year I was moderating a panel on TAR (predictive coding) and I asked the audience what recall level they normally aim for when using TAR.  An attendee responded that it was a bad question because proportionality only required a reasonable effort.  Much of the audience expressed agreement.  This should concern everyone.  If quality of result (e.g., achieving a certain level of recall) is the goal, the requesting party really has no business asking how the result was achieved–any effort wasted by choosing a bad algorithm is borne by the producing party.  On the other hand, if the target is expenditure of a certain amount of effort, doesn’t the requesting party have the right to know and object if the producing party has chosen a methodology that is extremely inefficient?

The algorithm I’ll be picking on today is a classifier called 1-nearest neighbor, or 1-NN.  You may be using it without ever having heard that name, so pay attention to my description of it and see if it sounds familiar.  To predict whether a document is relevant, 1-NN finds the single most similar training document and predicts the relevance of the unreviewed document to be the same.  If a relevance score is desired instead of a yes/no relevance prediction, the relevance score can be taken to be the similarity value if the most similar training document is relevant, and it can be taken to be the negative of the similarity value if the most similar training document is non-relevant.  Here is a precision-recall curve for the 1-NN algorithm used in a TAR 1.0 workflow trained with randomly-selected documents:

[Figure: knn_1]
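The 1-NN scoring rule described above can be sketched in a few lines of Python.  The article does not say which similarity measure is used, so cosine similarity over sparse term-weight vectors is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def one_nn_score(doc, training):
    """Relevance score from the single most similar training document.

    training: list of (vector, is_relevant) pairs.
    Returns +similarity if the nearest training document is relevant,
    and -similarity if it is non-relevant.
    """
    best_sim, best_label = max(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
    )
    return best_sim if best_label else -best_sim


# Hypothetical training set: one relevant and one non-relevant document.
training = [({"court": 1.0}, True), ({"recipe": 1.0}, False)]
score_legal = one_nn_score({"court": 1.0, "judge": 1.0}, training)
score_food = one_nn_score({"recipe": 1.0}, training)
print(score_legal, score_food)
```

Note that only the nearest training document influences the score; every other training document is ignored.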

The precision falls off a cliff above 60% recall.  This is not due to inadequate training–the cliff shown above will not go away no matter how much training data you add.  To understand the implications, realize that if you sort the documents by relevance score and review from the top down until you reach the desired level of recall, 1/P at that recall tells the average number of documents you’ll review for each relevant document you find.  At 60% recall, precision is 67%, so you’ll review 1.5 documents (1/0.67 = 1.5) for each relevant document you find.  There is some effort wasted in reviewing those 0.5 non-relevant documents for each relevant document you find, but it’s not too bad.  If you keep reviewing documents until you reach 70% recall, things get much worse.  Precision drops to about 8%, so you’ll encounter so many non-relevant documents after you get past 60% recall that you’ll end up reviewing 12.5 documents for each relevant document you find.  You would surely be tempted to argue that proportionality says you should be able to stop at 60% recall because the small gain in result quality of going from 60% recall to 70% recall would cost nearly ten times as much review effort.  But does it really have to be so hard to get to 70% recall?
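The arithmetic behind “nearly ten times as much review effort” can be checked with a short Python sketch.  The precision values are the ones quoted above (the 8% figure is approximate); the count of 1,000 relevant documents is a hypothetical number that cancels out of the ratio.

```python
def docs_per_relevant(precision):
    """Average number of documents reviewed per relevant document found (1/P)."""
    return 1.0 / precision


def total_review(recall, precision, n_relevant):
    """Documents reviewed from the top of the ranking to reach `recall`."""
    return recall * n_relevant / precision


n_relevant = 1000  # hypothetical population of relevant documents

effort_60 = total_review(0.60, 0.67, n_relevant)  # stop at 60% recall
effort_70 = total_review(0.70, 0.08, n_relevant)  # push on to 70% recall
ratio = effort_70 / effort_60

print(f"1/P at 60% recall: {docs_per_relevant(0.67):.1f}")
print(f"1/P at 70% recall: {docs_per_relevant(0.08):.1f}")
print(f"review effort ratio (70% vs 60% recall): {ratio:.1f}")
```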

It’s very easy to come up with an algorithm that can reach higher recall without so much review effort once you understand why the performance cliff occurs.  When you sort the documents by relevance score with 1-NN, the documents where the most similar training document is relevant will be at the top of the list.  The performance cliff occurs when you start digging into the documents where the most similar training document is non-relevant.  The 1-NN classifier does a terrible job of determining which of those documents has the best chance of being relevant because it ignores valuable information that is available.  Consider two documents, X and Y, that both have a non-relevant training document as the most similar training document, but document X has a relevant training document as the second most similar training document and document Y has a non-relevant training document as the second most similar.  We would expect X to have a better chance of being relevant than Y, all else being equal, but 1-NN cannot distinguish between the two because it pays no attention to the second most similar training document.  Here is the result for 2-NN, which takes the two most similar training documents into account:

[Figure: knn_2]

Notice that 2-NN easily reaches 70% recall (1/P is 1.6 instead of 12.5), but it does have a performance cliff of its own at a higher level of recall because it fails to make use of information about the third most similar training document.  If we utilize information about the 40 most similar training documents we get much better performance as shown by the solid lines here:

[Figure: knn_40]
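Here is a minimal Python sketch of how a k-NN score can separate documents X and Y from the earlier example.  The article does not specify how the k neighbors are combined, so summing signed similarities is an assumption (with k = 1 it reduces to the 1-NN score described above), and the similarity values are made up for illustration.

```python
import heapq


def knn_score(neighbors, k):
    """Sum of signed similarities over the k most similar training documents.

    neighbors: iterable of (similarity, is_relevant), one per training document.
    Relevant neighbors add their similarity; non-relevant ones subtract it.
    """
    top = heapq.nlargest(k, neighbors, key=lambda n: n[0])
    return sum(sim if relevant else -sim for sim, relevant in top)


# Both documents have a non-relevant nearest neighbor (similarity 0.90).
# X's second neighbor is relevant; Y's second neighbor is non-relevant.
doc_x = [(0.90, False), (0.85, True), (0.10, False)]
doc_y = [(0.90, False), (0.85, False), (0.10, True)]

print("1-NN:", knn_score(doc_x, 1), knn_score(doc_y, 1))  # identical scores
print("2-NN:", knn_score(doc_x, 2), knn_score(doc_y, 2))  # X now ranks above Y
```

Changing `k` is the only modification needed to go from 1-NN to 2-NN or 40-NN in this sketch, which is the sense in which the tweak is minor.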

It was the presence of non-relevant training documents that tripped up the 1-NN algorithm because the non-relevant training document effectively hid the existence of evidence (similar training documents that were relevant) that a document might be relevant, so you might think the performance cliff could be avoided by omitting non-relevant documents from the training.  The result of doing that is shown with dashed lines in the figure above.  Omitting non-relevant training documents does help 1-NN at high recall, though it is still far worse than 40-NN with the non-relevant training documents included (omitting the non-relevant training documents actually harms 40-NN, as shown by the red dashed line).  A workflow that focuses on reviewing documents that are likely to be relevant, such as TAR 2.0, rather than training with random documents, will be less impacted by 1-NN’s shortcomings, but why would you ever suffer the poor performance of 1-NN when 40-NN requires such a minimal modification of the algorithm?

You might wonder whether the performance cliff shown above is just an anomaly.  Here are precision-recall curves for several additional categorization tasks with 1-NN on the left and 40-NN on the right.

[Figure: 1nn_vs_40nn_several_tasks]

Sometimes the 1-NN performance cliff occurs at high enough recall to allow a decent production, but sometimes it keeps you from finding even half of the relevant documents.  Should a court accept less than 50% recall when the most trivial tweak to the algorithm could have achieved much higher recall with roughly the same amount of document review?

Of course, there are many factors beyond the quality of the classifier, such as the choice of TAR 1.0 (SPL and SAL), TAR 2.0 (CAL), or TAR 3.0 workflows, that impact the efficiency of the process.  The research by Grossman and Cormack that courts have relied upon to justify the use of TAR because it reaches recall that is comparable to or better than an exhaustive human review is based on CAL (TAR 2.0) with good classifiers, whereas some popular software uses TAR 1.0 (less efficient if documents will be reviewed before production) and poor classifiers such as 1-NN.  If the producing party vows to reach high recall and bears the cost of choosing bad software and/or processes to achieve that, there isn’t much for the requesting party to complain about  (though the producing party could have a bone to pick with an attorney or service provider who recommended an inefficient approach). On the other hand, if the producing party argues that low recall should be tolerated because decent recall would require too much effort, it seems that asking whether the algorithms used are unnecessarily inefficient would be appropriate.

TAR vs. Keyword Search Challenge, Round 2

During my presentation at the South Central eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding).  This is similar to the experiment done a few months earlier.  See this article for more details.  The audience again worked in groups to construct keyword searches for two topics.  One topic, articles on law, was the same as last time.  The other topic, the medical industry, was new (it replaced biology).

Performance was evaluated by comparing the recall achieved for equal amounts of document review effort (the population was fully categorized in advance, so measurements are exact, not estimates).  Recall for the top 3000 keyword search matches was compared to recall from reviewing 202 training documents (2 seed documents plus 200 cluster centers using the TAR 3.0 method) and 2798 documents having the highest relevance scores from TAR.  Similarly, recall from the top 6000 keyword search matches was compared to recall from review of 6000 documents with TAR.  Recall from all documents matching a search query was also measured to find the maximum recall that could be achieved with the query.
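The equal-effort comparison described above boils down to computing recall over the first N documents of a ranked list.  Here is a minimal Python sketch; `recall_at` and the toy document IDs are hypothetical, and it assumes the full set of relevant documents is known in advance, as it was for this exercise.

```python
def recall_at(ranked_ids, relevant_ids, n):
    """Fraction of all relevant documents found in the first n ranked documents.

    If the list has fewer than n documents (a keyword search that runs out of
    matches), every match is counted and recall simply stops growing.
    """
    found = sum(1 for doc_id in ranked_ids[:n] if doc_id in relevant_ids)
    return found / len(relevant_ids)


relevant = {1, 2, 3, 4}                # fully categorized population (toy data)
search_matches = [1, 9, 2, 8, 7, 3]    # ranked keyword search hits
tar_ranking = [1, 2, 3, 9, 4, 8, 7]    # training docs followed by top-scoring docs

print(recall_at(search_matches, relevant, 3))    # same effort: 3 docs reviewed
print(recall_at(tar_ranking, relevant, 3))
print(recall_at(search_matches, relevant, 100))  # "All": every matching document
```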

The search queries are shown after the performance tables and graphs.  When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to remove restrictions that were limiting the number of relevant documents that could be found.  The results are discussed at the end of the article.

Medical Industry Recall
Query Total Matches Top 3,000 Top 6,000 All
1a 1,618 14.4% 14.4%
1b 3,882 32.4% 40.6% 40.6%
2 7,684 30.3% 42.2% 46.6%
3a 1,714 22.4% 22.4%
3b 16,756 32.7% 44.6% 71.1%
4a 33,925 15.3% 20.3% 35.2%
4b 58,510 27.9% 40.6% 94.5%
TAR 67.3% 83.7%

 

Law Recall
Query Total Matches Top 3,000 Top 6,000 All
5 36,245 38.8% 56.4% 92.3%
6 25,370 51.9% 72.4% 95.7%
TAR 63.5% 82.3%

[Figure: tar_vs_search2_medical]

[Figure: tar_vs_search2_law]

 

1a) medical AND (industry OR business) AND NOT (scientific OR research)
1b) medical AND (industry OR business)
2) (revenue OR finance OR market OR brand OR sales) AND (hospital OR health OR medical OR clinical)
3a) (medical OR hospital OR doctor) AND (HIPPA OR insurance)
3b) medical OR hospital OR doctor OR HIPPA OR insurance
4a) (earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine) AND NOT (study OR research OR academic)
4b) earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine
5) FRCP OR Fed OR litigation OR appeal OR immigration OR ordinance OR legal OR law OR enact OR code OR statute OR subsection OR regulation OR rules OR precedent OR (applicable AND law) OR ruling
6) judge OR (supreme AND court) OR court OR legislation OR legal OR lawyer OR judicial OR law OR attorney

As before, TAR won across the board, but there were some surprises this time.

For the medical industry topic, review of 3000 documents with TAR achieved higher recall than any keyword search achieved with review of 6000 documents, very similar to results from a few months ago.  When all documents matching the medical industry search queries were analyzed, two queries did achieve high recall (3b and 4b, which are queries I tweaked to achieve higher recall), but they did so by retrieving a substantial percentage of the 100,000 document population (16,756 and 58,510 documents respectively).  TAR can reach any level of recall by simply taking enough documents from the sorted list—TAR doesn’t run out of matches like a keyword search does.  TAR matches the 94.6% recall that query 4b achieved (requiring review of 58,510 documents) with review of only 15,500 documents.

Results for the law topic were more interesting.  The two queries submitted for the law topic both performed better than any of the queries submitted for that topic a few months ago.  Query 6 gave the best results, with TAR beating it by only a modest amount.  If all 25,370 documents matching query 6 were reviewed, 95.7% recall would be achieved, which TAR could accomplish with review of 24,000 documents.  It is worth noting that TAR 2.0 would be more efficient, especially at very high recall.  TAR 3.0 gives the option to produce documents without review (not utilized for this exercise), plus computations are much faster due to there being vastly fewer training documents, which is handy for simulating a full review live in front of an audience in a few seconds.

TAR vs. Keyword Search Challenge

During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics.  Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics).  I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints.  This article presents the results for all the queries submitted by the audience.

The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project.  Still, it was entertaining.  The topics are pretty simple, so a large percentage of the relevant documents can be found with a pretty simple search using some broad terms.  For example, a search for “biology” would find 37% of the biology documents.  A search for “law” would find 71% of the law articles.  The trick is to find the relevant documents without pulling in too many of the non-relevant ones.

To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively).  I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs.  For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0).  To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.

The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations.  If the audience’s search query included phrases, they were converted to an AND search enclosed in parentheses.  If the audience’s query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard).  I noticed that there were a lot of irrelevant words that matched the wildcards.  For example, “cell*” in a biology search would match cellphone, cellular, cellar, cellist, etc., but I excluded such words.  I would highly recommend that people using keyword search check to see what their wildcards are actually matching–you may be pulling in a lot of irrelevant words.  I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact).  When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to retrieve more documents.
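Checking what a wildcard actually matches is easy to sketch in Python.  This is not how Clustify does it; `expand_wildcard`, the toy vocabulary, and the hand-picked `keep` set are hypothetical, and a real check would run against the words in your own search index.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, vocabulary):
    """Return every indexed word that the wildcard pattern matches, sorted."""
    return sorted(word for word in vocabulary if fnmatchcase(word, pattern))


# Toy index vocabulary (an assumption for illustration).
vocabulary = {"cell", "cells", "celled", "cellphone", "cellular",
              "cellar", "cellist", "enzyme"}

matches = expand_wildcard("cell*", vocabulary)
print(matches)  # inspect these before trusting the wildcard

# Keep only the words that make sense for the topic, chosen by a person.
keep = {"cell", "cells", "celled"}
query = "(" + " OR ".join(word for word in matches if word in keep) + ")"
print(query)
```

Replacing the wildcard with an explicit OR list keeps the query readable and makes it obvious which words are contributing matches.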

The tables below show the results.  The actual queries are displayed below the tables.  Discussion of the results is at the end.

Biology Recall
Query Total Matches Top 3,000 Top 6,000 All Matches
1 4,407 34.0% 47.2% 47.2%
2 13,799 37.3% 46.0% 80.9%
3 25,168 44.3% 60.9% 87.8%
4a 42 0.5% 0.5%
4b 2,283 20.9% 20.9%
TAR 72.1% 91.0%

Law Recall
Query Total Matches Top 3,000 Top 6,000 All Matches
5a 2,914 35.8% 35.8%
5b 9,035 37.2% 49.3% 60.6%
6 534 2.9% 2.9%
7 27,288 32.3% 47.1% 79.1%
TAR 62.3% 80.4%

[Figure: tar_vs_search_biology]

[Figure: tar_vs_search_law]

1) organism OR microorganism OR species OR DNA

2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)

3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria

4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

4b)  (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))

5b) ruling OR judge OR justice OR (appellate OR appellant)

6) amendments OR FRE OR whistleblower

7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending

TAR beat keyword search across the board for both tasks.  The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search.  In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches.  The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.

Similar experiments were performed later with many similarities but also some notable differences in the structure of the challenge and the results.  You can read about them here: round 2, round 3, round 4, round 5, and round 6.

Highlights from DESI VII / ICAIL 2017

DESI (Discovery of Electronically Stored Information) is a one-day workshop within ICAIL (International Conference on Artificial Intelligence and Law), which is held every other year.  The conference was held in London last month.  Rumor has it that the next ICAIL will be in North America, perhaps Montreal.

I’m not going to go into the DESI talks based on papers and slides that are posted on the DESI VII website since you can read that content directly.  The workshop opened with a keynote by Maura Grossman and Gordon Cormack where they talked about the history of TREC tracks that are relevant to e-discovery (Spam, Legal, and Total Recall), the limitation on the recall that can be achieved due to ambiguous relevance (reviewer disagreement) for some documents, and the need for high recall when it comes to identifying privileged documents or documents where privacy must be protected.  When looking for privileged documents it is important to note that many tools don’t make use of metadata.  Documents that are missed may be technically relevant but not really important; you should look at a sample to see whether they are important.

Between presentations based on submitted papers there was a lunch where people separated into four groups to discuss specific topics.  The first group focused on e-discovery users.  Visualizations were deemed “nice to look at” but not always useful: does the visualization help you to answer a question faster?  Another group talked about how to improve e-discovery, including attorney aversion to algorithms and whether a substantial number of documents could be missed by CAL after the gain curve had plateaued.  Another group discussed dreams about future technologies, like better case assessment and redacting video.  The fourth group talked about GDPR and speculated that the UK would obey GDPR.

DESI ended with a panel discussion about future directions for e-discovery.  It was suggested that a government or consumer group should evaluate TAR systems.  Apparently, NIST doesn’t want to do it because it is too political.  One person pointed out that consumers aren’t really demanding it.  It’s not just a matter of optimizing recall and precision — process (quality control and workflow) matters, which makes comparisons hard.  It was claimed that defense attorneys were motivated to lobby against the federal rules encouraging the use of TAR because they don’t want incriminating things to be found.  People working in archiving are more enthusiastic about TAR.

Following DESI (and other workshops conducted in parallel on the first day), ICAIL had three more days of paper presentations followed by another day of workshops.  You can find the schedule here.  I only attended the first day of non-DESI presentations.  There are two papers from that day that I want to point out.  The first is Effectiveness Results for Popular e-Discovery Algorithms by Yang, David Grossman, Frieder, and Yurchak.  They compared performance of the CAL (relevance feedback) approach to TAR for several different classification algorithms, feature types, feature weightings, and with/without LSI.  They used several different performance metrics, though they missed the one I think is most relevant for e-discovery (review effort required to achieve an acceptable level of recall).  Still, it is interesting to see such an exhaustive comparison of algorithms used in TAR / predictive coding.  They’ve made their code available here.  The second paper is Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes by Conrad and Al-Kofahi.  The authors analyze a large database of jury verdicts in an effort to determine the feasibility of building a system to give strategic litigation advice (e.g., potential award size, trial duration, and suggested claims) based on a data-driven analysis of the case.