# Misleading Metrics and Irrelevant Research (Accuracy and F1)

If one algorithm achieved 98.2% accuracy while another had 98.6% for the same task, would you be surprised to find that the first algorithm required ten times as much document review to reach 75% recall compared to the second algorithm?  This article explains why some performance metrics don’t give an accurate view of performance for ediscovery purposes, and why that makes a lot of research utilizing such metrics irrelevant for ediscovery.

The key performance metrics for ediscovery are precision and recall.  Recall, R, is the percentage of all relevant documents that have been found.  High recall is critical to defensibility.  Precision, P, is the percentage of documents predicted to be relevant that actually are relevant.  High precision is desirable to avoid wasting time reviewing non-relevant documents (if documents will be reviewed to confirm relevance and check for privilege before production).  In other words, precision is related to cost.  Specifically, 1/P is the average number of documents you’ll have to review per relevant document found.  When using technology-assisted review (predictive coding), documents can be sorted by relevance score and you can choose any point in the sorted list and compute the recall and precision that would be achieved by treating documents above that point as being predicted to be relevant.  One can plot a precision-recall curve by doing precision and recall calculations at various points in the sorted document list.

The precision-recall curve to the right compares two different classification algorithms applied to the same task.  To do a sensible comparison, we should compare precision values at the same level of recall.  In other words, we should compare the cost of reaching equally good (same recall) productions.  Furthermore, the recall level where the algorithms are compared should be one that is sensible for for ediscovery — achieving high precision at a recall level a court wouldn’t accept isn’t very useful.  If we compare the two algorithms at R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%.  In other words, if you sort by relevance score with the two algorithms and review documents from top down until 75% of the relevant documents are found, you would review 15.2 documents per relevant document found with 1-NN and 1.4 documents per relevant document found with 40-NN.  The 1-NN algorithm would require over ten times as much document review as 40-NN.  1-NN has been used in some popular TAR systems.  I explained why it performs so badly in a previous article.

There are many other performance metrics, but they can be written as a mixture of precision and recall (see Chapter 7 of the current draft of my book).  Anything that is a mixture of precision and recall should raise an eyebrow — how can you mix together two fundamentally different things (defensibility and cost) into a single number and get a useful result?  Such metrics imply a trade-off between defensibility and cost that is not based on reality.  Research papers that aren’t focused on ediscovery often use such performance measures and compare algorithms without worrying about whether they are achieving the same recall, or whether the recall is high enough to be considered sufficient for ediscovery.  Thus, many conclusions about algorithm effectiveness simply aren’t applicable for ediscovery because they aren’t based on relevant metrics.

One popular metric is accuracy, which is the percentage of predictions that are correct.  If a system predicts that none of the documents are relevant and prevalence is 10% (meaning 10% of the documents are relevant), it will have 90% accuracy because its predictions were correct for all of the non-relevant documents.  If prevalence is 1%, a system that predicts none of the documents are relevant achieves 99% accuracy.  Such incredibly high numbers for algorithms that fail to find anything!  When prevalence is low, as it often is in ediscovery, accuracy makes everything look like it performs well, including algorithms like 1-NN that can be a disaster at high recall.  The graph to the right shows the accuracy-recall curve that corresponds to the earlier precision-recall curve (prevalence is 2.633% in this case), showing that it is easy to achieve high accuracy with a poor algorithm by evaluating it at a low recall level that would not be acceptable for ediscovery.  The maximum accuracy achieved by 1-NN in this case was 98.2% and the max for 40-NN was 98.6%.  In case you are curious, the relationship between accuracy, precision, and recall is:
$ACC = 1 - \rho (1 - R) - \rho R (1 - P) / P$
where $\rho$ is the prevalence.

Another popular metric is the F1 score.  I’ve criticized its use in ediscovery before.  The relationship to precision and recall is:
$F_1 = 2 P R / (P + R)$
The F1 score lies between the precision and the recall, and is closer to the smaller of the two.  As far as F1 is concerned, 30% recall with 90% precision is just as good as 90% recall with 30% precision (both give F1 = 0.45) even though the former probably wouldn’t be accepted by a court and the latter would.   F1 cannot be large at small recall, unlike accuracy, but it can be moderately high at modest recall, making it possible to achieve a decent F1 score even if performance is disastrously bad at the high recall levels demanded by ediscovery.  The graph to the right shows that 1-NN manages to achieve a maximum F1 of 0.64, which seems pretty good compared to the 0.73 achieved by 40-NN, giving no hint that 1-NN requires ten times as much review to achieve 75% recall in this example.

Hopefully this article has convinced you that it is important for research papers to use the right metric, specifically precision (or review effort) at high recall, when making algorithm comparisons that are useful for ediscovery.

# TAR vs. Keyword Search Challenge, Round 5

The audience was challenged to construct a keyword search query that is more effective than technology-assisted review (TAR) at IG3 West 2018.  The procedure was the same as round 4, so I won’t repeat the details here.  The audience was small this time and we only got one query submission for each topic.  The submission for the law topic used AND to join the keywords together and matched no articles, so I changed the ANDs to ORs before evaluating it.  The results and queries are below.  TAR beat the keyword searches by a huge margin this time.

 Biology Recall Query Top 3,000 Top 6,000 Search 20.1% 20.1% TAR 3.0 SAL 72.5% 91.0% TAR 3.0 CAL 75.5% 93.0%
 Medical Recall Query Top 3,000 Top 6,000 Search 28.5% 38.1% TAR 3.0 SAL 67.3% 83.7% TAR 3.0 CAL 80.7% 88.5%
 Law Recall Query Top 3,000 Top 6,000 Search 5.5% 9.4% TAR 3.0 SAL 63.5% 82.3% TAR 3.0 CAL 77.8% 87.8%

biology query: (Evolution OR develop) AND (Darwin OR bird OR cell)
medical query: Human OR body OR medicine OR insurance OR license OR doctor OR patient
law query: securities OR conspiracy OR RICO OR insider

# Highlights from IG3 West 2018

The IG3 West conference was held by Ing3nious at the Paséa Hotel & Spa in Huntington Beach, California. This conference differed from other recent Ing3nious events in several ways.  It was two days of presentations instead of one.  There were three simultaneous panels instead of two.  Between panels there were sometimes three simultaneous vendor technology demos.  There was an exhibit hall with over forty vendor tables.  Due to the different format, I was only able to attend about a third of the presentations.  My notes are below.  You can find my full set of photos here.

Stop Chasing Horses, Start Building Fences: How Real-Time Technologies Change the Game of Compliance and Governance

AI and the Corporate Law Department of the Future
Gartner says AI is at the peak of inflated expectations and a trough of disillusionment will follow.  Expect to be able to buy autonomous vehicles by 2023.  The economic downturn of 2008 caused law firms to start using metrics.  Legal will take a long time to adopt AI — managing partners still have assistants print stuff out.  Embracing AI puts a firm ahead of its competitors.  Ethical obligations are also an impediment to adoption of technology, since lawyers are concerned about understanding the result.

Advanced TAR Considerations: A 500 Level Crash Course
Continuous Active Learning (CAL), also called TAR 2.0, can adapt to shifts in the concept of relevance that may occur during the review.  There doesn’t seem to be much difference in the efficiency of SVM vs logistic regression when they are applied to the same task.  There can be a big efficiency difference between different tasks.  TAR 1.0 requires a subject-matter expert for training, but senior attorneys are not always readily available.  With TAR 1.0 you may be concerned that you will be required to disclose the training set (including non-responsive documents), but with TAR 2.0 there is case law that supports that being unnecessary [I’ve seen the argument that the production itself is the training set, but that neglects the non-responsive documents that were reviewed (and used for training) but not produced.  On the other hand, if you are taking about disclosing just the seed set that was used to start the process, that can be a single document and it has very little impact on the result.].  Case law can be found at predictivecoding.com, which is updated at the end of each year.  TAR needs text, not image data.  Sometimes keywords are good enough.  When it comes to government investigations, many agencies (FTC, DOJ) use/accept TAR.  It really depends on the individual investigator, though, and you can’t fight their decision (the investigator is the judge).  Don’t use TAR for government investigations without disclosing that you are doing so.  TAR can have trouble if there are documents having high conceptual similarity where some are relevant and some aren’t.  Should you tell opposing counsel that you’re using TAR?  Usually, but it depends on the situation.  When the situation is symmetrical, both sides tend to be reasonable.  When it is asymmetrical, the side with very little data may try to make things expensive for the other side, so say something like “both sides may use advanced technology to produce documents” and don’t give more detail than that (e.g., how TAR will be trained, who will do the training, etc.) or you may invite problems.  Disclosing the use of TAR up front and getting agreement may avoid problems later.  Be careful about “untrainable documents” (documents containing too little text) — separate them out, and maybe use meta data or file type to help analyze them.  Elusion testing can be used to make sure too many relevant documents weren’t missed.  One panelist said 384 documents could be sampled from the elusion set, though that may sometimes not be enough.  [I have to eat some crow here.  I raised my hand and pointed out that the margin of error for the elusion has to be divided by the prevalence to get the margin of error for the recall, which is correct.  I went on to say that with a sample of 384 giving ±5% for the elusion you would have ±50% for the recall if prevalence was 10%, making the measurement worthless.  The mistake is that while a sample of 384 technically implies a worst case of ±5% for the margin of error for elusion, it’s not realistic for the margin of error to be that bad for elusion because ±5% would occur if elusion was near 50%, but elusion is typically very small (smaller than the prevalence), causing the margin of error for the elusion to be significantly less than ±5%.  The correct margin of error for the recall from an elusion sample of 384 documents would be ±13% if the prevalence is 10%, and ±40% if the prevalence is 1%.  So, if prevalence is around 10% an elusion sample of 384 isn’t completely worthless (though it is much worse than the ±5% we usually aim for), but if prevalence is much lower than that it would be].

40 Years in 30 Minutes: The Background to Some of the Interesting Issues we Face

Digging Into TAR
I moderated this panel, so I didn’t take notes.  We did the TAR vs. Keyword Search Challenge again.  The results are available here.

After the Incident: Investigating and Responding to a Data Breach

Employing Technology/Next-Gen Tools to Reduce eDiscovery Spend
Have a process, but also think about what you are doing and the specifics of the case.  Restrict the date range if possible.  Reuse the results when you have overlapping cases (e.g., privilege review).  Don’t just look at docs/hour when monitoring the review.  Look at accuracy and get feedback about what they are finding.  CAL tends to result in doing too much document review (want to stop at 75% recall but end up hitting 89%).  Using a tool to do redactions will give false positives, so you need manual QC of the result.  When replacing a patient ID with a consistent anonymized identifier, you can’t just transform the ID because that could be inverted, resulting in a HIPAA violation.

eDiscovery for the Rest of us
What are ediscovery considerations for relatively small data sets?  During meet and confer, try to cooperate.  Judges hate ediscovery disputes.  Let the paralegals hash out the details — attorneys don’t really care about the details as long as it works.  Remote collection can avoid travel costs and hourly fees while keeping strangers out of the client’s office.  The biggest thing they look for from vendors is cost.  Need a certain volume of data for TAR to be practical.  Email threading can be used at any size.

Does Compliance Stifle or Spark Innovation?
Startups tend to be full of people fleeing big corporations to get away from compliance requirements. If you do compliance well, that can be an advantage over competitors.  Look at it as protecting the longevity of the business (protecting reputation, etc.).  At the DoD, compliance stifles innovation, but it creates a barrier against bad guys.  They have thousands of attacks per day and are about 8 years behind normal innovation.  Gray crimes are a area for innovation — examples include manipulation (influencing elections) and tanking a stock IPO by faking a poisoning.  Hospitals and law firms tend to pay, so they are prime targets for ransomware.

Panels That I Couldn’t Attend:
California and EU Privacy Compliance
What it all Comes Down to – Enterprise Cybersecurity Governance
Selecting eDiscovery Platforms and Vendors
Defensible Disposition of Data
Biometrics and the Evolving Legal Landscape
Storytelling in the Age of eDiscovery
Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
The Internet of Things and Everything as a Service – the Convergence of Security, Privacy and Product Liability
Similarities and Differences Between the GDPR and the New California Consumer Privacy Act – Similar Enough?
The Impact of the Internet of Things on eDiscovery
Escalating Cyber Risk From the IT Department to the Boardroom
So you Weren’t Quite Ready for GDPR?
Security vs. Compliance and Why Legal Frameworks Fall Short to Improve Information Security
How to Clean up Files for Governance and GDPR
Deception, Active Defense and Offensive Security…How to Fight Back Without Breaking the Law?
Information Governance – Separating the “Junk” from the “Jewels”
What are Big Law Firms Saying About Their LegalTech Adoption Opportunities and Challenges?
Cyber and Data Security for the GC: How to Stay out of Headlines and Crosshairs

# Highlights from Text Analytics Forum 2018

Text Analytics Forum is part of KMWorld.  It was held on November 7-8 at the JW Marriott in D.C..  Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day.  There was a technical track and an applications track.  Most of the slides are available here.  My photos, including photos of some slides that caught my attention or were not available on the website, are available here.  Since most slides are available online, I have only a few brief highlights below.  Next year’s KMWorld will be November 5-7, 2019.

The Think Creatively & Make Better Decisions keynote contained various interesting facts about the things that distract us and make us unproductive.  Distracted driving causes more deaths than drunk driving.  Attention spans have dropped from 12 seconds to 8 seconds (goldfish have a 9-second attention span).  Japan has texting lanes for walking.  71% of business meetings are unproductive, and 33% of employee time is spent in meetings. 281 billion emails were sent in 2018.  Don’t leave ideas and creative thinking to the few.  Mistakes shouldn’t be reprimanded.  Break down silos between departments.

The Deep Text Look at Text Analytics keynote explained that text mining is only part of text analytics.  Text mining treats words as things, whereas text analytics cares about meaning.  Sentiment analysis is now learning to handle things like: “I would have loved your product except it gave me a headache.”  It is hard for humans to pick good training documents for automatic categorization systems (what the e-discovery world calls predictive coding or technology-assisted review).  Computer-generated taxonomies are incredibly bad.  Deep learning is not like what humans do.  Deep learning takes 100,000 examples to detect a pattern, whereas humans will generalize (perhaps wrongly) from 2 examples.

The Cognitive Computing keynote mentioned that sarcasm makes sentiment analysis difficult.  For example: “I’m happy to spend a half hour of my lunch time in line at your bank.”  There are products to measure tone from audio and video.

The Don’t Stop at Stopwords: Function Words in Text Analytics session noted that function words, unlike content words, are added by the writer subconsciously.  Use of words like “that” or “the” instead of “this” can indicate the author is distancing himself/herself from the thing being described, possibly indicating deception.  They’ve used their techniques in about 20 different languages.  They need at least 300 words to make use of function word frequency to build a baseline.

The Should We Consign All Taxonomies to the Dustbin? talk considered the possibility of using machine learning to go directly from problem to solution without having a taxonomy in between.  He said that 100k documents or 1 million words of text are needed to get going.

# Podcast: Can You Do Good TAR with a Bad Algorithm?

Bill Dimm will be speaking with John Tredennick and Tom Gricks on the TAR Talk podcast about his recent article TAR, Proportionality, and Bad Algorithms (1-NN).  The podcast will be on Tuesday, November 20, 2018 (podcast description and registration page is here).  You can download the recording here:
RECORDED PODCAST

# Best Legal Blog Contest 2018

From a field of hundreds of potential nominees, the Clustify Blog received enough nominations to be selected to compete in The Expert Institute’s Best Legal Blog Contest in the Legal Tech category.

Now that the blogs have been nominated and placed into categories, it is up to readers to select the very best.  Each blog will compete for rank within its category, with the three blogs receiving the most votes in each category being crowned overall winners.  A reader can vote for as many blogs as he/she wants in each category, but can vote for a specific blog only once (this is enforced by requiring authentication with Google, LinkedIn, or Twitter).  Voting closes at 12:00 AM on December 17th, at which point the votes will be tallied and the winners announced.  You can find the Clustify Blog voting page here.

# TAR vs. Keyword Search Challenge, Round 4

This iteration of the challenge was performed during the Digging into TAR session at the 2018 Northeast eDiscovery & IG Retreat.  The structure was similar to round 3, but the audience was bigger.  As before, the goal was to see whether the audience could construct a keyword search query that performed better than technology-assisted review.

There are two sensible ways to compare performance.  Either see which approach reaches a fixed level of recall with the least review effort, or see which approach reaches the highest level of recall with a fixed amount of review effort.  Any approach comparing results having different recall and different review effort cannot give a definitive conclusion on which result is best without making arbitrary assumptions about a trade off between recall and effort (this is why performance measures, such as the F1 score, that mix recall and precision together are not sensible for ediscovery).

For the challenge we fixed the amount of review effort and measured the recall achieved, because that was an easier process to carry out under the circumstances.  Specifically, we took the top 3,000 documents matching the search query, reviewed them (this was instantaneous because the whole population was reviewed in advance), and measured the recall achieved.  That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.

Individuals in the audience submitted queries through a web form using smart phones or laptops and I executed some (due to limited time) of the queries in front of the audience.  They could learn useful keywords from the documents matching the queries and tweak their queries and resubmit them.  Unlike a real ediscovery project, they had very limited time and no familiarity with the documents.  The audience could choose to work on any of three topics: biology, medical industry, or law.  In the results below, the queries are labeled with the submitters’ initials (some people gave only a first name, so there is only one initial) followed by a number if they submitted more than one query.  Two queries were omitted because they had less than 1% recall (the participants apparently misunderstood the task).  The queries that were evaluated in front of the audience were E-1, U, AC-1, and JM-1.  The discussion of the result follows the tables, graphs, and queries.

 Biology Recall Query Top 3,000 Top 6,000 E-1 32.0% 49.9% E-2 51.7% 60.4% E-3 48.4% 57.6% E-4 45.8% 60.7% E-5 43.3% 54.0% E-6 42.7% 57.2% TAR 3.0 SAL 72.5% 91.0% TAR 3.0 CAL 75.5% 93.0%
 Medical Recall Query Top 3,000 Top 6,000 U 17.1% 27.9% TAR 3.0 SAL 67.3% 83.7% TAR 3.0 CAL 80.7% 88.5%
 Law Recall Query Top 3,000 Top 6,000 AC-1 16.4% 33.2% AC-2 40.7% 54.4% JM-1 49.4% 69.3% JM-2 55.9% 76.4% K-1 43.5% 60.6% K-2 43.0% 62.6% C 32.9% 47.2% R 55.6% 76.6% TAR 3.0 SAL 63.5% 82.3% TAR 3.0 CAL 77.8% 87.8%

E-1) biology OR microbiology OR chemical OR pharmacodynamic OR pharmacokinetic
E-2) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence
E-3) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis
E-4) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study
E-5) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table
E-6) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table OR research
U) Transplant OR organ OR cancer OR hypothesis
AC-1) law
AC-2) legal OR attorney OR (defendant AND plaintiff) OR precedent OR verdict OR deliberate OR motion OR dismissed OR granted
JM-1) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge
JM-2) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge OR defendant OR plaintiff OR court OR plaintiffs OR attorneys OR lawyers OR defense
K-1) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena
K-2) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena OR justice
C) (law OR legal OR criminal OR civil OR litigation) AND NOT (politics OR proposed OR pending)
R) Court OR courtroom OR judge OR judicial OR judiciary OR law OR lawyer OR legal OR plaintiff OR plaintiffs OR defendant OR defendants OR subpoena OR sued OR suing OR sue OR lawsuit OR injunction OR justice

None of the keyword searches achieved higher recall than TAR when the amount of review effort was equal.  All six of the biology queries were submitted by one person.  The first query was evaluated in front of the audience, and his first revision to the query did help, but subsequent (blind) revisions of the query tended to hurt more than they helped.  For biology, review of 3,000 documents with TAR gave better recall than review of 6,000 documents with any of the queries.  There was only a single query submitted for the medical industry, and it underperformed TAR substantially.  Five people submitted a total of eight queries for the law category, and the audience had the best results for that topic, which isn’t surprising since an audience full of lawyers and litigation support people would be expected to be especially good at identifying keywords related to the law.  Even the best queries had lower recall with review of 6,000 documents than TAR 3.0 CAL achieved with review of only 3,000 documents, but a few of the queries did achieve higher recall than TAR 3.0 SAL when twice as much document review was performed with the search query compared to TAR 3.0 SAL.

# Highlights from the Northeast eDiscovery & IG Retreat 2018

The 2018 Northeast eDiscovery and Information Governance Retreat was held at the Salamander Resort & Spa in Middleburg, Virginia.  It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.

Stratagies For Data Minimization Of Legacy Data
Backup and archiving should be viewed as separate functions.  When it comes to spoliation (FRCP Rule 37), reasonableness of the company’s data retention plan is key.  Over preservation is expensive.  There are not many cases on Rule 37 relating to backup tapes.  People are changing their behavior due to the changes in the FRCP, especially in heavily regulated industries such as healthcare and financial services.  Studies find that typically 70% of data has no business value and is not subject to legal hold or retention requirements for compliance.  When using machine learning, you can focus on finding what to keep or what to get rid of.  It is often best to start with unsupervised machine learning.  Be mindful of destructive malware.  To mitigate security risks, it is important to know where your data (including backup tapes) is.  If a backup tape goes missing, do you need to notify customers (privacy)?  To get started, create a matrix showing what you need to keep, keeping in mind legal holds and privacy (GDPR).  Old backup tapes are subject to GDPR.  Does the right to be forgotten apply to backup tapes?  There is currently no answer.  It would be hard to selectively delete data from the tapes, so maybe have a process that deletes during the restore.  There can be conflicts between U.S. ediscovery and GDPR, so you must decide which is the bigger risk.

Preparing A Coordinated Response To Government Inquiries And Investigations

Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than technology-assisted review. Results are posted here.

Implementing Information Governance – Nightmare On Corporate America Street?

Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Artificial intelligence (AI) should not merely analyze; it should present a result in a way that is actionable.  It might tell you how much two people talk, their sentiment, and whether there are any spikes in communication volume.  AI can be used by law firms for budgeting by analyzing prior matters.  There are concerns about privacy with AI.  Many clients are moving to the cloud.  Many are using private clouds for collaboration, not necessarily for utilizing large computing power.  Office 365 is of interest to many companies.  There was extensive discussion about the ediscovery analytics capabilities being added from the Equivio acquisition, and a demo by Marcel Katz of Microsoft.  The predictive coding (TAR) capability uses simple active learning (SAL) rather than continuous active learning (CAL).  It is 20 times slower in the cloud than running Equivio on premises.  There is currently no review tool in Office 365, so you have to export the predictions out and do the review elsewhere.  Mobile devices create additional challenges for ediscovery.  The time when a text message is sent may not match the time when it is received if the receiving device is off when the message is sent.  Technology needs to be able to handle emojis.  There are many different apps with many different data storage formats.

The ‘Team Of Teams’ Approach To Enterprise Security And Threat Management
Fast response is critical when you are attacked.  Response must be automated because a human response is not fast enough.  It can take 200 days to detect an adversary on the network, so assume someone is already inside.  What are the critical assets, and what threats should you look for?  What value does the data have to the attacker?  What is the impact on the business?  What is the impact on the people?  Know what is normal for your systems.  Is a large data transfer at 2:00am normal?  Simulate a phishing attack and see if your employees fall for it.  In one case a CEO was known to be in China for a deal, so someone impersonating the CEO emailed the CFO to send $50 million for the deal. The money was never recovered. Have processes in place, like requiring a signature for amounts greater than$10,000.  If a company is doing a lot of acquisitions, it can be hard to know what is on their network.  How should small companies get started?  Change passwords, hire an external auditor, and make use of open source tools.

From Data To GRC Insight
Governance, risk management, and compliance (GRC) needs to become centralized and standardized.  Practicing incident response as a team results in better responses when real incidents happen.  Growing data means growing risk.  Beware of storage of social security numbers and credit card numbers.  Use encryption and limit access based on role.  Detect emailing of spreadsheets full of data.  Know what the cost of HIPAA violations is and assign the risk of non-compliance to an individual.  Learn about the NIST Cybersecurity Framework.  Avoid fines and reputational risk, and improve the organization.  Transfer the risk by having data hosted by a company that provides security.  Cloud and mobile can have big security issues.  The company can’t see traffic on mobile devices to monitor for phishing.

# TAR vs. Keyword Search Challenge, Round 3

This iteration of the challenge, held at the Education Hub at ILTACON 2018, was structured somewhat differently from round 1 and round 2 to give the audience a better chance of beating TAR.  Instead of submitting search queries on paper, participants submitted them through a web form using their phones, which allowed them to repeatedly tweak their queries and resubmit them.  I executed the queries in front of the participants, so they could see the exact recall achieved (since all documents were marked as relevant or non-relevant by a human reviewer in advance) almost instantaneously and they could utilize the performance information for their queries and the queries of other participants to guide improvements to their queries. This actually gave the participants an advantage over what they would experience in a real e-discovery project since performance measurements would normally require human evaluation of a random sample from the search output, which would make execution of several iterations of a query guided by performance evaluations very expensive in terms of review labor.  The audience got those performance evaluations for free even though the goal was to compare recall achieved for equal amounts of document review effort.  On the other hand, the audience did still have the disadvantages of having limited time and no familiarity with the documents.

As before, recall was evaluated for the top 3000 and top 6000 documents, which was enough to achieve high recall with TAR (even with the training documents included, so total review effort for TAR and the search queries was the same).  Audience members were free to work on any of the three topics that were used in previous versions of the challenge: law, medical industry, or biology.  Unfortunately, the audience was much smaller than previous versions of the challenge, and nobody chose to submit a query for the biology topic.

Previously, the TAR results were achieved by using the TAR 3.0 workflow to train with 200 cluster centers, documents were sorted based on the resulting relevance scores, and top-scoring documents were reviewed until the desired amount of review effort was expended without allowing predictions to be updated during that review (e.g., review of 200 training docs plus 2,800 top scoring docs to get the “Top 3,000” result).  I’ll call this TAR 3.0 SAL (SAL = Simple Active Learning, meaning the system is not allowed to learn during the review of top-scoring documents).  In practice you wouldn’t do that.  If you were reviewing top-scoring documents, you would allow the system to continue learning (CAL).  You would use SAL only if you were producing top-scoring documents without reviewing them since allowing learning to continue during the review would reduce the amount of review needed to achieve a desired level of recall.  I used TAR 3.0 SAL in previous iterations because I wanted to simulate the full review in front of the audience in a few seconds and TAR 3.0 CAL would have been slower.  This time, I did the TAR calculations in advance and present both the SAL and CAL results so you can see how much difference the additional learning from CAL made.

One other difference compared to previous versions of the challenge is how I’ve labeled the queries below.  This time, the number indicates which participant submitted the query and the letter indicates which one of his/her queries are being analyzed (if the person submitted more than one) rather than indicating a tweaking of the query that I added to try to improve the result.  In other words, all variations were tweaks done by the audience instead of by me.  Discussion of the results follows the tables, graphs, and queries below.

 Recall Medical Industry Top 3,000 Top 6,000 1a 3.0% 1b 17.4% TAR 3.0 SAL 67.3% 83.7% TAR 3.0 CAL 80.7% 88.5%

 Recall Law Top 3,000 Top 6,000 2 1.0% 3a 36.1% 42.3% 3b 45.3% 60.1% 3c 47.2% 62.6% 4 11.6% 13.8% TAR 3.0 SAL 63.5% 82.3% TAR 3.0 CAL 77.8% 87.8%

1a)  Hospital AND New AND therapies
1b)  Hospital AND New AND (physicians OR doctors)
2)   Copyright AND mickey AND mouse
3a)  Schedule OR Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement
3b)  Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal
3c)  Amendments OR Trial OR Jury OR Judge OR Circuit OR Courtroom OR Judgement OR trial OR law OR Patent OR legal OR Plaintiff OR Defendant
4)  Privacy OR (Personally AND Identifiable AND Information) OR PII OR (Protected AND Speech)

TAR won across the board, as in previous iterations of the challenge.  Only one person submitted queries for the medical industry topic.  His/her revised query did a better job of finding relevant documents, but still returned fewer than 3,000 documents and fared far worse than TAR — the query was just not broad enough to achieve high recall.  Three people submitted queries on the law topic.  One of those people revised the query a few times and got decent results (shown in green), but still fell far short of the TAR result, with review of 6,000 documents from the best query finding fewer relevant documents than review of half as many documents with TAR 3.0 SAL (TAR 3.0 CAL did even better).  It is unfortunate that the audience was so small, since a larger audience might have done better by learning from each other’s submissions.  Hopefully I’ll be able to do this with a bigger audience in the future.

# Photos from ILTACON 2018

ILTACON 2018 was held at the Gaylord National Resort & Convention Center in National Harbor, Maryland.  I wasn’t able to attend the sessions (so I don’t have any notes to share) because I was manning the Clustify booth in the exhibit hall, but I did take a lot of photos which you can view here.  The theme for the reception this year was video games, in case you are wondering about the oddly dressed people in some of the photos.