ILTACON 2018 was held at the Gaylord National Resort & Convention Center in National Harbor, Maryland. I wasn’t able to attend the sessions (so I don’t have any notes to share) because I was manning the Clustify booth in the exhibit hall, but I did take a lot of photos which you can view here. The theme for the reception this year was video games, in case you are wondering about the oddly dressed people in some of the photos.
Should proportionality arguments allow producing parties to get away with poor productions simply because they wasted a lot of effort due to an extremely bad algorithm? This article examines one such bad algorithm that has been used in major review platforms, and shows that it could be made vastly more effective with a very minor tweak. Are lawyers who use platforms lacking the tweak committing malpractice by doing so?
Last year I was moderating a panel on TAR (predictive coding) and I asked the audience what recall level they normally aim for when using TAR. An attendee responded that it was a bad question because proportionality only requires a reasonable effort. Much of the audience expressed agreement. This should concern everyone. If quality of result (e.g., achieving a certain level of recall) is the goal, the requesting party really has no business asking how the result was achieved–any effort wasted by choosing a bad algorithm is borne by the producing party. On the other hand, if the target is expenditure of a certain amount of effort, doesn’t the requesting party have the right to know, and to object, if the producing party has chosen a methodology that is extremely inefficient?
The algorithm I’ll be picking on today is a classifier called 1-nearest neighbor, or 1-NN. You may be using it without ever having heard that name, so pay attention to my description of it and see if it sounds familiar. To predict whether a document is relevant, 1-NN finds the single most similar training document and predicts the relevance of the unreviewed document to be the same. If a relevance score is desired instead of a yes/no relevance prediction, the relevance score can be taken to be the similarity value if the most similar training document is relevant, and it can be taken to be the negative of the similarity value if the most similar training document is non-relevant. Here is a precision-recall curve for the 1-NN algorithm used in a TAR 1.0 workflow trained with randomly-selected documents:
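To make the description concrete, here is a minimal sketch of that scoring rule (NumPy with cosine similarity over pre-built document vectors is my illustrative assumption; commercial platforms may vectorize documents and measure similarity differently):

```python
import numpy as np

def one_nn_score(doc_vec, train_vecs, train_labels):
    """Score a document with 1-NN as described above: find the single
    most similar training document and return +similarity if that
    document is relevant, -similarity if it is non-relevant."""
    # Cosine similarity of the unreviewed document against every
    # training document (rows of train_vecs).
    sims = train_vecs @ doc_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec))
    nearest = int(np.argmax(sims))
    return float(sims[nearest]) if train_labels[nearest] else float(-sims[nearest])
```

Sorting all unreviewed documents by this score from high to low produces the ranked list whose precision-recall behavior is plotted below.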
The precision falls off a cliff above 60% recall. This is not due to inadequate training–the cliff shown above will not go away no matter how much training data you add. To understand the implications, realize that if you sort the documents by relevance score and review from the top down until you reach the desired level of recall, 1/P at that recall tells you the average number of documents you’ll review for each relevant document you find. At 60% recall, precision is 67%, so you’ll review 1.5 documents (1/0.67 ≈ 1.5) for each relevant document you find. There is some effort wasted in reviewing those 0.5 non-relevant documents for each relevant document you find, but it’s not too bad. If you keep reviewing documents until you reach 70% recall, things get much worse. Precision drops to about 8%, so you’ll encounter so many non-relevant documents past 60% recall that you’ll end up reviewing 12.5 documents for each relevant document you find. You would surely be tempted to argue that proportionality says you should be able to stop at 60% recall, because the small gain in result quality from going to 70% recall would cost nearly ten times as much review effort. But does it really have to be so hard to get to 70% recall?
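The review-effort arithmetic above can be checked with a quick calculation (the population of 1,000 relevant documents is a hypothetical chosen only for illustration; the precision values are the ones quoted from the curve):

```python
def docs_reviewed(num_relevant, recall, precision):
    """Total documents reviewed to reach the given recall, where
    `precision` is the cumulative precision at that depth of the
    ranked list: relevant documents found divided by precision."""
    relevant_found = num_relevant * recall
    return relevant_found / precision

# Hypothetical population with 1,000 relevant documents.
at_60 = docs_reviewed(1000, 0.60, 0.67)  # ~896 docs, i.e. 1.5 per relevant doc
at_70 = docs_reviewed(1000, 0.70, 0.08)  # 8,750 docs, i.e. 12.5 per relevant doc
```

The jump from roughly 900 reviewed documents to 8,750 for the last ten points of recall is the cliff in tabular form.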
It’s very easy to come up with an algorithm that can reach higher recall without so much review effort once you understand why the performance cliff occurs. When you sort the documents by relevance score with 1-NN, the documents where the most similar training document is relevant will be at the top of the list. The performance cliff occurs when you start digging into the documents where the most similar training document is non-relevant. The 1-NN classifier does a terrible job of determining which of those documents has the best chance of being relevant because it ignores valuable information that is available. Consider two documents, X and Y, that both have a non-relevant training document as the most similar training document, but document X has a relevant training document as the second most similar training document and document Y has a non-relevant training document as the second most similar. We would expect X to have a better chance of being relevant than Y, all else being equal, but 1-NN cannot distinguish between the two because it pays no attention to the second most similar training document. Here is the result for 2-NN, which takes the two most similar training documents into account:
Notice that 2-NN easily reaches 70% recall (1/P is 1.6 instead of 12.5), but it does have a performance cliff of its own at a higher level of recall because it fails to make use of information about the third most similar training document. If we utilize information about the 40 most similar training documents we get much better performance as shown by the solid lines here:
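The tweak really is minimal: a k-NN score can simply sum the similarities of the k most similar training documents, counting relevant neighbors positively and non-relevant ones negatively. Here is a sketch (cosine similarity over document vectors and the signed-sum weighting are my illustrative assumptions, not necessarily what any particular platform uses):

```python
import numpy as np

def knn_score(doc_vec, train_vecs, train_labels, k=40):
    """k-NN relevance score: sum of similarities to the k most similar
    training documents, signed by their labels (+ for relevant, - for
    non-relevant). With k=1 this reduces to the 1-NN score; k=40
    corresponds to the 40-NN curves discussed above."""
    sims = train_vecs @ doc_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec))
    top = np.argsort(sims)[::-1][:k]  # indices of the k nearest neighbors
    signs = np.where(np.asarray(train_labels)[top], 1.0, -1.0)
    return float(np.sum(signs * sims[top]))
```

With k=2 this scorer already separates the X and Y documents from the example above, which 1-NN scores identically.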
It was the presence of non-relevant training documents that tripped up the 1-NN algorithm: a non-relevant nearest neighbor effectively hid the evidence (similar training documents that were relevant) that a document might be relevant. You might therefore think the performance cliff could be avoided by omitting non-relevant documents from the training. The result of doing that is shown with dashed lines in the figure above. Omitting non-relevant training documents does help 1-NN at high recall, though it is still far worse than 40-NN with the non-relevant training documents included (omitting the non-relevant training documents actually harms 40-NN, as shown by the red dashed line). A workflow that focuses on reviewing documents that are likely to be relevant, such as TAR 2.0, rather than training with random documents, will be less affected by 1-NN’s shortcomings, but why would you ever suffer the poor performance of 1-NN when 40-NN requires such a minimal modification of the algorithm?
You might wonder whether the performance cliff shown above is just an anomaly. Here are precision-recall curves for several additional categorization tasks with 1-NN on the left and 40-NN on the right.
Sometimes the 1-NN performance cliff occurs at high enough recall to allow a decent production, but sometimes it keeps you from finding even half of the relevant documents. Should a court accept less than 50% recall when the most trivial tweak to the algorithm could have achieved much higher recall with roughly the same amount of document review?
Of course, there are many factors beyond the quality of the classifier, such as the choice of TAR 1.0 (SPL and SAL), TAR 2.0 (CAL), or TAR 3.0 workflows, that impact the efficiency of the process. The research by Grossman and Cormack that courts have relied upon to justify the use of TAR, on the grounds that it reaches recall comparable to or better than an exhaustive human review, is based on CAL (TAR 2.0) with good classifiers, whereas some popular software uses TAR 1.0 (less efficient if documents will be reviewed before production) and poor classifiers such as 1-NN. If the producing party vows to reach high recall and bears the cost of choosing bad software and/or processes to achieve it, there isn’t much for the requesting party to complain about (though the producing party could have a bone to pick with an attorney or service provider who recommended an inefficient approach). On the other hand, if the producing party argues that low recall should be tolerated because decent recall would require too much effort, it seems appropriate to ask whether the algorithms used are unnecessarily inefficient.
During my presentation at the South Central eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding). This is similar to the experiment done a few months earlier. See this article for more details. The audience again worked in groups to construct keyword searches for two topics. One topic, articles on law, was the same as last time. The other topic, the medical industry, was new (it replaced biology).
Performance was evaluated by comparing the recall achieved for equal amounts of document review effort (the population was fully categorized in advance, so measurements are exact, not estimates). Recall for the top 3000 keyword search matches was compared to recall from reviewing 202 training documents (2 seed documents plus 200 cluster centers using the TAR 3.0 method) and 2798 documents having the highest relevance scores from TAR. Similarly, recall from the top 6000 keyword search matches was compared to recall from review of 6000 documents with TAR. Recall from all documents matching a search query was also measured to find the maximum recall that could be achieved with the query.
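The evaluation just described, recall at equal review effort over a fully categorized population, amounts to a simple computation (a minimal sketch; the ranked list comes from whatever the search engine or TAR tool produces, sorted by match quality or relevance score):

```python
def recall_at_effort(ranked_docs, relevant_set, effort):
    """Recall achieved by reviewing the top `effort` documents of a
    ranked list, given the full set of relevant document IDs. Exact,
    not an estimate, because the population is fully labeled."""
    found = sum(1 for d in ranked_docs[:effort] if d in relevant_set)
    return found / len(relevant_set)
```

Comparing `recall_at_effort(search_ranking, relevant, 3000)` against `recall_at_effort(tar_ranking, relevant, 3000)` is the head-to-head measurement reported in the tables below.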
The search queries are shown after the performance tables and graphs. When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to remove restrictions that were limiting the number of relevant documents that could be found. The results are discussed at the end of the article.
| Query | Total Matches | Top 3,000 | Top 6,000 | All |
| --- | --- | --- | --- | --- |

| Query | Total Matches | Top 3,000 | Top 6,000 | All |
| --- | --- | --- | --- | --- |
1a) medical AND (industry OR business) AND NOT (scientific OR research)
1b) medical AND (industry OR business)
2) (revenue OR finance OR market OR brand OR sales) AND (hospital OR health OR medical OR clinical)
3a) (medical OR hospital OR doctor) AND (HIPPA OR insurance)
3b) medical OR hospital OR doctor OR HIPPA OR insurance
4a) (earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine) AND NOT (study OR research OR academic)
4b) earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine
5) FRCP OR Fed OR litigation OR appeal OR immigration OR ordinance OR legal OR law OR enact OR code OR statute OR subsection OR regulation OR rules OR precedent OR (applicable AND law) OR ruling
6) judge OR (supreme AND court) OR court OR legislation OR legal OR lawyer OR judicial OR law OR attorney
As before, TAR won across the board, but there were some surprises this time.
For the medical industry topic, review of 3000 documents with TAR achieved higher recall than any keyword search achieved with review of 6000 documents, very similar to results from a few months ago. When all documents matching the medical industry search queries were analyzed, two queries did achieve high recall (3b and 4b, which are queries I tweaked to achieve higher recall), but they did so by retrieving a substantial percentage of the 100,000 document population (16,756 and 58,510 documents respectively). TAR can reach any level of recall by simply taking enough documents from the sorted list—TAR doesn’t run out of matches like a keyword search does. TAR matches the 94.6% recall that query 4b achieved (requiring review of 58,510 documents) with review of only 15,500 documents.
Results for the law topic were more interesting. The two queries submitted for the law topic both performed better than any of the queries submitted for that topic a few months ago. Query 6 gave the best results, with TAR beating it by only a modest amount. If all 25,370 documents matching query 6 were reviewed, 95.7% recall would be achieved, which TAR could accomplish with review of 24,000 documents. It is worth noting that TAR 2.0 would be more efficient, especially at very high recall. TAR 3.0 gives the option to produce documents without review (not utilized for this exercise), plus computations are much faster due to there being vastly fewer training documents, which is handy for simulating a full review live in front of an audience in a few seconds.
The 2018 South Central eDiscovery and Information Governance Retreat was held at Lakeway Resort and Spa, outside of Austin. It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.
Blowing The Whistle
eDiscovery can be used as a weapon to drive up costs for an adversary. Make requests broad and make the other side reveal what they actually have. Ask for “all communications” rather than “all Office 365 emails” or you may miss something (for example, they may use Slack). The collection may be 1% responsive. How can it be culled defensibly? Ask for broad search terms, get hit rates, and then adjust. The hit rates don’t tell how many documents were actually relevant, so use sampling. When searching for patents, search for “123 patent” instead of just “123” to avoid false positives (patent references often use just the last 3 digits). This rarely happens, but you might get the producing party to disclose top matches for the queries and examine them to give feedback on desired adjustments. You should have a standard specification for the production format you want, and you should get it to the producing party as soon as possible, or you might get 20,000 emails produced in one large PDF that you’ll have to waste time dissecting, and metadata may be lost. If keyword search is used during collection, be aware that Office 365 currently doesn’t OCR non-searchable content, so it will be missed. Demand that the producing party OCR before applying any search terms. In one production there were a lot of “gibberish” emails returned because the search engine was matching “ING” to all words ending in “ing” rather than requiring the full word to match. If ediscovery disputes make it to the judge, it’s usually not a good thing since the judge may not be very technical.
Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than TAR. Results are posted here.
Beyond eDiscovery – Creating Context By Connecting Disparate Data
Beyond the custodian, who else had access to this file? Who should have access, and who shouldn’t? Forensics can determine who accessed or printed a confidential file. The Windows registry tracks how users access files. When you print, an image is stored. Figure out what else you can do with the tech you have. For example, use Sharepoint workflows to help with ediscovery. Predictive coding can be used with structured data. Favorite quote: “Anyone who says they can solve all of my problems with one tool is a big fat liar.”
Improving Review Efficiency By Maximizing The Use Of Clustering Technology
Clustering can lead to more consistent review by ensuring the same person reviews similar documents and reviews them together. The requesting party can use clustering to get an overview of what they’ve received. Image clustering identifies glyphs to determine document similarity, so it can detect things like a Nike logo, or it can be sensitive to the location on the page where the text occurs. It is important to get the noise (e.g., email footers) out of the data before clustering. Text messages and spreadsheets may cause problems. Clustering can be used for ECA or keyword generation, where it is not making final determinations for a document. It can reveal abbreviations scientists are using for technical terms. It can also be used to identify clusters that can be excluded from review (not relevant). It can be used to prioritize review, with more promising clusters reviewed first. Should you tell the other side you are using clustering to come up with keywords? No, you are just inviting controversy.
Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Migration to Office 365 and other cloud offerings can cause problems. Data can be dumped into the cloud without tracking where it went. Figuring out how to collect from the cloud can be difficult. Microsoft is always changing Office 365, making it difficult to stay on top of the changes. Favorite quote: “I’m always running to keep up. I should be skinnier, but I’m not.” Office 365 is supposed to have OCR soon. What if the cloud platform gets hacked? There can be throttling issues when collecting from OneDrive by downloading everything (not using Microsoft’s tool). Rollout of cloud services should be slow to make sure everyone knows what should be put in the cloud and what shouldn’t, and to ensure that you keep track of where everything is. Be careful about emailing passwords since they may be recorded — use ephemeral communications instead of email for that. Personal devices cause problems because custodians don’t like having their devices collected. Policy is critical, but it is not a cure-all. Policy must be surrounded by communication and re-certification to ensure it is followed. Google mail is not a good solution for restricting data location since attachments are copied to the local disk when they are viewed.
Achieving GDPR Compliance For Unstructured Content
Some technology was built for GDPR while other tech was built for some other purpose, like ediscovery, and tweaked for GDPR, so be careful. For example, you don’t want to have to collect the data before determining whether it contains PII. The California privacy law taking effect in 2020 is similar to GDPR, so U.S. companies cannot ignore the issue. Backup tapes should be deleted after 90 days. They are for emergencies, not retention. Older backups often don’t work (e.g., referenced network addresses are no longer valid).
Escalating Cyber Risk From The IT Department To The Boardroom
One very effective way to change a company’s culture with respect to security is to break people up into white vs. black teams and hold war games where one team attacks and the other tries to come up with the best way to defend against it. You need to point out both the risk and how to fix it to get the board’s attention. Show the board a graph with the expected value lost in a breach on the vertical axis and cost to eliminate the risk on the horizontal axis — points lying above the 45-degree line are risks that should be eliminated (doing so saves money). On average, a server breach costs 28% of operating costs. Investors may eventually care if someone on the board has a security certification. It is OK to question directors, but don’t call out their b.s. The Board cares most about what the CEO and CFO are saying. Ethical problems tend to happen when things are too siloed.