TAR, Proportionality, and Bad Algorithms (1-NN)

Should proportionality arguments allow producing parties to get away with poor productions simply because they wasted a lot of effort due to an extremely bad algorithm? This article examines one such bad algorithm that has been used in major review platforms, and shows that it could be made vastly more effective with a very minor tweak. Are lawyers who use platforms lacking the tweak committing malpractice by doing so?

Last year I was moderating a panel on TAR (predictive coding) and I asked the audience what recall level they normally aim for when using TAR. An attendee responded that it was a bad question because proportionality only required a reasonable effort. Much of the audience expressed agreement. This should concern everyone. If quality of result (e.g., achieving a certain level of recall) is the goal, the requesting party really has no business asking how the result was achieved–any effort wasted by choosing a bad algorithm is borne by the producing party. On the other hand, if the target is expenditure of a certain amount of effort, doesn’t the requesting party have the right to know and object if the producing party has chosen a methodology that is extremely inefficient?

The algorithm I’ll be picking on today is a classifier called 1-nearest neighbor, or 1-NN. You may be using it without ever having heard that name, so pay attention to my description of it and see if it sounds familiar. To predict whether a document is relevant, 1-NN finds the single most similar training document and predicts the relevance of the unreviewed document to be the same. If a relevance score is desired instead of a yes/no relevance prediction, the relevance score can be taken to be the similarity value if the most similar training document is relevant, and it can be taken to be the negative of the similarity value if the most similar training document is non-relevant. Here is a precision-recall curve for the 1-NN algorithm used in a TAR 1.0 workflow trained with randomly-selected documents:

[Figure: knn_1]
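For readers who want to see the scoring rule in concrete terms, here is a minimal Python sketch of 1-NN as described above. It is only an illustration, not the implementation used by any particular review platform. The cosine similarity over term-weight dictionaries and the names `cosine_similarity` and `one_nn_score` are assumptions made for the example; the description above does not depend on a specific similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def one_nn_score(doc, training):
    """Signed 1-NN relevance score.

    training is a list of (vector, is_relevant) pairs. The score is the
    similarity to the single most similar training document, negated when
    that training document is non-relevant.
    """
    best_sim, best_label = max(
        ((cosine_similarity(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
    )
    return best_sim if best_label else -best_sim

# Hypothetical toy training set: one relevant and one non-relevant document.
training = [({"merger": 1.0}, True), ({"lunch": 1.0}, False)]
print(one_nn_score({"merger": 1.0, "lunch": 0.1}, training) > 0)  # nearest is relevant
print(one_nn_score({"lunch": 1.0}, training))                     # nearest is non-relevant
```

Everything other than the single nearest training document is ignored by the score.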

The precision falls off a cliff above 60% recall. This is not due to inadequate training–the cliff shown above will not go away no matter how much training data you add. To understand the implications, realize that if you sort the documents by relevance score and review from the top down until you reach the desired level of recall, 1/P at that recall tells the average number of documents you’ll review for each relevant document you find. At 60% recall, precision is 67%, so you’ll review 1.5 documents (1/0.67 = 1.5) for each relevant document you find. There is some effort wasted in reviewing those 0.5 non-relevant documents for each relevant document you find, but it’s not too bad. If you keep reviewing documents until you reach 70% recall, things get much worse. Precision drops to about 8%, so you’ll encounter so many non-relevant documents after you get past 60% recall that you’ll end up reviewing 12.5 documents for each relevant document you find. You would surely be tempted to argue that proportionality says you should be able to stop at 60% recall because the small gain in result quality of going from 60% recall to 70% recall would cost nearly ten times as much review effort. But does it really have to be so hard to get to 70% recall?
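The arithmetic above can be checked with a few lines of Python. This is just a sketch: the precision and recall values are the ones quoted in the text, while the total of 1,000 relevant documents is a hypothetical number chosen for illustration (the ratio does not depend on it).

```python
def docs_per_relevant(precision):
    """Average number of documents reviewed per relevant document found (1/P)."""
    return 1.0 / precision

def total_review(recall, precision, n_relevant):
    """Documents reviewed from the top of the ranking to reach the given recall."""
    return recall * n_relevant / precision

n_relevant = 1000  # hypothetical population size, for illustration only

effort_60 = total_review(0.60, 0.67, n_relevant)
effort_70 = total_review(0.70, 0.08, n_relevant)

print(round(docs_per_relevant(0.67), 1))  # 1.5
print(round(docs_per_relevant(0.08), 1))  # 12.5
print(round(effort_70 / effort_60, 1))    # 9.8
```

The last number is the "nearly ten times" figure: reaching 70% recall takes roughly 9.8 times as much total review as stopping at 60%.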

It’s very easy to come up with an algorithm that can reach higher recall without so much review effort once you understand why the performance cliff occurs. When you sort the documents by relevance score with 1-NN, the documents where the most similar training document is relevant will be at the top of the list. The performance cliff occurs when you start digging into the documents where the most similar training document is non-relevant. The 1-NN classifier does a terrible job of determining which of those documents has the best chance of being relevant because it ignores valuable information that is available. Consider two documents, X and Y, that both have a non-relevant training document as the most similar training document, but document X has a relevant training document as the second most similar training document and document Y has a non-relevant training document as the second most similar. We would expect X to have a better chance of being relevant than Y, all else being equal, but 1-NN cannot distinguish between the two because it pays no attention to the second most similar training document. Here is the result for 2-NN, which takes the two most similar training documents into account:

[Figure: knn_2]

Notice that 2-NN easily reaches 70% recall (1/P is 1.6 instead of 12.5), but it does have a performance cliff of its own at a higher level of recall because it fails to make use of information about the third most similar training document. If we utilize information about the 40 most similar training documents we get much better performance as shown by the solid lines here:

[Figure: knn_40]
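To make the "minor tweak" concrete, here is one possible way to extend the score from one neighbor to k neighbors. The text does not specify how the k neighbors are combined, so the averaging of signed similarities below is an assumption chosen because it reduces to the 1-NN score when k is 1. The function name `knn_score` and the similarity values are hypothetical.

```python
import heapq

def knn_score(neighbors, k):
    """Average signed similarity over the k most similar training documents.

    neighbors is an iterable of (similarity, is_relevant) pairs, one per
    training document. Relevant neighbors count positively, non-relevant
    neighbors count negatively. (The combination rule is an assumption.)
    """
    top = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    if not top:
        return 0.0
    return sum(sim if rel else -sim for sim, rel in top) / len(top)

# Documents X and Y from the example above (made-up similarity values):
# both have a non-relevant nearest neighbor, but X's second neighbor is relevant.
x = [(0.90, False), (0.85, True), (0.10, False)]
y = [(0.90, False), (0.85, False), (0.10, True)]

print(knn_score(x, 1) == knn_score(y, 1))  # True: 1-NN cannot tell them apart
print(knn_score(x, 2) > knn_score(y, 2))   # True: 2-NN ranks X above Y
```

The only change from 1-NN is the value of k, which is why the extra effort of moving to 40-NN is so small.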

It was the presence of non-relevant training documents that tripped up the 1-NN algorithm because the non-relevant training document effectively hid the existence of evidence (similar training documents that were relevant) that a document might be relevant, so you might think the performance cliff could be avoided by omitting non-relevant documents from the training. The result of doing that is shown with dashed lines in the figure above. Omitting non-relevant training documents does help 1-NN at high recall, though it is still far worse than 40-NN with the non-relevant training documents included (omitting the non-relevant training documents actually harms 40-NN, as shown by the red dashed line). A workflow that focuses on reviewing documents that are likely to be relevant, such as TAR 2.0, rather than training with random documents, will be less impacted by 1-NN’s shortcomings, but why would you ever suffer the poor performance of 1-NN when 40-NN requires such a minimal modification of the algorithm?

You might wonder whether the performance cliff shown above is just an anomaly. Here are precision-recall curves for several additional categorization tasks with 1-NN on the left and 40-NN on the right.

1nn_vs_40nn_several_tasks

Sometimes the 1-NN performance cliff occurs at high enough recall to allow a decent production, but sometimes it keeps you from finding even half of the relevant documents. Should a court accept less than 50% recall when the most trivial tweak to the algorithm could have achieved much higher recall with roughly the same amount of document review?

Of course, there are many factors beyond the quality of the classifier, such as the choice of TAR 1.0 (SPL and SAL), TAR 2.0 (CAL), or TAR 3.0 workflows, that impact the efficiency of the process. The research by Grossman and Cormack that courts have relied upon to justify the use of TAR because it reaches recall that is comparable to or better than an exhaustive human review is based on CAL (TAR 2.0) with good classifiers, whereas some popular software uses TAR 1.0 (less efficient if documents will be reviewed before production) and poor classifiers such as 1-NN. If the producing party vows to reach high recall and bears the cost of choosing bad software and/or processes to achieve that, there isn’t much for the requesting party to complain about (though the producing party could have a bone to pick with an attorney or service provider who recommended an inefficient approach). On the other hand, if the producing party argues that low recall should be tolerated because decent recall would require too much effort, it seems that asking whether the algorithms used are unnecessarily inefficient would be appropriate.

TAR vs. Keyword Search Challenge, Round 2

During my presentation at the South Central eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding). This is similar to the experiment done a few months earlier. See this article for more details. The audience again worked in groups to construct keyword searches for two topics. One topic, articles on law, was the same as last time. The other topic, the medical industry, was new (it replaced biology).

Performance was evaluated by comparing the recall achieved for equal amounts of document review effort (the population was fully categorized in advance, so measurements are exact, not estimates). Recall for the top 3000 keyword search matches was compared to recall from reviewing 202 training documents (2 seed documents plus 200 cluster centers using the TAR 3.0 method) and 2798 documents having the highest relevance scores from TAR. Similarly, recall from the top 6000 keyword search matches was compared to recall from review of 6000 documents with TAR. Recall from all documents matching a search query was also measured to find the maximum recall that could be achieved with the query.
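The comparison described above boils down to measuring recall at a fixed review budget. A minimal sketch of that measurement follows; the function name `recall_at` and the tiny document list are hypothetical, and for TAR the ranked list would be the training documents followed by the highest-scoring remaining documents.

```python
def recall_at(ranked_ids, relevant_ids, cutoff):
    """Fraction of all relevant documents found in the first `cutoff` reviewed."""
    found = sum(1 for doc_id in ranked_ids[:cutoff] if doc_id in relevant_ids)
    return found / len(relevant_ids)

# Hypothetical example: five documents in review order, four relevant overall
# (one relevant document, d9, is never retrieved).
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d3", "d9"}

print(recall_at(ranked, relevant, 2))  # 0.5
print(recall_at(ranked, relevant, 5))  # 0.75
```

Because the whole population was categorized in advance, `relevant_ids` is known exactly, which is why the reported recall values are measurements rather than estimates.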

The search queries are shown after the performance tables and graphs. When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to remove restrictions that were limiting the number of relevant documents that could be found. The results are discussed at the end of the article.

Medical Industry		Recall
Query	Total Matches	Top 3,000	Top 6,000	All
1a	1,618	14.4%		14.4%
1b	3,882	32.4%	40.6%	40.6%
2	7,684	30.3%	42.2%	46.6%
3a	1,714	22.4%		22.4%
3b	16,756	32.7%	44.6%	71.1%
4a	33,925	15.3%	20.3%	35.2%
4b	58,510	27.9%	40.6%	94.5%
TAR		67.3%	83.7%

Law		Recall
Query	Total Matches	Top 3,000	Top 6,000	All
5	36,245	38.8%	56.4%	92.3%
6	25,370	51.9%	72.4%	95.7%
TAR		63.5%	82.3%

1a) medical AND (industry OR business) AND NOT (scientific OR research)
1b) medical AND (industry OR business)
2) (revenue OR finance OR market OR brand OR sales) AND (hospital OR health OR medical OR clinical)
3a) (medical OR hospital OR doctor) AND (HIPPA OR insurance)
3b) medical OR hospital OR doctor OR HIPPA OR insurance
4a) (earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine) AND NOT (study OR research OR academic)
4b) earnings OR profits OR management OR executive OR recall OR (board AND directors) OR healthcare OR medical OR health OR hospital OR physician OR nurse OR marketing OR pharma OR report OR GlaxoSmithKline OR (united AND health) OR AstraZeneca OR Gilead OR Sanofi OR financial OR malpractice OR (annual AND report) OR provider OR HMO OR PPO OR telemedicine
5) FRCP OR Fed OR litigation OR appeal OR immigration OR ordinance OR legal OR law OR enact OR code OR statute OR subsection OR regulation OR rules OR precedent OR (applicable AND law) OR ruling
6) judge OR (supreme AND court) OR court OR legislation OR legal OR lawyer OR judicial OR law OR attorney

As before, TAR won across the board, but there were some surprises this time.

For the medical industry topic, review of 3000 documents with TAR achieved higher recall than any keyword search achieved with review of 6000 documents, very similar to results from a few months ago. When all documents matching the medical industry search queries were analyzed, two queries did achieve high recall (3b and 4b, which are queries I tweaked to achieve higher recall), but they did so by retrieving a substantial percentage of the 100,000 document population (16,756 and 58,510 documents respectively). TAR can reach any level of recall by simply taking enough documents from the sorted list—TAR doesn’t run out of matches like a keyword search does. TAR matches the 94.6% recall that query 4b achieved (requiring review of 58,510 documents) with review of only 15,500 documents.
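The point that a ranked list never "runs out of matches" can be illustrated with a short sketch that walks down a sorted list until a target recall is reached. The function name `docs_needed` and the toy labels are hypothetical; the real figures quoted above came from the fully categorized population.

```python
def docs_needed(ranked_labels, target_recall):
    """Number of documents to review, from the top, to reach target_recall.

    ranked_labels is a list of booleans (True = relevant) in ranked order.
    Returns None if there are no relevant documents at all.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return None
    found = 0
    for position, is_relevant in enumerate(ranked_labels, start=1):
        found += is_relevant
        if found / total_relevant >= target_recall:
            return position
    return None

# Hypothetical ranking of seven documents, four of them relevant.
labels = [True, True, False, True, False, False, True]
print(docs_needed(labels, 0.75))  # 4
print(docs_needed(labels, 1.0))   # 7
```

A keyword search, by contrast, caps recall at whatever its full set of matches contains, which is the "All" column in the tables.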

Results for the law topic were more interesting. The two queries submitted for the law topic both performed better than any of the queries submitted for that topic a few months ago. Query 6 gave the best results, with TAR beating it by only a modest amount. If all 25,370 documents matching query 6 were reviewed, 95.7% recall would be achieved, which TAR could accomplish with review of 24,000 documents. It is worth noting that TAR 2.0 would be more efficient, especially at very high recall. TAR 3.0 gives the option to produce documents without review (not utilized for this exercise), plus computations are much faster due to there being vastly fewer training documents, which is handy for simulating a full review live in front of an audience in a few seconds.

Highlights from the South Central eDiscovery & IG Retreat 2018

The 2018 South Central eDiscovery and Information Governance Retreat was held at Lakeway Resort and Spa, outside of Austin. It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.

Blowing The Whistle
eDiscovery can be used as a weapon to drive up costs for an adversary. Make requests broad and make the other side reveal what they actually have. Ask for “all communications” rather than “all Office 365 emails” or you may miss something (for example, they may use Slack). The collection may be 1% responsive. How can it be culled defensibly? Ask for broad search terms, get hit rates, and then adjust. The hit rates don’t tell how many documents were actually relevant, so use sampling. When searching for patents, search for “123 patent” instead of just “123” to avoid false positives (patent references often use just the last 3 digits). This rarely happens, but you might get the producing party to disclose top matches for the queries and examine them to give feedback on desired adjustments. You should have a standard specification for the production format you want, and you should get it to the producing party as soon as possible, or you might get 20,000 emails produced in one large PDF that you’ll have to waste time dissecting, and metadata may be lost. If keyword search is used during collection, be aware that Office 365 currently doesn’t OCR non-searchable content, so it will be missed. Demand that the producing party OCR before applying any search terms. In one production there were a lot of “gibberish” emails returned because the search engine was matching “ING” to all words ending in “ing” rather than requiring the full word to match. If ediscovery disputes make it to the judge, it’s usually not a good thing since the judge may not be very technical.
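The “ING” problem mentioned above is the difference between substring matching and whole-word matching. The following Python sketch shows the distinction using the standard `re` module; it is an illustration only and says nothing about how any particular search engine is implemented. The function names and sample sentences are made up.

```python
import re

def matches_substring(text, term):
    """Naive match: the term may appear inside a longer word."""
    return term.lower() in text.lower()

def matches_whole_term(text, term):
    """Whole-word match: the term must be bounded by non-word characters."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

email_a = "Meeting notes attached"
email_b = "Wire sent via ING today"

print(matches_substring(email_a, "ING"))   # True  (false positive on "Meeting")
print(matches_whole_term(email_a, "ING"))  # False
print(matches_whole_term(email_b, "ING"))  # True

# The patent tip works the same way: the longer phrase is more selective.
print(matches_whole_term("see the 123 patent", "123 patent"))  # True
print(matches_whole_term("invoice 123", "123 patent"))         # False
```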

Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than TAR. Results are posted here.

Beyond eDiscovery – Creating Context By Connecting Disparate Data
Beyond the custodian, who else had access to this file? Who should have access, and who shouldn’t? Forensics can determine who accessed or printed a confidential file. The Windows registry tracks how users access files. When you print, an image is stored. Figure out what else you can do with the tech you have. For example, use SharePoint workflows to help with ediscovery. Predictive coding can be used with structured data. Favorite quote: “Anyone who says they can solve all of my problems with one tool is a big fat liar.”

Improving Review Efficiency By Maximizing The Use Of Clustering Technology
Clustering can lead to more consistent review by ensuring the same person reviews similar documents and reviews them together. The requesting party can use clustering to get an overview of what they’ve received. Image clustering identifies glyphs to determine document similarity, so it can detect things like a Nike logo, or it can be sensitive to the location on the page where the text occurs. It is important to get the noise (e.g., email footers) out of the data before clustering. Text messages and spreadsheets may cause problems. Clustering can be used for ECA or keyword generation, where it is not making final determinations for a document. It can reveal abbreviations scientists are using for technical terms. It can also be used to identify clusters that can be excluded from review (not relevant). It can be used to prioritize review, with more promising clusters reviewed first. Should you tell the other side you are using clustering to come up with keywords? No, you are just inviting controversy.

Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Migration to Office 365 and other cloud offerings can cause problems. Data can be dumped into the cloud without tracking where it went. Figuring out how to collect from the cloud can be difficult. Microsoft is always changing Office 365, making it difficult to stay on top of the changes. Favorite quote: “I’m always running to keep up. I should be skinnier, but I’m not.” Office 365 is supposed to have OCR soon. What if the cloud platform gets hacked? There can be throttling issues when collecting from OneDrive by downloading everything (not using Microsoft’s tool). Rollout of cloud services should be slow to make sure everyone knows what should be put in the cloud and what shouldn’t, and to ensure that you keep track of where everything is. Be careful about emailing passwords since they may be recorded; use ephemeral communications instead of email for that. Personal devices cause problems because custodians don’t like having their devices collected. Policy is critical, but it is not a cure-all. Policy must be surrounded by communication and re-certification to ensure it is followed. Google mail is not a good solution for restricting data location since attachments are copied to the local disk when they are viewed.

Achieving GDPR Compliance For Unstructured Content
Some technology was built for GDPR while other tech was built for some other purpose like ediscovery and tweaked for GDPR, so be careful. For example, you don’t want to have to collect the data before determining whether it contains PII. The California privacy law taking effect in 2020 is similar to GDPR, so U.S. companies cannot ignore the issue. Backup tapes should be deleted after 90 days. They are for emergencies, not retention. Older backups often don’t work (e.g., referenced network addresses are no longer valid).

Escalating Cyber Risk From The IT Department To The Boardroom
One very effective way to change a company’s culture with respect to security is to break people up into white vs. black teams and hold war games where one team attacks and the other tries to come up with the best way to defend against it. You need to point out both the risk and how to fix it to get the board’s attention. Show the board a graph with the expected value lost in a breach on the vertical axis and cost to eliminate the risk on the horizontal axis; points lying above the 45 degree line are risks that should be eliminated (doing so saves money). On average, a server breach costs 28% of operating costs. Investors may eventually care if someone on the board has a security certification. It is OK to question directors, but don’t call out their b.s. The Board cares most about what the CEO and CFO are saying. Ethical problems tend to happen when things are too siloed.
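The 45 degree line rule from the graph described above amounts to a simple comparison: a risk is worth eliminating when the expected loss exceeds the cost of the fix. A small sketch, with entirely made-up risk names and dollar figures:

```python
def risks_worth_eliminating(risks):
    """Return names of risks lying above the 45 degree line.

    risks is a list of (name, expected_loss, cost_to_eliminate) tuples.
    A point is above the line when expected_loss > cost_to_eliminate.
    """
    return [name for name, expected_loss, fix_cost in risks
            if expected_loss > fix_cost]

# Hypothetical numbers for illustration only.
risks = [
    ("unpatched server", 500_000, 40_000),
    ("legacy fax line", 5_000, 30_000),
    ("shared admin password", 200_000, 10_000),
]
print(risks_worth_eliminating(risks))
# ['unpatched server', 'shared admin password']
```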

TAR vs. Keyword Search Challenge

During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics. Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics). I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints. This article presents the results for all the queries submitted by the audience.

The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project. Still, it was entertaining. The topics are pretty simple, so a large percentage of the relevant documents can be found with a pretty simple search using some broad terms. For example, a search for “biology” would find 37% of the biology documents. A search for “law” would find 71% of the law articles. The trick is to find the relevant documents without pulling in too many of the non-relevant ones.

To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively). I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs. For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0). To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.

The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations. If the audience’s search query included phrases, they were converted to an AND search enclosed in parentheses. If the audience’s query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard). I noticed that there were a lot of irrelevant words that matched the wildcards. For example, “cell*” in a biology search would match cellphone, cellular, cellar, cellist, etc., but I excluded such words. I would highly recommend that people using keyword search check to see what their wildcards are actually matching–you may be pulling in a lot of irrelevant words. I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact). When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to retrieve more documents.
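Checking what a wildcard actually matches is easy to do if you can get a list of the words in the index. The sketch below uses Python's standard `fnmatch` module on a small made-up vocabulary; it is not how Clustify performs the expansion, and the function names and exclusion list are hypothetical.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """List every indexed word that a wildcard pattern would match."""
    pattern = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern))

def to_or_query(words):
    """Turn the words you decided to keep into a parenthesized OR search."""
    return "(" + " OR ".join(words) + ")"

# Hypothetical slice of an index vocabulary.
vocabulary = {"cell", "cells", "celled", "cellular", "cellphone",
              "cellar", "cellist", "enzyme"}

matches = expand_wildcard("cell*", vocabulary)
print(matches)
# ['cell', 'cellar', 'celled', 'cellist', 'cellphone', 'cells', 'cellular']

# Manually exclude the words that do not belong in a biology search.
excluded = {"cellphone", "cellular", "cellar", "cellist"}
kept = [w for w in matches if w not in excluded]
print(to_or_query(kept))  # (cell OR celled OR cells)
```

Reviewing the expanded list before running the search is the step that catches words like "cellar" and "cellist".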

The tables below show the results. The actual queries are displayed below the tables. Discussion of the results is at the end.

Biology		Recall
Query	Total Matches	Top 3,000	Top 6,000	All Matches
1	4,407	34.0%	47.2%	47.2%
2	13,799	37.3%	46.0%	80.9%
3	25,168	44.3%	60.9%	87.8%
4a	42	0.5%		0.5%
4b	2,283	20.9%		20.9%
TAR		72.1%	91.0%

Law		Recall
Query	Total Matches	Top 3,000	Top 6,000	All Matches
5a	2,914	35.8%		35.8%
5b	9,035	37.2%	49.3%	60.6%
6	534	2.9%		2.9%
7	27,288	32.3%	47.1%	79.1%
TAR		62.3%	80.4%

1) organism OR microorganism OR species OR DNA

2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)

3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria

4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

4b) (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))

5b) ruling OR judge OR justice OR (appellate OR appellant)

6) amendments OR FRE OR whistleblower

7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending

TAR beat keyword search across the board for both tasks. The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search. In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches. The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.

Similar experiments were performed later with many similarities but also some notable differences in the structure of the challenge and the results. You can read about them here: round 2, round 3, round 4, round 5, and round 6.

Highlights from Ipro Innovations 2018

The 17th annual Ipro Innovations conference was held at the Talking Stick Resort. It was well-organized with two and a half days of informative talks and fun activities. Early in the day everyone met in a large hall for the talks, whereas there were seven simultaneous breakout sessions later in the day. There were many sessions in computer labs where attendees could gain first-hand experience with the Ipro software. I could only attend the tail end of the conference because I was at the NorCal eDiscovery & IG Retreat earlier in the week. I’ve included my notes below. You can find my full set of photos here.

The keynote on the final day was delivered by Afterburner, a consulting firm promoting a “Flawless Execution” methodology based on military strategy. Their six steps of mission planning are: 1) determine the mission objective, 2) identify the threats, 3) identify your available and required resources, 4) evaluate lessons learned, 5) develop a course of action, and 6) plan for contingencies. The audience participated in exercises to illustrate how easily attention can be channelized, meaning that you focus on one thing at the expense of everything else. Channelized attention was the cause of a commercial airliner crash. To avoid being distracted by minor things (deadlines, cost, etc.), keep track of what it is most important to pay attention to (customers).

Tiana Van Dyk described her firm’s 1.5 year transition from Summation to Ipro’s Eclipse, including moving 325 cases over. Substantial time and preparation are needed to avoid problems and overcome resistance to change. Staff should not be allowed to access the new system without undergoing training. Case studies are useful to convince people to use new analytics tools. Start small with new analytics tools (email threading and near-dupe), then use clustering to remove some junk (football and LinkedIn emails), and finally TAR. Use sampling to demonstrate that things are working. Learn everything you can about the technology you have. Missteps can set you back terribly, causing bad rumors and fear. Continuous communication is important to minimize panic when there is a problem.

There were also talks on new functionality in the Ipro software. I gave a short presentation on how Ipro’s transition to the Clustify engine would improve TAR. There were several opportunities for Ipro customers to give feedback about the functionality they would like to see.

Highlights from the NorCal eDiscovery & IG Retreat 2018

The 2018 NorCal eDiscovery & IG Retreat was held at the Carmel Valley Ranch, location of the first Ing3nious retreat in 2011 (though the company wasn’t called Ing3nious at the time). It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.

Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than TAR. Results are posted here.

Information Governance In The Age Of Encryption And Ephemeral Communications
Facebook messenger has an ephemeral mode, though it is currently only available to Facebook executives. You can be forced to decrypt data (despite the 5th Amendment) if it can be proven that you have the password. Ephemeral communication is a replacement for in-person communication, but it can look bad (like you have something to hide). 53% of email is read on mobile devices, but personal devices often aren’t collected. Slack is useful for passing institutional knowledge along to new employees, but general counsel wants things deleted after 30 days. Some ephemeral communication tools have archiving options. You may want to record some conversations in email–you may need them as evidence in the future. Are there unencrypted copies of encrypted data in some locations?

Blowing The Whistle
eDiscovery can be used as a weapon to drive up costs for an adversary. The plaintiff should be skeptical about claims of burden–has appropriate culling been performed? Do a meet and confer as early as possible. Examine data for a few custodians and see if more are needed. A data dump is when a lot of non-relevant docs are produced (e.g., due to a broad search or a search that matches an email signature). Do sampling to test search terms. Be explicit about what production formatting you want (e.g., searchable PDF, color, metadata).
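As a rough illustration of what sampling to test a search term can look like, the sketch below estimates the fraction of a term's hits that are relevant from a reviewed random sample. The normal-approximation confidence interval is an assumption chosen for simplicity (it was not discussed in the session), and the sample counts are made up.

```python
import math

def estimate_precision(sample_labels, z=1.96):
    """Estimate the relevant fraction of a search term's hits from a sample.

    sample_labels: booleans for a random sample of the hits (True = relevant).
    Returns (estimate, lower, upper) using a normal-approximation interval;
    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(sample_labels)
    p = sum(sample_labels) / n
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical sample: 400 hits reviewed, 60 found relevant.
sample = [True] * 60 + [False] * 340
estimate, low, high = estimate_precision(sample)
print(f"{estimate:.1%} relevant (about {low:.1%} to {high:.1%})")
```

A term whose hits are mostly non-relevant in the sample is a candidate for narrowing before the full review begins.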

Emerging Technology And The Impact On eDiscovery
There may be a lack of policy for new data sources. Text messages and social media are becoming relevant for more cases. Your Facebook info can be accessed through your friends. Fitbit may show whether the person could have committed the murder. IP addresses can reveal whether email was sent from home or work. The change to the Twitter character limit may break some collection tools–QC early on to detect such problems. Vendors should have multiple tools. Communicate about what tech is involved and what you need to collect.

Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Cloud computing (infrastructure, storage, productivity, and web apps) will cause conflict between EU privacy law and US discovery. AWS provides lots of security options, but it can be difficult to get right (must be configured correctly). Startups aim to build fast and don’t think enough about how to get the data out. Are law firm clients looking at cloud agreements and how to export data? Free services (Facebook, Gmail, etc.) spy on users, which makes them inappropriate for corporate use where privacy is needed. Slack output is one long conversation. What about tools that provide a visualization? You may need the data, not just a screenshot. Understand the limit of repositories–Office 365 limits to 10GB of PST at a time. What about versioning storage? It is becoming more common as storage prices decline. Do you need to collect all versions of a document? “Computer ate my homework” excuses don’t fare well in court (e.g., production of privileged docs due to a bad mouse click, or missing docs matching a keyword search because they weren’t OCRed). GDPR requires knowing where the users are (not where the data is stored). Employees don’t want their private phones collected, so sandbox work stuff.

Employing Intelligence – Both Human And Artificial (AI) – To Reduce Overall eDiscovery Costs
You need to talk to custodians–the org chart doesn’t really tell you what you need to know. Search can show who communicates with whom about a topic. To discover an involved custodian who is not known to the attorney, look at the data and interview the ground troops. Look for a period when there is a lack of communication. Use sentiment analysis (including emojis). Watch for strange bytes in the review tool–they may be emojis that can only be viewed in the original app. Automate legal holds as much as possible. Escalate to a manager if the employee doesn’t respond to the hold in a timely manner. Filter on metadata to reduce the amount that goes into the load file. Sometimes things go wrong with the software (trained on biased data, not finding relevant spreadsheets, etc.). QC to ensure the human element doesn’t fail. Use phonetic search on audio files instead of transcribing before search. Analyze data as it comes in–you may spot months of missing email. Do proof of concept when selecting tools.
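
Spotting months of missing email can be automated with a simple check on message dates. The following is a minimal sketch, assuming the sent dates have already been extracted for one custodian; the function name `missing_months` is hypothetical.

```python
from datetime import date

def missing_months(email_dates):
    """Return (year, month) pairs with no email between the first and last date."""
    if not email_dates:
        return []
    seen = {(d.year, d.month) for d in email_dates}
    first, last = min(seen), max(seen)
    gaps = []
    year, month = first
    while (year, month) <= last:
        if (year, month) not in seen:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps

# Toy data: nothing in March, April, or December 2017.
dates = [date(2017, 1, 5), date(2017, 2, 10), date(2017, 5, 1),
         date(2017, 6, 2), date(2017, 7, 9), date(2017, 8, 3),
         date(2017, 9, 1), date(2017, 10, 4), date(2017, 11, 20),
         date(2018, 1, 3)]
gaps = missing_months(dates)
```

A gap is only a prompt for follow-up; it may reflect a leave of absence rather than missing or deleted data.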

Practical Discussion: eDiscovery Process With Law Firms, In-House And Vendor
Stick with a single vendor so you know it is done the same way every time. Figure out what your data sources are. Get social media data into the review platform in a usable form (e.g., Skype). Finding the existence of cloud data stores requires effort. How long is the cloud data being held (Twitter only holds the last 100 direct messages)? The company needs to provide the needed apps so employees aren’t tempted to go outside to get what they need.

Highlights from the SoCal eDiscovery & IG Retreat 2017

The 2017 SoCal eDiscovery & IG Retreat was held at the Pelican Hill Resort in Newport Coast, California. The format was somewhat different from other recent Ing3nious retreats, having a single session at a time instead of two sessions in parallel. My notes below provide some highlights. I’ve posted my full set of photos from the conference and nearby Crystal Cove here.

How Well Can Your Organization Protect Against Encrypted Traffic Threats?
Companies should be concerned about encrypted traffic, because they don’t know what is leaving their network. Get encryption keys for cloud services the company uses so you can check outgoing content and block all other encrypted traffic; if something legitimate breaks, employees will let you know. It is important to distinguish personal Dropbox use from corporate use. Make sure you have a policy that says all corporate devices are subject to inspection and monitoring. The CSO should report to the CEO rather than the CIO, or too much ends up being spent on technology with too little spent on risk reduction. Security tech must be kept up to date. Some security vendors are using artificial intelligence. The board of directors needs to be educated about their fiduciary duty to provide oversight for security, as established in a 1996 case in Delaware (see this article). In what country is the backup of your cloud data stored? That could be important during litigation. The amount of unstructured data companies have can be surprising, and represents additional risk. When the CSO reports to the board, he/she should speak in terms of risk (don’t use tech speak). Build in security from the beginning when starting new projects. GDPR violations could bring penalties of up to 4% of revenue. Guidance papers on GDPR are all over 40 pages long. “Internet of Things” devices (e.g., refrigerators) are typically not secure. Use DNS to detect attempts by IoT devices to call out. IoT is collecting data about you to sell. The book Future Crimes by Marc Goodman was recommended.

Using Technology To Reduce eDiscovery Spend
Artificial intelligence (AI) can be used before collection to reduce data volume. Have a conversation about what’s really needed and use ECA to cull by date, topic, etc. Process data from key players first. It is important for project managers to know the data. Parse out domain names, see who is talking to whom, see which folders people really have access to, and get rid of bad file types. Image the machine of the person who will be leaving, then tell them you will be imaging the machine in the near future and see what they delete. Use sentiment analysis and see if sentiment changes over time. Use clustering to identify stuff that can be culled (e.g., stuff about the NFL). Use clustering, rather than random sampling, to see what the data looks like. Redaction of things like social security numbers can be automated.
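
To make "parse out domain names, see who is talking to whom" concrete, here is a small sketch that counts sender-domain to recipient-domain pairs from raw From/To header strings. It uses Python's standard `email.utils.getaddresses`; the function name `domain_pairs` and the sample addresses are made up for illustration.

```python
from collections import Counter
from email.utils import getaddresses

def domain_pairs(messages):
    """Count (sender_domain, recipient_domain) pairs.

    messages: iterable of (from_header, to_header) strings.
    """
    counts = Counter()
    for from_header, to_header in messages:
        senders = getaddresses([from_header])
        recipients = getaddresses([to_header])
        for _, s_addr in senders:
            for _, r_addr in recipients:
                # Skip entries that did not parse into a usable address.
                if "@" in s_addr and "@" in r_addr:
                    s_dom = s_addr.rsplit("@", 1)[1].lower()
                    r_dom = r_addr.rsplit("@", 1)[1].lower()
                    counts[(s_dom, r_dom)] += 1
    return counts

messages = [
    ("Alice <alice@corp.example>", "bob@corp.example, Carol <carol@Vendor.example>"),
    ("bob@corp.example", "alice@corp.example"),
]
counts = domain_pairs(messages)
```

Sorting the resulting counts shows which outside domains dominate the traffic, which can suggest candidates for culling (newsletters, for example) before review.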

It’s All Greek To Me: Multi-Language Document Review from Shakespeare To FCPA
Examples were given of Craigslist ads seeking temporary people for foreign language document review, showing that companies performing such reviews may not have capable people on staff. Law firms are relying on external providers to manage reviews in languages in which they are not fluent. English in Singapore is not the same as English in the U.S. (different expressions) — cultural context is important. There are 6,900 languages around the world. Law firms must do diligence to ensure a language expert is trustworthy. Law firms don’t like being beta testers for technologies like TAR and machine translation. Communications in Asia are often not in text file format (e.g., chat applications) and can involve hundreds of thousands of non-standard emojis (how to even render them?). Facebook got a Palestinian man arrested by mistranslating his “good morning” to “attack them” (see this article). One speaker suggested Googling “fraudulent foreign language reviewers” (the top match is here). There was skepticism about the ALTA language proficiency test.

Artificial Intelligence – Facial Expression Analytics As A Competitive Advantage In Risk Mitigation
Monitoring emotional response can provide an advantage at trial. Universal emotions: joy, sadness, surprise, fear, anger, disgust, and contempt. The lawyer should avoid causing sadness since that is detrimental to being liked — let the witness do it. Emotional response can depend on demographics. For example, the contempt response depends on age, and women tend to show a larger fear response. Software can now detect emotion from facial photos very quickly. One panelist warned against using the iPhone X’s authentication via face recognition because Apple has software for detecting emotion and could monitor your mood. 80% of what a jury picks up on is non-verbal. Analyze video of depositions to look for ways to improve. Senior people at companies refuse to believe they don’t come across well, but they often show signs of contempt at questions they deem to be petty. There is no facial expression for deception — look for a shift in expression. Realize that software may not be making decisions in the same way as a human would. For example, a neural network that did a good job of distinguishing wolves from dogs was actually making the decision based on the presence or absence of snow in the background.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Bridging The Gap Between Inside And Outside Counsel: Next Generation Strategies For Collaborating On Complex Litigation Matters
Communicate about what you actually need or they may collect everything regardless of date or custodian, resulting in high costs for hosting. Insourcing is a trend — the company keeps the data in house (reduce cost and risk) and provides outside counsel with access. This means imposing technology on the outside counsel. One benefit of insourcing is that in house counsel learns about the data, which may help with future cases. Another trend is disaggregation, where legal tasks are split up among different law firms instead of using a single firm for everything. It is important to ensure that technologies being used are understood by all parties from the start to avoid big problems later. Paralegals can be good at keeping communication flowing between the outside attorney and the client. Tech companies that want people to adopt their products need to help outside counsel explain the benefits to clients.

Cyber And Data Security For The GC: How To Stay Out Of Headlines And Crosshairs
I couldn’t attend this panel because I had to catch my flight.

Highlights from the NorCal IG Retreat 2017

The 2017 NorCal Information Governance Retreat was held by Ing3nious at the Quail Lodge & Golf Club in Carmel Valley, California. After round table discussions, the retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend. I’ve posted additional photos here.

The intro to the round table discussions included some comments on the evolution of the Internet, the importance of searching for obscenities to find critical documents or to identify data that has been scrubbed (it is implausible that there are no emails containing obscenities for a failing project), the difficulty of searching for “IT” (meaning information technology rather than the pronoun), and the inability of many tools to search for emojis.
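
The difficulty with “IT” comes from search tools that ignore case. Where a tool supports regular expressions, a case-sensitive whole-word pattern is one possible workaround. This is a minimal sketch in Python, not something presented at the retreat, and it will still produce false hits in text written in all capitals.

```python
import re

# Case-sensitive, whole-word match: "IT" but not "it", "It", or "ITEM".
IT_PATTERN = re.compile(r"\bIT\b")

def mentions_it_department(text: str) -> bool:
    """Return True if the text contains the upper-case token IT."""
    return IT_PATTERN.search(text) is not None

docs = [
    "Please ask IT to reset the password.",
    "I think it is broken.",
    "The ITEM list is attached.",
]
hits = [d for d in docs if mentions_it_department(d)]
```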

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

How Well Can Your Organization Protect Against Encrypted Traffic Threats?
I couldn’t attend this

IG Analytics And Infonomics: The Future Is Now
I couldn’t attend this

Breaches Happen. Going On The Cyber Offense With Deception
Breach stories that were mentioned included Equifax, Target, an employee that built their own (insecure) tunnel to get data out to their home, and an employee that carried data out on a microSD card. In the RSA / Lockheed Martin breach, a Lockheed contractor was fooled by a phishing email, illustrating how hard it is to keep attackers out. Email is a very common source of breaches. A big mistake is not knowing that you’ve been breached. People put honeypots outside the firewall to detect attacks. It’s better to use deception technology, which puts decoys inside the firewall.

Social Media And Website Information Governance
There has been some regulation of social media, especially for certain industries. The SEC in 2012 required financial institutions to archive it. The FTC has been enforcing paid endorsement disclosure guidelines (e.g., Kim Kardashian’s endorsement of a morning sickness drug). Collecting evidence from social media is tricky. A screenshot could be photoshopped, so how do you prove it is legitimate? You should collect a screenshot, source code, metadata, and a digital signature with time stamp. Corporate policy on social media use will depend on the kind of company and the industry it is in. There should also be a policy on monitoring employees’ social media use. Companies using an internal social media system are asking for problems. How will they police/discipline improper usage? If an employee posts “Why haven’t I seen John lately?” and another replies that John has cancer, you have a problem. Does a company social media system really improve productivity? Can you find out who posted something anonymously on public social media? If they posted from Starbucks or a library, probably not (finding the IP address won’t reveal the person’s identity). This strategy worked for a bad review of a doctor that was thought to be from another doctor: 1) file in Federal court and get a court order to get the user’s IP address from the social media website, 2) go back to the judge and get a court order to get the ISP to give the identity of the person using that IP address at that time, 3) there is a motion to quash, which confirms that the right person was found (otherwise they wouldn’t bother to fight it).

Bridging The Gap Between Inside And Outside Counsel: Next Generation Strategies For Collaborating On Complex Litigation Matters
I couldn’t attend this

Preventing Inadvertent Disclosure In A Multi-Language World
Start by identifying teams and process. Be aware of cultural differences. Be aware of technological issues — there are 2 or 3 alternatives to MS Word that you might encounter for documents in Korean. Be aware of laws against removing certain documents from the country. There was disagreement among panel members about whether review quality of foreign documents was better in the U.S. due to reviewers better understanding U.S. law. Viewing a document in the U.S. that is stored on a server in the E.U. is not a valid work-around for restrictions on exporting the documents. Review in the U.S. is much cheaper than reviewing overseas (about 1/5 to 1/10 of the cost). Violation of GDPR puts 4% of revenue at risk, but a U.S. judge may not care. Take only what you need out of the country. Many tools work best when they are analyzing documents in a single language, so use language identification and separate documents by language before analysis. TAR may not work as well for non-English documents, but it does work.

What’s Your Trust Point?
I couldn’t attend this

Legal Tech And AI – Inventing The Future
Humans are better than computers at handling low-probability outlier events, because there is a lack of training data to teach machines to handle such things. It is important for the technology to be easy for the user to interact with. Legal clients are very cost averse, so a free trial of new tech is attractive.

The Cloud, New Technologies And Other Developments In Trade Secret Theft
I couldn’t attend this

Are You Prepared For The Impact Of Changing EU Data Privacy On U.S. Litigation?
I couldn’t attend this

IG Policy Pain Points In E-Discovery
Deletion of data that is not on hold 60 days after an employee leaves the company may not get everything since other custodians may have copies. You may find that employees have archived their emails on a local hard drive. Be clear about data ownership — wiping the phone of an employee that left the company may hit their personal data. The general counsel is often not involved in decisions like BYOD (treated as an IT decision), but they should be. Realize that having more data about employee behavior (e.g., GPS tracking) makes the company more responsible. You rarely need the employee’s phone since there is little data cached there (data is on mail servers, etc.). You should do info governance compliance testing to ensure that employees are following the procedures. Policies must be realistic — there won’t be perfect separation of work and personal activity. Flouted rules may be worse than no rules. Keep personal data separate (personal folder, personal email address, use phone for accessing Facebook). When doing an annual cleanup, what about the data from the employee who left the company? A study showed that 85% of stored data is rot. Have a checklist that you follow when an employee leaves — don’t wipe the computer without copying stuff you may need.

Highlights from DESI VII / ICAIL 2017

DESI (Discovery of Electronically Stored Information) is a one-day workshop within ICAIL (International Conference on Artificial Intelligence and Law), which is held every other year. The conference was held in London last month. Rumor has it that the next ICAIL will be in North America, perhaps Montreal.

I’m not going to go into the DESI talks based on papers and slides that are posted on the DESI VII website since you can read that content directly. The workshop opened with a keynote by Maura Grossman and Gordon Cormack where they talked about the history of TREC tracks that are relevant to e-discovery (Spam, Legal, and Total Recall), the limitation on the recall that can be achieved due to ambiguous relevance (reviewer disagreement) for some documents, and the need for high recall when it comes to identifying privileged documents or documents where privacy must be protected. When looking for privileged documents it is important to note that many tools don’t make use of metadata. Documents that are missed may be technically relevant but not really important — you should look at a sample to see whether they are important.

Between presentations based on submitted papers there was a lunch where people separated into four groups to discuss specific topics. The first group focused on e-discovery users. Visualizations were deemed “nice to look at” but not always useful — does the visualization help you to answer a question faster? Another group talked about how to improve e-discovery, including attorney aversion to algorithms and whether a substantial number of documents could be missed by CAL after the gain curve had plateaued. Another group discussed dreams about future technologies, like better case assessment and redacting video. The fourth group talked about GDPR and speculated that the UK would obey GDPR.

DESI ended with a panel discussion about future directions for e-discovery. It was suggested that a government or consumer group should evaluate TAR systems. Apparently, NIST doesn’t want to do it because it is too political. One person pointed out that consumers aren’t really demanding it. It’s not just a matter of optimizing recall and precision — process (quality control and workflow) matters, which makes comparisons hard. It was claimed that defense attorneys were motivated to lobby against the federal rules encouraging the use of TAR because they don’t want incriminating things to be found. People working in archiving are more enthusiastic about TAR.

Following DESI (and other workshops conducted in parallel on the first day), ICAIL had three more days of paper presentations followed by another day of workshops. You can find the schedule here. I only attended the first day of non-DESI presentations. There are two papers from that day that I want to point out. The first is Effectiveness Results for Popular e-Discovery Algorithms by Yang, David Grossman, Frieder, and Yurchak. They compared performance of the CAL (relevance feedback) approach to TAR for several different classification algorithms, feature types, feature weightings, and with/without LSI. They used several different performance metrics, though they missed the one I think is most relevant for e-discovery (review effort required to achieve an acceptable level of recall). Still, it is interesting to see such an exhaustive comparison of algorithms used in TAR / predictive coding. They’ve made their code available here. The second paper is Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes by Conrad and Al-Kofahi. The authors analyze a large database of jury verdicts in an effort to determine the feasibility of building a system to give strategic litigation advice (e.g., potential award size, trial duration, and suggested claims) based on a data-driven analysis of the case.
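
For readers unfamiliar with the metric mentioned above, review effort at a target recall can be computed on a fully labeled test collection as follows. This is a minimal sketch with a hypothetical function name, not code from the paper, and it assumes the documents are reviewed in ranked order from the top.

```python
def review_effort_for_recall(ranked_labels, target_recall):
    """Number of top-ranked documents to review to reach target_recall.

    ranked_labels: list of bools ordered by descending relevance score,
                   True meaning the document is relevant.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0 or target_recall <= 0:
        return 0
    found = 0
    for depth, is_relevant in enumerate(ranked_labels, start=1):
        found += is_relevant
        if found / total_relevant >= target_recall:
            return depth
    return len(ranked_labels)

# Toy ranking with 4 relevant documents among 8.
ranked = [True, True, False, True, False, False, True, False]
effort_75 = review_effort_for_recall(ranked, 0.75)
effort_100 = review_effort_for_recall(ranked, 1.0)
```

Comparing this number across algorithms shows directly how much extra review a weaker ranking costs at the same recall.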

Highlights from the Northeast IG Retreat 2017

The 2017 Northeast Information Governance Retreat was held at the Salamander Resort & Spa in Middleburg, Virginia. After round table discussions, the retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend.

Enhancing eDiscovery With Next Generation Litigation Management Software
I couldn’t attend this

Legal Tech and AI – Inventing The Future
Machines are currently only good at routine tasks. Interactions with machines should allow humans and machines to do what they do best. Some areas where AI can aid lawyers: determining how long litigation will take, suggesting cases you should reference, telling how often the opposition has won in the past, determining appropriate prices for fixed fee arrangements, recruiting, or determining which industry to focus on. AI promises to help with managing data (e.g., targeted deletion), not just e-discovery. Facial recognition may replace plane tickets someday.

Zen & The Art Of Multi-Language Discovery: Risks, Review & Translation
I couldn’t attend this

NexLP Demo
The NexLP tool emphasizes feature extraction and use of domain knowledge from external sources to figure out the story behind the data. It can generate alerts based on changes in employee behavior over time. The company should have a policy allowing the scanning of emails to detect bad behavior. It was claimed that using AI on emails is better for privacy than having a human review random emails since it keeps human eyes away from emails that are not relevant.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Are Managed Services Manageable?
I couldn’t attend this

Cyber And Data Security For The GC: How To Stay Out Of Headlines And Crosshairs
I couldn’t attend this

The Office Is Out: Preservation And Collection In The Merry Old Land Of Office 365
Enterprise 5 (E5) has advanced analytics from Equivio. E3 and E1 can do legal hold but don’t have advanced analytics. There are options available that are not on the website, and there are different builds — people are not all using the same thing. Search functionality works on limited file types (e.g., Microsoft products). Email attachments are OK if they are from Microsoft products. It will not OCR PDFs that lack embedded text. What about emails attached to emails? Previously, it only went one layer deep on attachments. Latest versions say they are “relaxing” that, but it is unclear what that means (how deep?). User controls sync — are we really searching everything? Make sure you involve IT, privacy, info governance, etc. if considering transition to 365. Be aware of data that is already on hold if you migrate to 365. Start by migrating a small group of people that are not often subject to litigation. Test each data type after conversion.

How To Make Sense Of Information Governance Rules For Contractors When The Government Itself Can’t?
I couldn’t attend this

Judges, The Law And Guidance: Does ‘Reasonableness’ Provide Clarity?
This was primarily about the impact of the new Federal Rules of Civil Procedure. Clients are finally giving up on putting everything on hold. Tie document retention to business needs; you shouldn’t have to worry about sanctions. Document everything (e.g., why you chose specific custodians to hold). Accidentally missing one custodian out of a hundred is now OK. Some judges acknowledge the new rules but then ignore them. Boilerplate objections to discovery requests need to stop; keep notes on why you made each objection.

Beyond The Firewall: Cybersecurity & The Human Factor
I couldn’t attend this

The Theory of Relativity: Is There A Black Hole In Electronic Discovery?
The good about Relativity: everyone knows it, it has plug-ins, and moving from document to document is fast compared to previous tools. The bad: TAR 1.0 (federal judiciary prefers CAL). An audience member expressed concern that as Relativity gets close to having a monopoly we should expect high prices and a lack of innovation. Relativity One puts kCura in competition with service providers.

The day ended with a wine social.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.
