Monthly Archives: May 2018

TAR vs. Keyword Search Challenge

During my presentation at the NorCal eDiscovery & IG Retreat I challenged the audience to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics. Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics). I ran one of the searches against TAR in Clustify live during the presentation (Clustify’s “shadow tags” feature allows a full document review to be simulated in a few minutes using documents that were pre-categorized by human reviewers), but couldn’t do the rest due to time constraints. This article presents the results for all the queries submitted by the audience.

The audience had limited time to construct queries (working together in groups), they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so I’m not claiming the exercise was comparable to an e-discovery project. Still, it was entertaining. The topics are pretty simple, so a large percentage of the relevant documents can be found with a pretty simple search using some broad terms. For example, a search for “biology” would find 37% of the biology documents. A search for “law” would find 71% of the law articles. The trick is to find the relevant documents without pulling in too many of the non-relevant ones.

To evaluate the results, I measured the recall (percentage of relevant documents found) from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively). I’ve also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs. For the TAR results I used TAR 3.0 trained with two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, so a total of 202 training documents (no control set needed with TAR 3.0). To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
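The recall measurement and the TAR review budget described above can be sketched in a few lines of Python. This is only an illustration, not Clustify's code: the function names and the document IDs are hypothetical, and the ranked list is assumed to contain each document at most once.

```python
def recall_at_cutoff(ranked_ids, relevant_ids, cutoff):
    """Fraction of all relevant documents found in the first `cutoff` hits."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    found = sum(1 for doc_id in ranked_ids[:cutoff] if doc_id in relevant)
    return found / len(relevant)


def tar_review_budget(seed_docs=2, iterations=20, docs_per_iteration=10, cutoff=3000):
    """Split a fixed review budget into training docs and top-scoring docs."""
    training = seed_docs + iterations * docs_per_iteration
    return training, cutoff - training


# Hypothetical toy data: five ranked hits, four relevant documents overall.
ranked = ["d5", "d1", "d9", "d3", "d7"]
relevant = {"d1", "d3", "d8", "d10"}
print(recall_at_cutoff(ranked, relevant, 2))
print(tar_review_budget())
```

With the defaults, the budget function reproduces the numbers in the text: 2 seeds plus 20 iterations of 10 documents gives 202 training documents, leaving 2,798 top-scoring documents to reach a total review of 3,000.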

The search engine in Clustify is intended to help the user find a few seed documents to get active learning started, so it has some limitations. If the audience’s search query included phrases, they were converted to an AND search enclosed in parentheses. If the audience’s query included a wildcard, I converted it to a parenthesized OR search by looking at the matching words in the index and selecting only the ones that made sense (i.e., I made the queries better than they would have been with an actual wildcard). I noticed that there were a lot of irrelevant words that matched the wildcards. For example, “cell*” in a biology search would match cellphone, cellular, cellar, cellist, etc., but I excluded such words. I would highly recommend that people using keyword search check to see what their wildcards are actually matching–you may be pulling in a lot of irrelevant words. I removed a few words from the queries that weren’t in the index (so the words shown all actually had an impact). When there is an “a” and “b” version of the query, the “a” version was the audience’s query as-is, and the “b” query was tweaked by me to retrieve more documents.
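Checking what a wildcard actually matches is easy to script if you can export the list of words in your index. The sketch below is a rough illustration of the two conversions described above, not Clustify's implementation: the function names are hypothetical, the index is a made-up word list, and the exclusion list stands in for the manual judgment about which matches make sense.

```python
import fnmatch


def expand_wildcard(pattern, index_terms, exclude=()):
    """Return (all index words matching the wildcard, parenthesized OR clause).

    `exclude` holds the matches a human decided were irrelevant.
    """
    excluded = set(exclude)
    matched = sorted(t for t in index_terms if fnmatch.fnmatchcase(t, pattern))
    kept = [t for t in matched if t not in excluded]
    clause = "(" + " OR ".join(kept) + ")" if kept else ""
    return matched, clause


def phrase_to_and(phrase):
    """Convert a phrase into an AND search enclosed in parentheses."""
    return "(" + " AND ".join(phrase.split()) + ")"


# Hypothetical index contents; a real index would be far larger.
index = ["biology", "cell", "cellar", "celled", "cellist",
         "cellphone", "cells", "cellular"]
matched, clause = expand_wildcard(
    "cell*", index, exclude={"cellphone", "cellular", "cellar", "cellist"}
)
print(matched)  # review this list before trusting the wildcard
print(clause)
print(phrase_to_and("genetic marker"))
```

Printing the full match list before building the clause is the point of the exercise: it shows every word the wildcard would have pulled in.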

The tables below show the results. The actual queries are displayed below the tables. Discussion of the results is at the end.

Biology		Recall
Query	Total Matches	Top 3,000	Top 6,000	All Matches
1	4,407	34.0%	47.2%	47.2%
2	13,799	37.3%	46.0%	80.9%
3	25,168	44.3%	60.9%	87.8%
4a	42	0.5%		0.5%
4b	2,283	20.9%		20.9%
TAR		72.1%	91.0%

Law		Recall
Query	Total Matches	Top 3,000	Top 6,000	All Matches
5a	2,914	35.8%		35.8%
5b	9,035	37.2%	49.3%	60.6%
6	534	2.9%		2.9%
7	27,288	32.3%	47.1%	79.1%
TAR		62.3%	80.4%

1) organism OR microorganism OR species OR DNA

2) habitat OR ecology OR marine OR ecosystem OR biology OR cell OR organism OR species OR photosynthesis OR pollination OR gene OR genetic OR genome AND NOT (treatment OR generic OR prognosis OR placebo OR diagnosis OR FDA OR medical OR medicine OR medication OR medications OR medicines OR medicated OR medicinal OR physician)

3) biology OR plant OR (phyllis OR phylos OR phylogenetic OR phylogeny OR phyllo OR phylis OR phylloxera) OR animal OR (cell OR cells OR celled OR cellomics OR celltiter) OR (circulation OR circulatory) OR (neural OR neuron OR neurotransmitter OR neurotransmitters OR neurological OR neurons OR neurotoxic OR neurobiology OR neuromuscular OR neuroscience OR neurotransmission OR neuropathy OR neurologically OR neuroanatomy OR neuroimaging OR neuronal OR neurosciences OR neuroendocrine OR neurofeedback OR neuroscientist OR neuroscientists OR neurobiologist OR neurochemical OR neuromorphic OR neurohormones OR neuroscientific OR neurovascular OR neurohormonal OR neurotechnology OR neurobiologists OR neurogenetics OR neuropeptide OR neuroreceptors) OR enzyme OR blood OR nerve OR brain OR kidney OR (muscle OR muscles) OR dna OR rna OR species OR mitochondria

4a) statistically AND ((laboratory AND test) OR species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

4b) (species OR (genetic AND marker) OR enzyme) AND NOT (diagnosis OR treatment OR prognosis)

5a) federal AND (ruling OR judge OR justice OR (appellate OR appellant))

5b) ruling OR judge OR justice OR (appellate OR appellant)

6) amendments OR FRE OR whistleblower

7) ((law OR laws OR lawyer OR lawyers OR lawsuit OR lawsuits OR lawyering) OR (regulation OR regulations) OR (statute OR statutes) OR (standards)) AND NOT pending

TAR beat keyword search across the board for both tasks. The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search. In other words, if documents will be reviewed before production, TAR achieves better results (higher recall) with half as much document review compared to any of the keyword searches. The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.
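The comparisons in the paragraph above can be checked directly against the recall tables. Here is a small sketch that does so; the recall figures are copied from the tables, while the data layout and function names are my own and purely illustrative. Blank table cells are stored as None and skipped.

```python
# Recall in percent: (top 3,000, top 6,000, all matches); None = blank cell.
KEYWORD = {
    "biology": {
        "1": (34.0, 47.2, 47.2),
        "2": (37.3, 46.0, 80.9),
        "3": (44.3, 60.9, 87.8),
        "4a": (0.5, None, 0.5),
        "4b": (20.9, None, 20.9),
    },
    "law": {
        "5a": (35.8, None, 35.8),
        "5b": (37.2, 49.3, 60.6),
        "6": (2.9, None, 2.9),
        "7": (32.3, 47.1, 79.1),
    },
}
# TAR recall in percent: (top 3,000, top 6,000).
TAR = {"biology": (72.1, 91.0), "law": (62.3, 80.4)}


def best_keyword(topic, column):
    """Highest keyword-search recall in a column, ignoring blank cells."""
    values = [row[column] for row in KEYWORD[topic].values() if row[column] is not None]
    return max(values)


def check_claims(topic):
    """Compare TAR against the best keyword search for one topic."""
    tar_3k, tar_6k = TAR[topic]
    return {
        "tar_3k_beats_keyword_6k": tar_3k > best_keyword(topic, 1),
        "tar_6k_beats_keyword_all": tar_6k > best_keyword(topic, 2),
    }


for topic in KEYWORD:
    print(topic, check_claims(topic))
```

The queries with a blank top-6,000 cell all had fewer than 6,000 matches, so their all-matches recall covers that case and is included in the second comparison.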

Similar experiments were performed later with many similarities but also some notable differences in the structure of the challenge and the results. You can read about them here: round 2, round 3, round 4, round 5, and round 6.

Highlights from Ipro Innovations 2018

Highlights from the NorCal eDiscovery & IG Retreat 2018


The 2018 NorCal eDiscovery & IG Retreat was held at the Carmel Valley Ranch, location of the first Ing3nious retreat in 2011 (though the company wasn’t called Ing3nious at the time). It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.

Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than TAR. Results are posted here.

Information Governance In The Age Of Encryption And Ephemeral Communications
Facebook Messenger has an ephemeral mode, though it is currently only available to Facebook executives. You can be forced to decrypt data (despite the 5th Amendment) if it can be proven that you have the password. Ephemeral communication is a replacement for in-person communication, but it can look bad (like you have something to hide). 53% of email is read on mobile devices, but personal devices often aren’t collected. Slack is useful for passing institutional knowledge along to new employees, but general counsel wants things deleted after 30 days. Some ephemeral communication tools have archiving options. You may want to record some conversations in email–you may need them as evidence in the future. Are there unencrypted copies of encrypted data in some locations?

Blowing The Whistle
eDiscovery can be used as a weapon to drive up costs for an adversary. The plaintiff should be skeptical about claims of burden–has appropriate culling been performed? Do a meet and confer as early as possible. Examine data for a few custodians and see if more are needed. A data dump is when a lot of non-relevant docs are produced (e.g., due to a broad search or a search that matches an email signature). Do sampling to test search terms. Be explicit about what production formatting you want (e.g., searchable PDF, color, metadata).

Emerging Technology And The Impact On eDiscovery
There may be a lack of policy for new data sources. Text messages and social media are becoming relevant for more cases. Your Facebook info can be accessed through your friends. Fitbit may show whether the person could have committed the murder. IP addresses can reveal whether email was sent from home or work. The change to the Twitter character limit may break some collection tools–QC early on to detect such problems. Vendors should have multiple tools. Communicate about what tech is involved and what you need to collect.

Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Cloud computing (infrastructure, storage, productivity, and web apps) will cause conflict between EU privacy law and US discovery. AWS provides lots of security options, but it can be difficult to get right (must be configured correctly). Startups aim to build fast and don’t think enough about how to get the data out. Are law firm clients looking at cloud agreements and how to export data? Free services (Facebook, Gmail, etc.) spy on users, which makes them inappropriate for corporate use where privacy is needed. Slack output is one long conversation. What about tools that provide a visualization? You may need the data, not just a screenshot. Understand the limit of repositories–Office 365 limits to 10GB of PST at a time. What about versioning storage? It is becoming more common as storage prices decline. Do you need to collect all versions of a document? “Computer ate my homework” excuses don’t fare well in court (e.g., production of privileged docs due to a bad mouse click, or missing docs matching a keyword search because they weren’t OCRed). GDPR requires knowing where the users are (not where the data is stored). Employees don’t want their private phones collected, so sandbox work stuff.

Employing Intelligence – Both Human And Artificial (AI) – To Reduce Overall eDiscovery Costs
You need to talk to custodians–the org chart doesn’t really tell you what you need to know. Search can show who communicates with whom about a topic. To discover custodians who are involved but not known to the attorney, look at the data and interview the ground troops. Look for a period when there is a lack of communication. Use sentiment analysis (including emojis). Watch for strange bytes in the review tool–they may be emojis that can only be viewed in the original app. Automate legal holds as much as possible. Escalate to a manager if the employee doesn’t respond to the hold in a timely manner. Filter on metadata to reduce the amount that goes into the load file. Sometimes things go wrong with the software (trained on biased data, not finding relevant spreadsheets, etc.). QC to ensure the human element doesn’t fail. Use phonetic search on audio files instead of transcribing before search. Analyze data as it comes in–you may spot months of missing email. Do proof of concept when selecting tools.

Practical Discussion: eDiscovery Process With Law Firms, In-House And Vendor
Stick with a single vendor so you know it is done the same way every time. Figure out what your data sources are. Get social media data into the review platform in a usable form (e.g., Skype). Finding the existence of cloud data stores requires effort. How long is the cloud data being held (Twitter only holds the last 100 direct messages)? The company needs to provide the needed apps so employees aren’t tempted to go outside to get what they need.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.
