Monthly Archives: November 2018

TAR vs. Keyword Search Challenge, Round 4

This iteration of the challenge was performed during the Digging into TAR session at the 2018 Northeast eDiscovery & IG Retreat. The structure was similar to round 3, but the audience was bigger. As before, the goal was to see whether the audience could construct a keyword search query that performed better than technology-assisted review.

There are two sensible ways to compare performance: see which approach reaches a fixed level of recall with the least review effort, or see which approach reaches the highest level of recall with a fixed amount of review effort. A comparison of results that differ in both recall and review effort cannot show definitively which result is best without making arbitrary assumptions about the trade-off between recall and effort (this is why performance measures that mix recall and precision together, such as the F₁ score, are not sensible for ediscovery).
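To make the point about mixed measures concrete, here is a minimal sketch with made-up numbers (the two results, the population of 1,000 relevant documents, and the function names are all hypothetical, not data from the challenge). It scores two results that differ in both recall and review effort using the F-measure with two different weights.

```python
def f_beta(precision, recall, beta):
    """F-measure; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_recall(found_relevant, reviewed, total_relevant):
    """Precision and recall for a review that found `found_relevant` documents."""
    return found_relevant / reviewed, found_relevant / total_relevant


# Hypothetical population and results (illustrative assumptions only).
TOTAL_RELEVANT = 1000
p_a, r_a = precision_recall(500, 1000, TOTAL_RELEVANT)   # Result A: less effort, lower recall
p_b, r_b = precision_recall(900, 4000, TOTAL_RELEVANT)   # Result B: more effort, higher recall

for beta in (1, 3):
    score_a = f_beta(p_a, r_a, beta)
    score_b = f_beta(p_b, r_b, beta)
    preferred = "A" if score_a > score_b else "B"
    print(f"beta={beta}: A={score_a:.3f}  B={score_b:.3f}  preferred={preferred}")
```

With beta = 1 (the F₁ score) result A scores higher, while with beta = 3 result B scores higher. The ranking depends on the weight chosen, which is the arbitrary assumption about the trade-off described above.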

For the challenge we fixed the amount of review effort and measured the recall achieved, because that was an easier process to carry out under the circumstances. Specifically, we took the top 3,000 documents matching the search query, reviewed them (this was instantaneous because the whole population was reviewed in advance), and measured the recall achieved. That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed. If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.” If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.” The process was repeated with 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.
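The fixed-effort measurement can be sketched as follows. This is only an illustration under stated assumptions: the function name, the toy document IDs, and the idea that each approach yields an ordered list of documents to review are all hypothetical, and the sketch does not reproduce how the search engine or the TAR software actually ordered documents.

```python
def recall_at_cutoff(ranked_doc_ids, relevant_ids, cutoff):
    """Fraction of all relevant documents found among the first `cutoff` reviewed."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    reviewed = ranked_doc_ids[:cutoff]
    found = sum(1 for doc_id in reviewed if doc_id in relevant_ids)
    return found / len(relevant_ids)


# Toy example (hypothetical IDs): six documents in review order,
# four relevant documents in the whole population.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d8"}

toy_recall_3 = recall_at_cutoff(ranked, relevant, 3)  # d3 and d1 found: 2 of 4
toy_recall_6 = recall_at_cutoff(ranked, relevant, 6)  # d3, d1, d2 found: 3 of 4
print(toy_recall_3, toy_recall_6)
```

Because the whole population was reviewed in advance, the set of relevant documents is known, so the same calculation can be repeated at 3,000 and 6,000 documents for every query and for both TAR variants.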

Individuals in the audience submitted queries through a web form using smartphones or laptops, and I executed some of the queries in front of the audience (time was too limited to run them all). They could learn useful keywords from the documents matching the queries, tweak their queries, and resubmit them. Unlike a real ediscovery project, they had very limited time and no familiarity with the documents. The audience could choose to work on any of three topics: biology, medical industry, or law. In the results below, the queries are labeled with the submitters’ initials (some people gave only a first name, so there is only one initial) followed by a number if they submitted more than one query. Two queries were omitted because they had less than 1% recall (the participants apparently misunderstood the task). The queries that were evaluated in front of the audience were E-1, U, AC-1, and JM-1. The discussion of the results follows the tables, graphs, and queries.

Biology	Recall
Query	Top 3,000	Top 6,000
E-1	32.0%	49.9%
E-2	51.7%	60.4%
E-3	48.4%	57.6%
E-4	45.8%	60.7%
E-5	43.3%	54.0%
E-6	42.7%	57.2%
TAR 3.0 SAL	72.5%	91.0%
TAR 3.0 CAL	75.5%	93.0%

Medical	Recall
Query	Top 3,000	Top 6,000
U	17.1%	27.9%
TAR 3.0 SAL	67.3%	83.7%
TAR 3.0 CAL	80.7%	88.5%

Law	Recall
Query	Top 3,000	Top 6,000
AC-1	16.4%	33.2%
AC-2	40.7%	54.4%
JM-1	49.4%	69.3%
JM-2	55.9%	76.4%
K-1	43.5%	60.6%
K-2	43.0%	62.6%
C	32.9%	47.2%
R	55.6%	76.6%
TAR 3.0 SAL	63.5%	82.3%
TAR 3.0 CAL	77.8%	87.8%

E-1) biology OR microbiology OR chemical OR pharmacodynamic OR pharmacokinetic
E-2) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence
E-3) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis
E-4) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study
E-5) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table
E-6) biology OR microbiology OR pharmacodynamic OR cellular OR enzyme OR activation OR nucleus OR protein OR interaction OR genomic OR dna OR hematological OR sequence OR pharmacokinetic OR processes OR lysis OR study OR table OR research
U) Transplant OR organ OR cancer OR hypothesis
AC-1) law
AC-2) legal OR attorney OR (defendant AND plaintiff) OR precedent OR verdict OR deliberate OR motion OR dismissed OR granted
JM-1) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge
JM-2) Law OR legal OR attorney OR lawyer OR litigation OR liability OR lawsuit OR judge OR defendant OR plaintiff OR court OR plaintiffs OR attorneys OR lawyers OR defense
K-1) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena
K-2) Law OR lawyer OR attorney OR advice OR litigation OR court OR investigation OR subpoena OR justice
C) (law OR legal OR criminal OR civil OR litigation) AND NOT (politics OR proposed OR pending)
R) Court OR courtroom OR judge OR judicial OR judiciary OR law OR lawyer OR legal OR plaintiff OR plaintiffs OR defendant OR defendants OR subpoena OR sued OR suing OR sue OR lawsuit OR injunction OR justice

None of the keyword searches achieved higher recall than TAR when the amount of review effort was equal. All six of the biology queries were submitted by one person. The first query was evaluated in front of the audience, and his first revision to the query did help, but subsequent (blind) revisions tended to hurt more than they helped. For biology, review of 3,000 documents with TAR gave better recall than review of 6,000 documents with any of the queries. Only a single query was submitted for the medical industry, and it underperformed TAR substantially. Five people submitted a total of eight queries for the law category, and the audience had the best results for that topic, which isn’t surprising since an audience full of lawyers and litigation support people would be expected to be especially good at identifying keywords related to the law. Even the best law queries had lower recall with review of 6,000 documents than TAR 3.0 CAL achieved with review of only 3,000 documents, but three of the queries (JM-1, JM-2, and R) did achieve higher recall with review of 6,000 documents than TAR 3.0 SAL achieved with review of 3,000 documents.
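The comparisons for the law topic can be checked directly against the table. The sketch below copies the recall percentages from the law table above; the variable names are mine and only illustrative.

```python
# Recall (%) from the law table: (top 3,000, top 6,000).
law_recall = {
    "AC-1": (16.4, 33.2),
    "AC-2": (40.7, 54.4),
    "JM-1": (49.4, 69.3),
    "JM-2": (55.9, 76.4),
    "K-1": (43.5, 60.6),
    "K-2": (43.0, 62.6),
    "C": (32.9, 47.2),
    "R": (55.6, 76.6),
}
tar_sal = (63.5, 82.3)
tar_cal = (77.8, 87.8)

# Queries that beat either TAR variant with equal review effort.
beats_tar_equal_effort = sorted(
    q for q, (r3k, r6k) in law_recall.items()
    if r3k > min(tar_sal[0], tar_cal[0]) or r6k > min(tar_sal[1], tar_cal[1])
)

# Queries reviewed to 6,000 documents versus TAR reviewed to 3,000 documents.
beats_sal_with_double_effort = sorted(
    q for q, (_, r6k) in law_recall.items() if r6k > tar_sal[0]
)
beats_cal_with_double_effort = sorted(
    q for q, (_, r6k) in law_recall.items() if r6k > tar_cal[0]
)

print(beats_tar_equal_effort)        # no query qualifies
print(beats_sal_with_double_effort)  # JM-1, JM-2, R
print(beats_cal_with_double_effort)  # no query qualifies
```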

Highlights from the Northeast eDiscovery & IG Retreat 2018

The 2018 Northeast eDiscovery and Information Governance Retreat was held at the Salamander Resort & Spa in Middleburg, Virginia. It was a full day of talks with a parallel set of talks on Cybersecurity, Privacy, and Data Protection in the adjacent room. Attendees could attend talks from either track. Below are my notes (certainly not exhaustive) from the eDiscovery and IG sessions. My full set of photos is available here.

Strategies For Data Minimization Of Legacy Data
Backup and archiving should be viewed as separate functions. When it comes to spoliation (FRCP Rule 37), reasonableness of the company’s data retention plan is key. Over-preservation is expensive. There are not many cases on Rule 37 relating to backup tapes. People are changing their behavior due to the changes in the FRCP, especially in heavily regulated industries such as healthcare and financial services. Studies find that typically 70% of data has no business value and is not subject to legal hold or retention requirements for compliance. When using machine learning, you can focus on finding what to keep or what to get rid of. It is often best to start with unsupervised machine learning. Be mindful of destructive malware. To mitigate security risks, it is important to know where your data (including backup tapes) is. If a backup tape goes missing, do you need to notify customers (privacy)? To get started, create a matrix showing what you need to keep, keeping in mind legal holds and privacy (GDPR). Old backup tapes are subject to GDPR. Does the right to be forgotten apply to backup tapes? There is currently no answer. It would be hard to selectively delete data from the tapes, so maybe have a process that deletes during the restore. There can be conflicts between U.S. ediscovery and GDPR, so you must decide which is the bigger risk.

Preparing A Coordinated Response To Government Inquiries And Investigations
You might find out that you are being investigated when the FBI or another investigator approaches one of your employees. If that happens, get an attorney. Reach out to the investigator, take it seriously, and ask for a timeline. You may receive a broad subpoena because the investigator wants to ensure they get everything important, but you can often get them to narrow it. Be sure to retain outside counsel immediately. In one case a CEO negotiated search terms with a prosecutor without discussing custodians, so they had to search all employees. The prosecutor can’t handle a huge volume of data, so it should be possible to negotiate a reasonable production. In addition to satisfying the subpoena, you need to simultaneously investigate whether there is an ongoing problem that needs to be addressed. Is your IT group able to forensically preserve and produce the documents? You don’t want to mess up a production in front of a regulator, so get expertise in place early. Data privacy can be an issue. When dealing with operations in Europe, it is helpful to get employee consent in advance, because nobody wants to consent during an investigation. Beware of data residing in disparate systems in different languages. Google Translate is not very good; for example, you have to be careful about slang. Employees may try to cover their tracks. In one case an employee was using “chocolate” as an encoded way to refer to a payment. In another case an employee took a hammer to a desktop computer, though the hard drive was still recoverable. Look for gaps in email or anomalous email volume. Note that employees may use WhatsApp or Signal to communicate. The DOJ expects you to be systematic (e.g., use analytics) about compliance. See what data is available, even if it wasn’t subpoenaed, since it may help your side (email usually doesn’t).

Digging Into TAR
I moderated this panel, so I didn’t take notes. We challenged the audience to create a keyword search that would work better than technology-assisted review. Results are posted here.

Implementing Information Governance – Nightmare On Corporate America Street?
You need to weigh the value of the data against the risk of keeping it. What is your business model? That will dictate information governance. Domino’s was described as a technology company that happens to distribute hot bread. Unstructured data has the biggest footprint and the most rapid growth. Did you follow your policies? Your insurance company may be very picky about that when looking for a reason not to pay out. They may pay out and then sue you over the loss. Fear is a good motivator. Threats from the OCC or FDIC over internal data management can motivate change. You can quantify risk because the cost of having a data breach is now known. Info governance is utilization awareness, not just data management. Know where your data is. What about the employee that creates an unauthorized AWS account? This is the “shadow ecosystem” or “shadow IT.” One company discovered they had 50,000 collaborative SharePoint sites they didn’t know about. For info governance standards see The Sedona Conference and EDRM.

Technology Solution Update From Corporate, Law Firm And Service Provider Perspective
Artificial intelligence (AI) should not merely analyze; it should present a result in a way that is actionable. It might tell you how much two people talk, their sentiment, and whether there are any spikes in communication volume. AI can be used by law firms for budgeting by analyzing prior matters. There are concerns about privacy with AI. Many clients are moving to the cloud. Many are using private clouds for collaboration, not necessarily for utilizing large computing power. Office 365 is of interest to many companies. There was extensive discussion about the ediscovery analytics capabilities being added from the Equivio acquisition, and a demo by Marcel Katz of Microsoft. The predictive coding (TAR) capability uses simple active learning (SAL) rather than continuous active learning (CAL). It is 20 times slower in the cloud than running Equivio on premises. There is currently no review tool in Office 365, so you have to export the predictions and do the review elsewhere. Mobile devices create additional challenges for ediscovery. The time when a text message is sent may not match the time when it is received if the receiving device is off when the message is sent. Technology needs to be able to handle emojis. There are many different apps with many different data storage formats.

The ‘Team Of Teams’ Approach To Enterprise Security And Threat Management
Fast response is critical when you are attacked. Response must be automated because a human response is not fast enough. It can take 200 days to detect an adversary on the network, so assume someone is already inside. What are the critical assets, and what threats should you look for? What value does the data have to the attacker? What is the impact on the business? What is the impact on the people? Know what is normal for your systems. Is a large data transfer at 2:00am normal? Simulate a phishing attack and see if your employees fall for it. In one case a CEO was known to be in China for a deal, so someone impersonating the CEO emailed the CFO to send $50 million for the deal. The money was never recovered. Have processes in place, like requiring a signature for amounts greater than $10,000. If a company is doing a lot of acquisitions, it can be hard to know what is on their network. How should small companies get started? Change passwords, hire an external auditor, and make use of open source tools.

From Data To GRC Insight
Governance, risk management, and compliance (GRC) needs to become centralized and standardized. Practicing incident response as a team results in better responses when real incidents happen. Growing data means growing risk. Beware of storage of social security numbers and credit card numbers. Use encryption and limit access based on role. Detect emailing of spreadsheets full of data. Know what the cost of HIPAA violations is and assign the risk of non-compliance to an individual. Learn about the NIST Cybersecurity Framework. Avoid fines and reputational risk, and improve the organization. Transfer the risk by having data hosted by a company that provides security. Cloud and mobile can have big security issues. The company can’t see traffic on mobile devices to monitor for phishing.

Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Thoughts on e-discovery, computers, and software development.
