September | 2014 | Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

The East Coast eDiscovery & IG Retreat is a new addition to the series of retreats held by Chris La Cour’s company, Ing3nious. It was held at the Chatham Bars Inn in Chatham, Massachusetts. This is the fourth year Chris has been organizing retreats, with the number of retreats and diversity of themes increasing in recent years, but this is the first one held outside of California. As always, the venue was beautiful (more photos here), and the conference was informative and well-organized. My notes below capture only a small portion of the information presented. There were often two simultaneous sessions, so I couldn’t attend everything.

Keynote: Information Governance in a Predictive World

Big data isn’t just about applying the same old analysis to more data. It’s about real-time or near-real-time, action-oriented analysis. It allows asking new questions. Technology now exists to allow police to scan license plates while driving through a parking lot to check for stolen cars. Amazon may implement predictive shipping, where it ships a product to a customer before the customer orders it, which requires predictions with high confidence. Face recognition technology is already in use at airports to track who is entering and whether they belong there. When the speaker tweeted a complaint about an airline, he got a personalized reply from the airline on Twitter within 30 seconds, thanks to technology.

Proactive information governance will allow problems (sexual harassment, fraud) to be detected immediately so they can be corrected, instead of finding out later when there is a lawsuit. It is possible to predict when someone will leave a company: they may become short with people, or spend more time on LinkedIn.

Change will come to the Federal Rules of Civil Procedure in 2015. Rule 37(e) on sanctions for spoliation will change from “willful or bad faith” to “willful and bad faith” to encourage more deletion. Have a policy to delete as much as possible, follow it, and be prepared to prove that you follow it.

Case Study: The Swiss Army Knife Approach to eDiscovery

Predictive coding failed for a case where the documents were very homogeneous (keywords didn’t work either; the document set was large, but not too large for eyes-on review). It also failed when there were too many issue codes. Predictive coding had problems with Spanish documents where there were many different dialects. Predictive coding works well for prioritizing documents when there is a quick deposition schedule.

Email domain name filtering can be used to remove junk or to detect things like gmail accounts that may be relevant. Also, look for gaps in dates for emails — it could be that the person was on vacation, or maybe something was removed.

Clustering or near-dupe is useful to ensure consistency of redactions.

There is a benefit to reviewing the documents yourself to understand the case. Not a fan of producing documents you haven’t seen.

May need to do predictive coding or clustering in a foreign country to minimize the amount of data that needs to be brought to the U.S.

Talk to custodians about acronyms. Look at word lists — watch for unusual words, or how word usage corresponds to timing of events.

Real-World Impact of Information Governance on the eDiscovery Process

I couldn’t attend this one.

Future-think: How Will eDiscovery Be Different in 5-10 Years?

Cost cutting: Cull in house, and targeted collections. The big obstacle is getting people to learn technology. Some people still print email to read it.

Analysis of cases is accelerating. Even small cases are impacted by technology. For example, information about a car accident is recorded. In the future, ESI will be collected and a conclusion will be reached without a trial. Human memory is obsolete — everything will be ESI.

Personal devices may be subject to discovery. Employment agreements should make that clear in advance.

Privacy will be a big legal field. Information is even collected about children — what they eat at school and when they are on the bus.

Will schools be held liable for student loans if the school fails to predict that the student will fail?

There are concerns about security of the cloud.

Businesses should demand change to make things more efficient, like arbitration.

Information Governance – Teams, Litigation Holds and Best Practices

I couldn’t attend this one.

Recent Developments in Technology Assisted Review — Is TAR Gaining Traction?

Three panelists said predictive coding didn’t work for them for identifying privileged documents. One panelist (me) said he had a case where it worked well for priv docs. Although I didn’t mention it at the time, since I couldn’t recall the reference off the top of my head, there is a paper with a nice discussion of the issues around finding priv docs that also claims success at using predictive coding.

What level of recall is necessary? There seemed to be consensus that 75% was acceptable for most purposes, but people sometimes aim higher, depending on the circumstances (e.g., to ward off objections from the other side).

Is it OK to use keyword search to cull down the document population before applying predictive coding? Must be careful not to cull away too many of the relevant documents (e.g., the Biomet case).

There is a lot of concern about being required to turn over training documents (especially non-responsive ones) to the other side. I pointed out that it is not like turning over search terms. It is very clear whether or not a document matches a search query, but disclosing training documents does not tell what predictions a particular piece of predictive coding software will give. In fact, some software will (hopefully, rarely) fail to produce documents that are near-dupes of relevant training documents, so one should not assume that the disclosure of training documents guarantees anything about what will be produced. There was concern that disclosure of non-relevant training documents by some parties will set a bad precedent.

Top 5 Trends in Discovery for 2014

I couldn’t attend this one.

Recruiting the Best eDiscovery Team

Cybersecurity is a concern. It is important to vet service providers. Many law firms are not as well protected as one would like.

When required to give depositions about the e-discovery process, paralegals can do well. IT people tend to get stressed. Lawyers can be too argumentative.

Need a champion to encourage everyone to get things done.

Legal hold is often drafted by outside counsel but enforced by in-house counsel.

Don’t have custodians do their own self-collection (e.g., based on search terms), but may have IT do collection (less expensive than using an outside consultant, but must be able to explain what they did).

Information governance and changes to FRCP will reduce costs over the next five years.

There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.

Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant

Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant
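The rearranged form can be sketched as a small function. This is only an illustration: the function name `e_recall`, its parameter names, and the collection sizes in the example are assumptions made here, not values from the article.

```python
def e_recall(total_relevant, fraction_missed, total_culled):
    """eRecall = 1 - FractionMissed * TotalDocumentsCulled / TotalRelevant."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    # Estimated number of relevant documents among those culled by TAR.
    relevant_missed = fraction_missed * total_culled
    return 1.0 - relevant_missed / total_relevant


# Hypothetical collection (assumed for illustration): 1,000,000 documents
# at 1% prevalence, of which the TAR tool culled 970,000.
total_docs = 1_000_000
total_relevant = 0.01 * total_docs  # prevalence * total documents
estimate = e_recall(total_relevant, fraction_missed=0.0025, total_culled=970_000)
print(f"eRecall = {estimate:.4f}")  # prints eRecall = 0.7575
```

Both inputs are themselves estimates from samples, so the result is only as certain as the less certain of the two.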

It requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR involves using only random sampling for training, you may have a very good estimate of the prevalence of relevant documents in the full population by simply measuring it on your (potentially large) training set, so you multiply the prevalence by the total number of documents to get TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents to measure the percentage of them that were relevant, i.e., FractionMissed (commonly known as the false omission rate or elusion). How many documents would that sample need to contain?

To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities. Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents. For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents. With this approximation, we have:

eRecall = 1 – FractionMissed / Prevalence

It is the very small prevalence value in the denominator that is the killer: it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.
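The amplification can be written out explicitly. The sketch below treats the prevalence as exactly known, as the simplified equation does, and uses a prevalence of 1% purely as an illustrative value.

```latex
\[
\mathrm{eRecall} = 1 - \frac{\mathrm{FractionMissed}}{\mathrm{Prevalence}}
\quad\Longrightarrow\quad
\Delta\,\mathrm{eRecall} \approx \frac{\Delta\,\mathrm{FractionMissed}}{\mathrm{Prevalence}}
\]
\[
\text{With } \mathrm{Prevalence} = 0.01:\qquad
\Delta\,\mathrm{FractionMissed} = \pm 0.05\%
\;\Longrightarrow\;
\Delta\,\mathrm{eRecall} \approx \pm 5\%
\]
```

In other words, at 1% prevalence any uncertainty in FractionMissed is multiplied by 100 before it shows up in the recall estimate.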

Let’s try some specific numbers. Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%. Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result. If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):

FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)

Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence. Not a very tight error bar!
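For readers who want to reproduce this kind of calculation without an online calculator, here is a minimal sketch using only the Python standard library. It assumes that the "exact" interval is the Clopper-Pearson interval and finds the bounds by bisection. The function names are made up for this illustration and are not from any particular tool.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def exact_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k successes in n trials (assumed method)."""
    alpha = 1.0 - confidence

    def solve(cdf_k, target):
        # The binomial CDF decreases as p increases, so bisect on p in (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(cdf_k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


def e_recall_range(relevant_in_sample, sample_size, prevalence):
    """Map the FractionMissed interval onto eRecall = 1 - FractionMissed / Prevalence."""
    low_missed, high_missed = exact_interval(relevant_in_sample, sample_size)
    # A larger FractionMissed means lower recall, so the bounds swap.
    return 1 - high_missed / prevalence, 1 - low_missed / prevalence


low, high = e_recall_range(10, 4000, prevalence=0.01)
print(f"eRecall between {low:.1%} and {high:.1%}")
```

The result should land close to the 54% to 88% range quoted above. Calling `e_recall_range` with a different sample size shows how slowly the range narrows as the sample grows.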

If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:

FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)

Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.

For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall). Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.
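As a rough cross-check on the Direct Recall numbers, the sketch below uses the closed-form Wilson score interval. This is a substitute chosen here for brevity. It is not the exact calculator used above, so the bounds differ slightly. The function name and the `z` default are assumptions for this illustration.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval; z = 1.959964 corresponds to 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return center - half_width, center + half_width


# Documents to sample from the whole set to expect about 400 relevant ones
# at 1% prevalence.
prevalence = 0.01
relevant_needed = 400
sample_size = round(relevant_needed / prevalence)

# 300 of the 400 relevant documents were correctly predicted by the TAR system.
low, high = wilson_interval(300, 400)
print(f"sample size: {sample_size}, recall between {low:.1%} and {high:.1%}")
```

The Wilson bounds come out within a fraction of a percentage point of the exact range quoted above, which is close enough to confirm that about 400 relevant documents are needed for a roughly ±5% interval.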

So, the number of samples required for eRecall is about the same as the Direct Recall method if you require a comparable amount of certainty in the result. There’s no free lunch to be found here.

Highlights from the East Coast eDiscovery & IG Retreat 2014

eRecall: No Free Lunch