Category Archives: eDiscovery

Disclosing Seed Sets and the Illusion of Transparency

There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding.  This article explains why disclosing seed sets may provide far less transparency than people think.

The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.

Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling).  Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated.  If the requesting party is a competitor, the non-relevant training documents may reveal information that helps them compete.  Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.

Other people are far more qualified to debate the legal and strategic issues than I am.  Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think.  Some reasons why it does not:

  1. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies.  A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).”  Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”?  I doubt it.
  2. You cannot tell whether the seed set is sufficient if you don’t have access to the full document population.  There could be substantial pockets of important documents that are not represented in the seed set–how would you know?  The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on number of relevant documents, not their importance) of the predictions the training set will produce.  The requesting party cannot do that–they have no way of assessing adequacy of the training set other than wild guessing.
  3. You cannot tell whether the seed set is biased just by looking at it.  Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under or over represented?  If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance.  Would you notice that all the relevant training documents happen to contain the word “Friday”?  I doubt it.
  4. Suppose you see an important document in the seed set that was correctly tagged as being relevant.  Can you rest assured that similar documents will be produced?  Maybe not.  Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document.  I described how that could happen in this article.  How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
  5. Poor training doesn’t necessarily mean that relevant documents will be missed.  If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important.  Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query.  Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced.  A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough.  If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall.  Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
  6. One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents.  Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population?  Is the requesting party going to complain that there is no smoking gun in the training set?  The producing party can simply respond that there are no smoking guns in the full population.
  7. The seed set may have virtually no impact on the final result.  To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion).  If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training.  In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article).  On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used.  With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions.  This is iterated many times.  So all documents that are reviewed are used as training documents.  The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way.  Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
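
The cutoff mechanics described in point 5 can be sketched in a few lines of code.  This is a toy illustration with invented scores and labels, not any particular vendor’s implementation: a poorly trained system can still reach 100% recall if the cutoff is set low enough, but only by sending more documents to review.

```python
# Toy sketch: recall and review cost as a function of the relevance-score
# cutoff. All scores and relevance labels below are invented.

def recall_and_review_cost(scored_docs, cutoff):
    """scored_docs: list of (relevance_score, is_relevant) pairs."""
    produced = [(s, rel) for s, rel in scored_docs if s >= cutoff]
    total_relevant = sum(rel for _, rel in scored_docs)
    found = sum(rel for _, rel in produced)
    return found / total_relevant, len(produced)  # (recall, docs to review)

# a well-trained system ranks the relevant documents near the top...
well_trained = [(0.9, 1), (0.8, 1), (0.7, 0), (0.2, 0), (0.1, 0)]
# ...a poorly trained one scatters them, so full recall needs a lower
# cutoff and therefore more review
poorly_trained = [(0.9, 0), (0.6, 1), (0.5, 0), (0.4, 1), (0.3, 0)]

print(recall_and_review_cost(well_trained, 0.75))    # (1.0, 2)
print(recall_and_review_cost(poorly_trained, 0.35))  # (1.0, 4)
```

Both runs reach full recall; the difference shows up entirely in the number of documents that must be reviewed, which is the producing party’s cost.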

Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand.  It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly.  Recall isn’t the whole story–it measures the number of relevant documents found, not their importance.  It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found.  The point is that you should judge the production by analyzing the system’s output, not the training data that was input.

Highlights from the ACEDS 2015 E-Discovery Conference

The conference moved from Florida to Washington, D.C. this year.  It was two full days of talks, often with two simultaneous sessions.  Attendance seemed to be up compared to last year.  My notes below provide only a few highlights from the subset of the sessions that I was able to attend.

Keynote: Business as (Un)usual: Leveraging the Changing Legal Marketplace
Law firms need to do a better job of paying attention to what their clients want.  Clients want collaboration, teamwork, and compensation based on performance not credentials.  Move beyond e-discovery and help them avoid litigation.

The Journey of 1000 Terabytes Begins with a Single Email: A Step-by-Step Guide to Applied Information Governance
Success in IG requires that it be phased in and have C-level buy-in, budget, dedicated staff, process, and technology.  Who owns the data on gmail, Skype, etc.?  You must provide good tools internally or employees will use external tools like Slack, allowing data to leak out.  Make sure employees know the policy that their phones may be taken if necessary for e-discovery.  Employees need to periodically re-read and acknowledge IG policies.  Will BYOD die due to lack of separation between company and personal data?

You’ve Been Hacked, Now What?  The Justice Department gives Guidance on Data Breach Mitigation, Response and Ethical Conundrums
I couldn’t attend

E-Discovery for the Other 85 Percent: Achieving Proportionality and Defensibility in Small Cases
This session was on e-discovery for small cases.  There were several (sometimes controversial) tips for keeping costs down, including emailing custodian questionnaires instead of interviewing, collecting specific file types instead of imaging hard drives, viewing PST files in Outlook instead of processing (an audience member commented that the emails could be changed in Outlook and white text on a white background might be missed), and skipping privilege review if not needed (use a clawback agreement).  Judge Nolan said the biggest cost in e-discovery is judges, lawyers, and clients not knowing what they are doing.  She pointed out the Discovery Pilot Program.

Securing Client Data in a Post-Sony World: Shoring up Breach Points Among Clients, Law Firm and Vendor Partners
I couldn’t attend

Federal Judge Discusses E-Discovery Related Issues and Offers Guidance Regarding Persistent and Emerging Conundrums
Part of the session focused on e-discovery in criminal litigation, where the government is usually the producing party.  Judge Vanaskie said that criminal defense lawyers do seem to know e-discovery pretty well.  He mentioned that Apple asks to have its vendors’ ESI charges sealed.  He explained that “taxation” means having the losing party pay the winning party’s e-discovery costs.

Helping ACEDS Members in Transition
I couldn’t attend

Practical Tips on Reducing Corporate Litigation Risks and Costs
In-house counsel needs to understand the business to be seen as a partner rather than as an obstacle.  Analyze contracts and be careful about arbitration clauses with no e-discovery limit.  One panel member suggested shopping e-discovery vendors for the best price while another pointed out that relationships may be more important than price.  Reduce the risk of a Sony-like problem by deleting old data.  It may be wise to defend against frivolous lawsuits, even if defense is expensive compared to the amount of the suit, to build a reputation that will avoid getting sued over and over.  Deleting active data won’t help if off-site backups remain.  Law firms are not good on IG — they tend to over-preserve.  Might want to avoid integrated voicemail/email because you may end up having to do e-discovery on .WAV files.  Law firms are too slow on moving to technology-assisted review (TAR).

Moore’s Law, Artificial Intelligence and the Coming Impact of Technology on Law and Discovery
Driverless cars will impact truckers, taxi drivers, and people in the auto insurance industry.  Will AI replace lawyers?  The Singularity, the point when computers become smarter than people, is predicted to arrive as early as 2048.  CPUs are getting faster (Moore’s Law), storage is getting cheaper (Kryder’s Law), bandwidth is increasing (Nielsen’s Law), and the value of networks increases as the number of nodes increases (Metcalfe’s Law).  Moravec’s Paradox says that high-order tasks are easy to program but low-order tasks are hard.  Can computers be creative?

The EDRM eMSAT – E-Discovery Maturity Self Assessment Test
I couldn’t attend

Behind TAR’s ‘Vale’: How to Strike the Balance between Transparency, Disclosure and Cooperation
This was a somewhat contentious session.  One panelist said he would disclose the use of TAR (but thought you didn’t have to), but would expect to reveal no more than that.  Another panelist advocated the disclosure of seed sets and algorithms.  Another pointed out that disclosure of non-responsive seed documents could be bad if the requesting party is a competitor.  The argument that a seed set is work product may apply if the seed set is a judgmental sample (the specific documents chosen for training were picked by the lawyer), but may not apply for a random sample.

A Federal Judge’s Views Regarding E-Discovery Trends and Their Implications
New rules acknowledge that judges can shift costs when necessary.  There are many tools for proportionality (sampling, capping time and money spent).  Judge Grimm thought it was better for judges to be active so e-discovery problems could be avoided.  Sanctions for failure to preserve are only available if there was an intent to deprive the requesting party — this promotes reasonableness instead of having different rules in different jurisdictions.

Challenging the Assumptions, Claims, and Givens of TAR to Make It More Effective and Just
I was on the panel, so I didn’t take notes

Swimming in the Blender: Successfully Navigating, Surviving (and maybe even surfing) E-Discovery Career Challenges
I couldn’t attend

Courts’ Vetting TAR Technologies and Methodologies: What Is the Proper Standard of Review?
Judge Waxse said lawyers want to drag out the case (billable hours), whereas judges and clients want a just and speedy trial — judges should try to involve the client to move things along.  Judge Vanaskie said active management is imperative (agreeing with Judge Grimm’s earlier statement).  He doesn’t like special masters — he wants to know what is going on.  Judge Waxse said that “zealous advocacy” is gone and never applied to e-discovery — the culture needs to be fixed (regarding cooperation).  He also said that Daubert applies to all proceedings, including e-discovery, not just the trial (disagreeing with Judge Peck’s writing on this).  On the topic of seed sets being work product, Judge Vanaskie questioned whether the seed set itself really reveals the thought process that went into selecting the seed documents.  Judge Nolan said that keyword search queries are not work product.  Judge Levie said that disclosing seed sets was not one-size-fits-all.  On use of special masters, Judge Levie said whether there was private communication between the judge and the special master varies.

Found Money: Raising E-Discovery Related Realization Rates
I couldn’t attend

Judges’ Review of E-Discovery’s New Rules, Rulings and Requirements
The new rules move proportionality back to where it originally was.  Judge Waxse said the current rules cause problems because you can’t assess the proportionality factors (amount in controversy, needs of the case, etc.) early in the case.  Judge Rodriguez questioned where the boundary is between a judge managing the e-discovery and advocating for a side.  He also said the amended rules encourage face-to-face with the judge.  Most sanctions for preservation failure involve both bad faith and dishonesty.  Lawyers are conservative — will they really tell clients to lighten up on preservation under the new rules?  Preserved data can help your case, too.  Retention policy should depend on content, not just format, so deletion of all email after some number of days (e.g., 30 or 75) is bad.

Exploiting the New Rules, Rulings and Requirements
I couldn’t attend

Reengineering How E-Discovery is Practiced (and Managed) Using Data, Dash-Boarding and even Dynamic Organizations
I couldn’t attend

“ED” Talks: Industry Experts Tell Us What’s On Their Mind.  And We All React
I couldn’t attend

Highlights from DESI VI

DESI VI was held at the University of San Diego.  The workshop on Discovery of Electronically Stored Information was attended by an enthusiastic and fairly technical crowd that included professors, grad students, industry researchers, lawyers and practitioners.  I won’t go into details about the talks because full papers (and sometimes slides) are available at the workshop’s website.  The slides from my presentation are here.  You can find my full set of photos from DESI here, and photos from nearby Presidio Park here.

Various topics were discussed over lunch and discussion leaders summarized the discussions at the end of the day.  I’ll provide my notes on those discussions.  On the topic of what clients want from service providers, they are interested in hot documents that are unforeseen, benchmarks on the service provider’s work, transparency on how the service provider did it, and development of a narrative for the case.  There was a feeling that there should be standards for deduping.  Attorneys need to know what metadata to ask for.  On the topic of natural language processing, there is concern about how to incorporate knowledge from depositions into the process, and how to recognize that “take me out for tea” might be an encoded solicitation or offer of a bribe.  Stories and context are important, and they may not be captured by a single document.  On the topic of classification, it was noted that categories can change over time.  Should the user adapt to the system or vice versa?  It was mentioned that technology can make you a better lawyer because it allows you to know the case better than your opponent.

Using Extrapolated Precision for Performance Measurement

This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Precision,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015 (slides now available here).  The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall (e.g., some predictive coding systems produce a binary yes/no prediction instead of a relevance score, so the user cannot select the recall that will be achieved).

High recall (finding most of the relevant documents) is important in e-discovery for defensibility.  High precision is desirable to ensure that there aren’t a lot of non-relevant documents mixed in with the relevant ones (i.e., high precision reduces the cost of review for responsiveness and privilege).  Making judgments about the relative performance of two predictive coding systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F1 score that combine precision and recall into a single number to allow performance comparisons.  Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for e-discovery (I’ve written about problems with the F1 score before).  To understand the problem, it is useful to look at how F1 varies as a function of the recall where it is measured.  Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F1 score corresponding to each point on the precision-recall curve superimposed:

[Figure: f1_compare2]

If we pick a single point from the precision-recall curve and compute the value of F1 for that point, the resulting F1 is very sensitive to the precision-recall point we choose.  F1 is maximized at 46% recall in the graph on the right, which means that the trade-off between precision and recall that F1 deems to be reasonable implies that it is not worthwhile to produce more than 46% of the relevant documents for that task because precision suffers too much when you push to higher recall.  That is simply not compatible with the needs of e-discovery.  In e-discovery, the trade-off between precision (cost) and recall should be dictated by proportionality, not by some performance measure that is oblivious to the value of the case.  Other problems with the F1 score are detailed in the paper.
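
For concreteness, F1 is the harmonic mean of precision and recall, F1 = 2PR/(P+R).  Computing it at two hypothetical points on the same precision-recall curve (the numbers are invented for illustration, loosely mimicking the hard task described above) shows how strongly the score depends on where along the curve it is measured:

```python
# F1 at two points on one hypothetical precision-recall curve.
# The (precision, recall) pairs are invented for illustration.

def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

near_peak   = (0.60, 0.46)  # (precision, recall) near F1's maximum
high_recall = (0.15, 0.90)  # pushing to high recall costs precision

print(round(f1(*near_peak), 2))    # 0.52
print(round(f1(*high_recall), 2))  # 0.26
```

Same system, same curve, yet the score is cut in half simply because the measurement was taken at a recall level that e-discovery actually cares about.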

The strong dependence that F1 has on recall as we move along the precision-recall curve means that it is easy to draw wrong conclusions about which system is performing better when performance is measured at different levels of recall.  This strong dependence on recall occurs because the contours of equal F1 are not shaped like precision-recall curves, so a precision-recall curve will cut across many contours.   In order to have the freedom to measure performance at recall levels that are relevant for e-discovery (e.g., 75% or higher) without drawing wrong conclusions about which system is performing best, the paper proposes a performance measure that has constant-performance contours that are shaped like precision-recall curves, so the performance measure depends much less on the recall level where the measurement is made than F1 does. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made.  This graph compares the constant-performance contours for F1 to the measure proposed in the paper:

[Figure: f1_x_contours]

Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall.  This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:
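
The mechanics can be sketched with a toy curve family.  The paper derives its own family of idealized precision-recall curves; the one-parameter family below, P(r) = 1 / (1 + a·r/(1 − r)), is purely an illustrative stand-in that I am assuming for this sketch, chosen only because it starts near perfect precision at low recall and falls toward zero at full recall:

```python
# Illustrative extrapolation of a measured (recall, precision) point to a
# target recall by sliding along an idealized curve through that point.
# The curve family P(r) = 1 / (1 + a * r / (1 - r)) is an assumption made
# for this sketch, NOT the family derived in the paper.

def extrapolated_precision(recall, precision, target_recall=0.75):
    # fit the one free parameter so the curve passes through the point
    a = (1 / precision - 1) * (1 - recall) / recall
    # read the fitted curve off at the target recall
    odds = target_recall / (1 - target_recall)
    return 1 / (1 + a * odds)

# two systems measured at different recall levels become comparable
print(round(extrapolated_precision(0.50, 0.80), 2))  # 0.57
print(round(extrapolated_precision(0.90, 0.40), 2))  # 0.67
```

Note that the second system measures lower precision, but only because it was measured at much higher recall; once both are extrapolated to 75% recall it comes out ahead (under this toy curve family).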

[Figure: x_extrapolate]

Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:

[Figure: x_compare2]

The blue performance curves are much flatter than the red F1 curves from the first figure, so the value is much less sensitive to the recall level where it is measured.  As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the document review needed (excluding training and testing) to reach 75% recall.  For more details, read the paper or attend my talk at DESI VI.

Highlights from the NorCal eDiscovery & Information Governance Retreat 2015

The NorCal eDiscovery & Information Governance Retreat is part of the series of retreats held by Chris La Cour’s company, Ing3nious.  This one was held at the Meritage Resort & Spa in Napa, California.  As always, the venue was beautiful, the food was good, and the talks were informative.  You can find all of my photos from the retreat and the nearby Skyline Wilderness Park here.  My notes below offer a few highlights from the sessions I attended.  There were often two sessions occurring simultaneously, so I couldn’t attend everything.

Keynote: Only the Paranoid Survive: What eDiscovery Needs to Survive the Big Data Tsunami

The keynote was by Alex Ponce de Leon from Google.  He made the point that there is a difference between Big Data, which can be analyzed, and “lots and lots of data.”  For information governance, lots of data is a problem.  The excitement over Big Data (he showed this graph and this one) is turning people into digital hoarders–they are saving things that will never be useful, which causes problems for ediscovery.  He mentioned that DuPont analyzed the documents they had to review for a case and found that 50% of them should have been discarded according to their retention policy, resulting in $12 million in document review that wouldn’t have been necessary if the retention policy had been followed (this article discusses it).  Legal and ediscovery people need to take the lead in getting companies to not keep everything.

Establishing In-House eDiscovery Playbooks, Procedures, Tool Selection, and Implementation

There was some discussion about corporations acquiring e-discovery tools and whether that caused concerns from outside counsel about what was being done, since they must sign off on it.  Ben Robbins of LinkedIn said they haven’t had significant problems with that.  The panel emphasized the importance of documenting procedures and making sure that different types of matters were addressed individually.

Cybersecurity…it’s what’s for dinner. So, what’s the recipe and who’s the head chef?

I couldn’t attend this one.

A Look Back on Model eDiscovery Orders

Judge Rader’s e-discovery model order (here is a related article), which limits discovery to five custodians and five search terms per custodian, was discussed.  It was motivated by a need to curtail patent trolls in the Eastern District of Texas who were using ediscovery costs as a weapon.  It was mentioned that discovery of backups may become more feasible as people move away from using tape for backups.  Producing reports rather than raw databases was discussed, with the point being made that standard reports are usually okay, but custom reports often don’t match the requesting party’s expectations and cause conflicts.  Model orders go out the window when dealing with government agencies–many want everything.

Information Governance and Security: Keeping Security in Sight

I couldn’t attend this one.

How to Leverage Information Governance for Better eDiscovery

I couldn’t attend this one.

Avoiding Land Mines in TAR

I was on this panel, so I didn’t take notes.

Managing BYOC/D and Wearables in International eDiscovery and Investigations

I couldn’t attend this one.

Social Media – eDiscovery’s “friend”?

An employee may see a social media account as personal, but it must be preserved (possibly for years).  Need to remind the employee of the hold.  Don’t friend represented opposition, but okay to friend witnesses if you are up front about why.  Lawyers can friend judges, but not if they have a case before them.  You should read your judge’s tweets to see if there is a sign of bias.  Getting data from a social media company is difficult.  Look to see if jurors are tweeting about the case.

Inside the Threat Matrix: Cyber Security Risks, Incident Response, and the Discovery Impact

I couldn’t attend this one.

Resolving the Transparency Paradox

TAR 1.0 has a lot of foreign concepts like “stabilization” (optimal training), whereas TAR 2.0 (continuous active learning) is more like traditional review.  Hal Marcus of Recommind mentioned that when he surveyed the audience at another event, many said they had used predictive coding but few disclosed doing so.  The panel discussed allowing the requesting party to provide a seed set to make them feel better about using TAR, or raising the possibility of using TAR early on to see if there is pushback.  The Coalition of Technology Resources for Lawyers has a database of case law on predictive coding that was mentioned.

Judicial Panel

Judges now get ediscovery.  They see a lack of communication.  Responding parties object to everything.  Judges are unlikely to interfere when the parties have a thought-out ediscovery plan.  Inside counsel are taking more control to reduce costs.  The RAND study “Where the Money Goes” was mentioned.  Regarding cost shifting, an attorney may choose to pay to have more control.

The Single Seed Hypothesis

This article shows that it is often possible to find the vast majority of the relevant documents in a collection by starting with a single relevant seed document and using continuous active learning (CAL).  This has important implications for making review efficient, making predictive coding practical for smaller document sets, and putting eyes on relevant documents as early as possible, perhaps leading to settlement before too much is spent on document review.  It also means that meticulously constructing seed sets and arguing about them with opposing counsel is probably a waste of time if CAL is used.

In one of the sessions at the ACEDS 2014 conference, Bill Speros advocated using judgmental sampling (e.g., keyword search) to find relevant documents as training examples for predictive coding, rather than random sampling, which will not find many relevant documents if prevalence is low.  I agreed with the premise, but thought he should warn about the possibility of bias, and about the fact that any probability estimates or relevance scores generated by the classification algorithm could be very distorted.  To illustrate the problem of bias, I decided to do an experiment when I got home: start with a single relevant training document and see how many of the relevant documents the system could find.  I expected it to find only the subset of relevant documents that were similar to the single seed document, showing that a seed set lacking good coverage of the various relevant concepts for the case (i.e., a biased seed set) could miss pockets of relevant documents.

I probably would have found what I expected, except that I used continuous active learning (CAL) instead of simple passive learning (SPL): each time I pulled a batch of documents predicted most likely to be relevant and reviewed them, I allowed the system to learn from the tags I applied so it could make better predictions for the next batch (that approach happened to be more convenient with our software at the time).  What I found was that as I pulled more and more batches of documents that were predicted to be relevant, allowing the system to update its predictions as I went, it continued to find relevant documents until I had well over 90% recall.  It never “got stuck on an island” where it couldn’t reach the remaining relevant documents because they were too different from the documents it had already seen.  It has taken me a year to get around to writing about this because I wanted to do more testing.
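
The loop I ran can be sketched abstractly.  This toy version scores documents by word overlap with the reviewed-relevant set, which is a crude stand-in for a real classifier; the documents, labels, and batch size are all invented.  The point is only to show the review-then-retrain cycle of CAL:

```python
# Toy CAL loop: start from one relevant seed, repeatedly pull the
# top-scored unreviewed batch, "review" it (an oracle of true labels
# stands in for the human reviewer), and feed the tags back in.
# Documents are bags of words; the scoring is deliberately crude.

def score(doc, rel_words, nonrel_words):
    return len(doc & rel_words) - len(doc & nonrel_words)

def cal(docs, labels, seed, batch_size=2):
    """docs: {id: set of words}; labels: {id: bool}; seed: a relevant id."""
    reviewed = {seed}
    rel_words, nonrel_words = set(docs[seed]), set()
    found = [seed]  # relevant documents, in the order discovered
    while len(reviewed) < len(docs):
        batch = sorted((d for d in docs if d not in reviewed),
                       key=lambda d: score(docs[d], rel_words, nonrel_words),
                       reverse=True)[:batch_size]
        for d in batch:
            reviewed.add(d)
            if labels[d]:            # learn from each reviewed tag
                rel_words |= docs[d]
                found.append(d)
            else:
                nonrel_words |= docs[d]
    return found

docs = {1: {"shred", "files"}, 2: {"shred", "documents"},
        3: {"destroy", "documents"}, 4: {"lunch", "menu"},
        5: {"lunch", "friday"}}
labels = {1: True, 2: True, 3: True, 4: False, 5: False}
print(cal(docs, labels, seed=1))  # [1, 2, 3]
```

In the toy the loop exhausts the whole collection; in practice you would stop once batches come back with few or no relevant documents.  Note how document 3 shares no words with the seed but is found anyway, because reviewing document 2 taught the system about “documents”.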

Weak Seed

Could the single seed document I picked have been unusually potent?  After achieving 90% recall I randomly selected one of the relevant documents the system didn’t find and started over using that document as the single seed document.  If the system had so much trouble finding that document, it must surely be rather different from the other relevant documents and would serve as a very weak seed.  I was able to hit 90% recall with CAL even with the single weak seed.  The figure below compares the single random seed (left) to the single weak seed (right) for passive learning (top row) and continuous active learning (bottom row).  Each bar represents a batch of documents, with the number of documents that were actually relevant represented in orange.  Click the figure for a larger view.

single_seed_random_v_weak_SPL_v_CAL

The weak seed is seen to be quite ineffective with passive learning (upper right graph).  It finds a modest number of relevant documents during the first three batches (no learning between batches).  After three batches there are no relevant documents left that are remotely similar to the seed document, so the system only finds relevant documents by tripping over them at random (prevalence is 2.6%).  The single random seed on the left did much better with passive learning than the weak seed, but it still ran out of gas while leaving many relevant documents undiscovered.  Both seeds worked well with CAL.  The first bar in the CAL chart for the weak seed shows it finding only a modest number of relevant documents (the same as SPL), but when it analyzes those documents and learns from them it is able to do much better with the second batch.  The initial disadvantage of the weak seed is quickly erased.

Wrong Seed

If a weak seed works, what about a seed that is totally wrong?  I picked a random document that was not remotely close to being relevant and tagged it as relevant.  To make things a little more interesting, I also tagged the random seed document from the test above as non-relevant.  So, my seed set consisted of two documents telling the system that non-relevant documents are relevant and relevant documents are non-relevant.  When I ask the system to give me batches of documents that are predicted to be relevant it will surely give me a bunch of non-relevant documents, which I would then tag correctly.  Will it be able to find its way to the relevant documents in spite of starting out going in the completely wrong direction?  Here is the result:

single_wrong_seed

As expected, the first batch was purely non-relevant documents, but it was able to learn something from them–it learned what types of documents to avoid.  In the second batch it managed to stumble across a single relevant (and correctly tagged!) document.  The single seed hypothesis says that that single relevant document in the second batch should be enough to find virtually all of the relevant documents, and it was.  It hit 97% recall in the graph above (before I decided to stop it).  Also, the software warned me when the calculation was finished that the two seed documents appeared to be tagged incorrectly–my attempt at sabotage was detected!

Before proceeding with more experiments, I want to mention a point about efficiency.  When the system has seen only a single relevant training document it doesn’t know which words in the document made it relevant, so the first batch of predictions may not be very good–it may pick documents because they contain certain words seen in the seed document when those words were not particularly important.  As a result, it is more efficient to do a few small batches to allow it to sort out which words really matter before moving on to larger batches.  Optimal batch size depends on the task, the document set, and the classification algorithm, but small is generally better than large.

Disjoint Relevance

Maybe the relevant documents for the categorization task performed above were too homogeneous.  Could it find everything if my definition of relevance included documents that were extremely different from each other?  To test that I used a bunch of news articles and I defined a document to be relevant if it was about golf or biology.  There were no articles that were about both golf and biology.  If I seeded it with a single biology article, could it find the golf articles and vice versa?  This figure shows the results:

single_seed_bio_or_golf

It achieved 88% recall after reviewing 5.5% of the document population in both cases (prevalence was 3.6%).  The top graph was seeded with a single random biology article, whereas the bottom one was seeded with a single random golf article.  Golf articles are much easier for an algorithm to identify (once it has seen one in training) than biology articles.  The bottom graph should serve as a warning to not give up too soon if it looks like the system is no longer finding relevant documents.

How did it go from finding articles about biology to articles about golf?  The first golf article was found because it contained words like Watson, stole, Stanford, British, etc.  None of the words would be considered very strong indicators of relevance for biology.  When the low-hanging fruit has already been picked, the algorithm is going to start probing documents containing words that have been seen in relevant documents but whose importance is less certain.  If that leads to a new relevant document, the system is exposed to new words that may be good indicators of relevance (e.g., golf-related words in this case), leading to additional relevant documents in later batches.  If the documents it probes turn out to be non-relevant, it learns that those words weren’t good indicators of relevance and it heads in a different direction.  You can think of it as being like the Six Degrees of Kevin Bacon game–the algorithm can get from the single seed document to virtually any relevant document by hopping from one relevant document to another via the words they have in common, discovering new words that allow connecting to different documents as it goes.

Performance

If it is possible to find most of the relevant documents from a single seed, is it an efficient approach?  The figure below addresses that question.  The arrows indicate various recall levels.

optimal_SPL_v_single_seed_CAL

The first graph above shows SPL with the optimal amount of random training to reach 75% recall with the least total document review.  The first eight batches are the random training documents–you can see that those batches contain very few relevant documents.  After the eight training batches, documents that are predicted to be relevant are pulled in batches.  For SPL, the system is not allowed to learn beyond the initial random training documents.  The second graph shows CAL with the same set of random training documents.  You can see that it reached 75% recall more quickly, and it reached 88% recall with the amount of document review that SPL took to reach 75% recall.  The final graph shows CAL with a single seed.  You can see in the figure above that two small batches of documents predicted to be relevant were reviewed before moving to full-sized batches.

The figure shows that the single seed CAL result usually hit a recall level two or three batches later than the more heavily trained CAL result, but it also had nearly eight fewer batches of “training” data (I’m putting quotes around training here because all reviewed documents are really training documents with CAL–the system learns from all of them), so the improvement from the random training data (2 or 3 fewer batches of “review”) wasn’t sufficient to cover the cost (8 more batches of “training”).  The relative benefit of training with randomly selected documents may vary depending on the situation (e.g., reducing the “review” phase at the expense of more “training” for a larger document collection may be more worthwhile), but at least in the example above random sampling for training isn’t worthwhile beyond finding the first relevant seed document, which could probably be found more efficiently with keyword search.  Judgmental sampling may be worthwhile if it is good at finding a diverse set of relevant documents while avoiding non-relevant ones.

The table below shows the proportion of the document set that must be reviewed, including training documents, to reach 75% recall for several different categorization tasks with varying prevalence and difficulty.  In each case SPL was trained with a set of random documents with size optimized to achieve 75% recall with minimal review.  The result called simply “CAL” uses the same random training set as the SPL result but allows learning to continue when batches of relevant documents are pulled.  It would be unusual to use a large amount of random training documents with CAL, rather than using judgmental sampling, but I wanted to be able to show how much CAL improves on SPL with the same seed set and then show the additional benefit of reducing the seed set down to a single relevant document.

Task  Prevalence   SPL     CAL    CAL rand SS  CAL weak SS
  1      6.9%     10.9%    8.3%      7.7%         7.6%
  2      4.1%      4.3%    3.7%      3.5%         3.5%
  3      2.9%     16.6%   12.5%      8.6%         8.8%
  4      1.1%     10.9%    6.7%      3.2%         3.2%
  5      0.68%     1.8%    1.8%      0.8%         0.9%
  6      0.52%    29.6%    8.4%      5.2%         7.1%
  7      0.32%    25.7%   17.1%      2.6%         2.6%

SPL_vs_CAL_performance

In every case CAL beat SPL, and a single seed (whether random or weak) was always better than using CAL with the full training set that was used for SPL.  Of course, it is possible that CAL with a smaller random seed set or a judgmental sample would be better than CAL with a single seed.

Short Documents

Since the algorithm hops from one relevant document to another by probing words that the documents have in common, will it get stuck if the documents are short because there are fewer words to explore?  To test that I truncated each document at 350 characters, being careful not to cut any words in half.  With an average of 95% of the document text removed, there will surely be some documents where all of the text that made them relevant is gone, so they’ll be tagged as relevant but there is virtually nothing in the text to justify considering them to be relevant, which will make performance metrics look bad.  This table gives the percentage of the document population that must be reviewed, including any training, to reach 75% recall compared to SPL (trained with optimal number of random docs to reach 75% recall):

Docs            Prevalence   SPL    CAL rand SS  CAL weak SS
Full Docs          2.6%      5.0%       3.0%        3.1%
Truncated Docs     2.6%     19.4%       7.1%        6.9%

The table shows that CAL with a single seed requires much less document review than SPL regardless of whether the documents are long or short, and even a weak seed works with the short documents.

How Seed Sets Influence Which Documents are Found

I’ve shown that you can find most of the relevant documents with CAL using any single seed, but does the seed impact which relevant documents you find?  The answer is that it has very little impact.  Whatever seed you start with, the algorithm is going to want to move toward the relevant documents that are easiest for it to identify.  It may take a few batches before it gets exposed to the features that lead it to the easy documents (e.g., words that are strong indicators of relevance but are also fairly common, so it is easy to encounter them and they lead to a lot of relevant documents), but once it encounters the relevant documents that are easy to identify they will quickly overwhelm the few oddball relevant documents that may have come from a weak seed.  The predictions that generate later batches are heavily influenced by the relevant documents that are easy for the algorithm to find, and each additional batch of documents and associated learning erases more and more of the impact of the starting seed.  To illustrate this point, I ran calculations for a difficult categorization task (relevant documents were scattered across many small concept clusters) and achieved exactly 75% recall using various approaches.  I then compared the results to see how many documents the different approaches had in common.  Each approach found 848 relevant documents.  Here is the overlap between the results:

Row  Comparison                                           Num Relevant Docs in Common
 1   Algorithm 1: CAL Rand SS v. CAL Weak SS                         822
 2   Algorithm 2: CAL Rand SS v. CAL Weak SS                         821
 3   Algorithm 1 CAL Rand SS v. Algorithm 2 CAL Rand SS              724
 4   Algorithm 1: SPL Rand1 v. SPL Rand2                             724
 5   Algorithm 1: SPL Rand1 v. CAL Rand SS                           725
 6   Algorithm 1: SPL Rand1 v. SPL Biased Seed                       708
 7   Algorithm 1: CAL Rand SS v. CAL Biased Seed                     824

The maximum possible number of documents that two calculations can have in common is 848 (i.e., all of them), and the absolute minimum is 565 because if one approach gives 75% recall a second approach can find at most the 25% of the full set of relevant documents the first didn’t find and then it must resort to finding documents that the first approach found.  If two approaches were completely independent (think of one approach as picking relevant documents randomly), you would expect them to have 636 documents in common.  So, it is reasonable to expect the numbers in the right column of the table to lie between 636 and 848, and they probably shouldn’t get very close to 636 since no two predictive coding approaches should be completely independent because they will detect many of the same patterns that make a document relevant.
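
The arithmetic behind those bounds is simple enough to check (the total of 1,131 relevant documents below is implied by 848 found being 75% recall):

```python
found = 848                     # relevant docs each approach finds at 75% recall
recall = 0.75
total = round(found / recall)   # about 1,131 relevant documents in total

missed = total - found          # 283 docs a second approach could find instead
min_overlap = found - missed    # everything else must be shared: 565

# If the second approach picked its 75% independently, each of the first
# approach's documents would appear in it with probability 0.75.
expected_independent = round(recall * found)  # 636
```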

Row 1 of the table shows that CAL with the same algorithm and two different single seeds, one random and one weak, gives nearly the same set of relevant documents, with 822 of the 848 relevant documents found being the same.  Row 2 shows that the result from Row 1 also holds for a different classification algorithm.  Row 3 shows that if we use two different classification algorithms but use the same seed, the agreement between the results is much lower at just 724 documents.  In other words, Rows 1-3 combined show that the specific relevant documents found with CAL depend much more on the classification algorithm used than on the seed document(s).

Row 4 shows that SPL with two different training sets of 4,000 random documents generates results with modest agreement, and Row 5 shows that the agreement is comparable between SPL trained with 4,000 random documents and CAL with a single random seed, so the CAL result is not particularly abnormal compared to SPL.

Row 6 compares SPL with 4,000 random training documents to SPL with a training set from a very biased search query that yielded 272 documents with 51 of them being relevant (prevalence for the full document population is 1.1%).  The biased training set with SPL gives a result that is more different from SPL with random training than anything else tested.  In other words, with SPL any bias in the training set is reflected in the final result.  Row 7 shows that when that same biased training set is fed to CAL it has virtually no impact on which relevant documents are returned–the result is almost the same as the single seed with CAL.

Other Classification Algorithms

Will other classification algorithms work with a single relevant seed document?  I tried a different (not terribly good) classification algorithm and it did work with a single seed, although it took somewhat more document review to reach the same level of recall.  Keep in mind that all predictive coding software is different (there are many layers of algorithms and many different algorithms at each layer), so your mileage may vary.  You should consult with your vendor when considering a different workflow, and always test the results for each case to ensure there are no surprises.  The algorithm’s hyperparameters (e.g., amount of regularization) may need to be adjusted for optimal performance with single seed CAL.

Can It Fail?

A study by Grossman and Cormack shows a case where the single seed hypothesis failed for Topic 203 in Table 6 on page 159.  They used a random sample containing two relevant documents and applied CAL, but the system had to go through 88% of the document population to find 75% of the relevant documents.  It didn’t merely do a bad job of finding relevant documents–it actively avoided them (worse than reviewing documents randomly)!  Gordon Cormack was kind enough to reply to my emails about this odd occurrence.  I’m not sure this one is fully understood (machine learning can get a little complicated under the hood), but I think it’s fair to say that there was a strange confluence between some odd data and the way the algorithm interpreted it that allowed the algorithm to get stuck and not learn.

Here are some things that I could see (pure conjecture without any testing) potentially causing problems.  If the document population contains multiple languages, I would not expect a single seed to be enough.  One relevant seed document per language would be required because I don’t think documents in different languages would have enough words in common for the algorithm to wander across a language boundary.  A classification algorithm that is too stiff (e.g., too much regularization) may fail to find significant pockets of documents–you really want the system to easily get pushed in different directions when a new relevant document is discovered.  Finally, suppose a particular type of document in the population contains the same common chunk of text in every document while relevance is determined by some other part of the document.  The system may decide that the common chunk of text is such a strong non-relevance indicator that documents of that type are never probed enough to learn that some of them are relevant, so a relevant seed document from that document type may be needed.

Conclusions

My tests were performed on fairly clean documents with no near-dupes.  Actual e-discovery data can be uglier, and it can be hard to determine how a specific algorithm will react to it.  There is no guarantee that a single relevant seed document will be enough, but the experiments I’ve described should at least suggest that with CAL the seed set can be quite minimal, which allows relevant documents to be reviewed earlier in the process.  Avoiding large training sets also means that predictive coding with CAL can be worthwhile for smaller document collections.  Finally, with CAL, unlike SPL, the specific relevant documents that are found depend almost entirely on the algorithm used, not the seed set, so there is little point in arguing about seed set quality if CAL is used.

Can You Really Compete in TREC Retroactively?

I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football).  I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.

One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years.  Still, how could you do the comparison if you wanted to?  The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails.  Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.

Stratified sampling means that the documents are separated into mutually-exclusive buckets called “strata.”  To the degree that stratification manages to put similar things into the same stratum, it can produce better statistical estimates (smaller uncertainty for a given amount of document review).  The TREC Legal Track for 2009 created a stratum containing documents that all participants agreed were relevant.  It also created four strata containing documents that all but one participant predicted were relevant (there were four participants, so one stratum for each dissenting participant).  There were six strata where two participants agreed on relevance, and four strata where only one of the four participants predicted the documents were relevant.  Finally, there was one stratum containing documents that all participants predicted were non-relevant, which was called the “All-N” stratum.  So, for each stratum a particular participant either predicted that all of the documents were relevant or they predicted that all of the documents were non-relevant.  You can view details about the strata in table 21 on page 39 here.  Here is an example of what a stratification might look like for just two participants (the number of documents shown and percentage that are relevant may differ from the actual data):

stratification

A random subset of documents from each stratum was chosen and reviewed so that the percentage of the documents in the stratum that were relevant could be estimated.  Multiplying that percentage by the number of documents in the stratum gives an estimate for the number of relevant documents in the stratum.  Combining the results for the various strata allows precision and recall estimates to be computed for each participant.  How could this be done for a team that didn’t participate?  Before presenting some ideas, it will be useful to have some notation:

N[i] = number of documents in stratum i
n[i] = num docs in i that were assessed by TREC
n+[i] = num docs in i that TREC assessed as relevant
V[i] = num docs in i that vendor predicted were relevant
v[i] = num docs in i that vendor predicted were relevant and were assessed by TREC
v+[i] = num docs in i that vendor predicted were relevant and assessed as relevant by TREC

sampling

To make some of the discussion below more concrete, I’ll provide formulas for computing the number of true positives (TP), false positives (FP), and false negatives (FN).  The recall and precision can then be computed from:

R = TP / (TP + FN)
P = TP / (TP + FP)

Here are some ideas I came up with:

1) They could have checked to see which strata the documents they predicted to be relevant fell into and applied the percentages TREC computed to their data.  The problem is that since they probably didn’t identify all of the documents in a stratum as being relevant the percentage of documents that were estimated to be relevant for the stratum by TREC wouldn’t really be applicable.  If their system worked really well, they may have only predicted that the truly relevant documents from the stratum were relevant.  If their system worked badly, it may have predicted that only the truly non-relevant documents from the stratum were relevant.  This approach could give estimates that are systematically too low or too high.  Here are the relevant formulas (summing over strata, i):

TP = Sum{ V[i] * n+[i] / n[i] }
FP = Sum{ V[i] * (1 - n+[i]/n[i]) }
FN = Sum{ (N[i] - V[i]) * n+[i] / n[i] }
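
In code, approach (1) looks like the following.  This is my own sketch using the notation above, with made-up stratum data; it is not the vendor’s actual computation.

```python
def approach1(strata):
    # Each stratum i is a dict with N (docs in stratum), n (assessed by TREC),
    # n_pos (assessed relevant), and V (docs the vendor predicted relevant),
    # matching N[i], n[i], n+[i], and V[i] in the text.
    TP = sum(s["V"] * s["n_pos"] / s["n"] for s in strata)
    FP = sum(s["V"] * (1 - s["n_pos"] / s["n"]) for s in strata)
    FN = sum((s["N"] - s["V"]) * s["n_pos"] / s["n"] for s in strata)
    recall = TP / (TP + FN)
    precision = TP / (TP + FP)
    return recall, precision

# Hypothetical two-stratum example (numbers invented for illustration):
strata = [{"N": 1000, "n": 40, "n_pos": 30, "V": 800},
          {"N": 9000, "n": 90, "n_pos": 9, "V": 500}]
recall, precision = approach1(strata)
```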

2) Instead of using the percentages computed by TREC, they could have computed their own percentages by looking at only the documents in the stratum that they predicted were relevant and were reviewed by TREC to give a relevance determination.  This would eliminate the possible bias from approach (1), but it also means that the percentages would be computed from a smaller sample, so the uncertainty in the percentage that are relevant would be bigger.  The vendor didn’t provide confidence intervals for their results.  Here is how the computation would go:

TP = Sum{ V[i] * v+[i] / v[i] }
FP = Sum{ V[i] * (1 - v+[i]/v[i]) }
FN = Sum{ (N[i] - V[i]) * (n+[i] - v+[i]) / (n[i] - v[i]) }

It’s possible that for some strata there would be no overlap between the documents TREC assessed and the documents the vendor predicted to be relevant since TREC typically assessed only about 4% of each stratum for Topic 207 (except the All-N stratum, where they assessed only 0.46%).  This approach wouldn’t work for those strata since v[i] would be 0.  For strata where v[i] is 0, one might use approach (1) and hope it isn’t too wrong.
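
A sketch of approach (2), again using the notation above and falling back to the approach-(1) rate for strata where v[i] = 0 (my own illustration with invented numbers, not the vendor’s code):

```python
def approach2(strata):
    # Adds v (vendor-predicted-relevant docs assessed by TREC) and v_pos
    # (of those, assessed relevant), i.e. v[i] and v+[i] in the text.
    TP = FP = FN = 0.0
    for s in strata:
        if s["v"] > 0:
            rate = s["v_pos"] / s["v"]          # v+[i] / v[i]
        else:
            rate = s["n_pos"] / s["n"]          # fallback to approach (1)
        TP += s["V"] * rate
        FP += s["V"] * (1 - rate)
        # Relevant docs the vendor missed, estimated from the assessed
        # documents it did NOT predict relevant.
        rest = s["n"] - s["v"]
        rest_rate = (s["n_pos"] - s["v_pos"]) / rest if rest > 0 else 0.0
        FN += (s["N"] - s["V"]) * rest_rate
    return TP, FP, FN

# Hypothetical stratum: 1,000 docs, 40 assessed, 30 assessed relevant;
# the vendor predicted 800 relevant, 32 of which TREC assessed, 28 relevant.
stratum = {"N": 1000, "n": 40, "n_pos": 30, "V": 800, "v": 32, "v_pos": 28}
TP, FP, FN = approach2([stratum])  # 700.0, 100.0, 50.0
```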

3) A more sophisticated tweak on (2) would be to use the ratio n+[i]/n[i] from (1) to generate a Bayesian prior probability distribution for the proportion of documents predicted by the vendor to be relevant that actually are relevant, and then use v+[i] and v[i] to compute a posterior distribution for that proportion and use the mean of that distribution instead of v+[i]/v[i] in the computation. The idea is to have a smooth interpolation between using n+[i]/n[i] and using v+[i]/v[i] as the proportion of documents estimated to be relevant, where the interpolation would be closer to v+[i]/v[i] if v[i] is large (i.e., if there is enough data for v+[i]/v[i] to be reasonably accurate).  The result would be sensitive to choices made in creating the Bayesian prior (i.e., how much variance to give the probability distribution), however.
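
One concrete way to build that interpolation is a Beta prior centered on n+[i]/n[i] with a pseudo-count strength m.  This is only a sketch; the choice of m is exactly the prior-variance sensitivity mentioned above, and the value used here is arbitrary.

```python
def smoothed_rate(n_pos, n, v_pos, v, m=10):
    # Prior: Beta(alpha, beta) with mean n+[i]/n[i] and strength m pseudo-counts.
    # Returns the posterior mean after observing v+[i] relevant among v[i]
    # assessed vendor-predicted-relevant documents.
    prior_mean = n_pos / n
    alpha = m * prior_mean
    beta = m * (1 - prior_mean)
    return (alpha + v_pos) / (alpha + beta + v)
```

With v = 0 this reduces to the approach-(1) rate n+[i]/n[i]; as v grows it approaches v+[i]/v[i], which is the smooth interpolation described above.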

4) They could have ignored all of the documents that weren’t reviewed in TREC (over 500,000 of them) and just performed their predictions and analysis on the 3,709 documents that had relevance assessments (training documents should come from the set TREC didn’t assess and should be reviewed by the vendor to simulate actual training at TREC being done by the participants).  It would be very important to weight the results to compensate for the fact that those 3,709 documents didn’t all have the same probability of being selected for review.  TREC oversampled the documents that were predicted to be relevant compared to the remainder (i.e., the number of documents sampled from a stratum was not simply proportional to the number of documents in the stratum), which allowed their stratification scheme to do a good job of comparing the participating teams to each other at the expense of having large uncertainty for some quantities like the total number of relevant documents.  The prevalence of relevant documents in the full population was 1.5%, but 9.0% of the documents having relevance assessments were relevant.  Without weighting the results to compensate for the uneven sampling, you would be throwing away over half a million non-relevant documents without giving the system being tested the opportunity to incorrectly predict that some of them are relevant, which would lead to an inflated precision estimate.  The expression “shooting fish in a barrel” comes to mind.  Weighting would be accomplished by dividing by the probability of the document having been chosen (after this article was published I learned that this is called the Horvitz-Thompson estimator, and it is what the TREC evaluation toolkit uses), which is just n[i]/N[i], so the computation would be:

TP = Sum{ (N[i]/n[i]) * v+[i] }
FP = Sum{ (N[i]/n[i]) * (v[i] - v+[i]) }
FN = Sum{ (N[i]/n[i]) * (n+[i] - v+[i]) }

Note that if N[i]/n[i] is equal to V[i]/v[i], which is expected to be approximately true since the subset of a stratum chosen for assessment by TREC is random, the result would be equal to that from (2).  If N[i]/n[i] is not equal to V[i]/v[i] for a stratum, we would have the disturbing result that the estimate for TP+FP for that stratum would not equal the number of documents the vendor predicted to be relevant for that stratum, V[i].
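
Approach (4)’s Horvitz-Thompson weighting is compact in code (my sketch, with the same invented stratum numbers as before):

```python
def approach4(strata):
    # Weight each assessed document by the inverse of its sampling
    # probability n[i]/N[i] within its stratum (Horvitz-Thompson).
    TP = sum(s["N"] / s["n"] * s["v_pos"] for s in strata)
    FP = sum(s["N"] / s["n"] * (s["v"] - s["v_pos"]) for s in strata)
    FN = sum(s["N"] / s["n"] * (s["n_pos"] - s["v_pos"]) for s in strata)
    return TP, FP, FN

# Hypothetical stratum where N[i]/n[i] = V[i]/v[i] = 25, so the result
# matches approach (2), as the text notes.
stratum = {"N": 1000, "n": 40, "n_pos": 30, "V": 800, "v": 32, "v_pos": 28}
TP, FP, FN = approach4([stratum])  # 700.0, 100.0, 50.0
```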

5) The vendor could have ignored the TREC relevance determinations, simply doing their own.  That would be highly biased in the vendor’s favor because there would be a level of consistency between relevance determinations for the training data and testing data that did not exist for TREC participants.  At TREC the participants made their own relevance determinations to train their systems and a separate set of Topic Authorities made the final relevance judgments that determined the performance numbers.  To the degree that participants came to different conclusions about relevance compared to the Topic Authorities, their performance numbers would suffer.  A more subtle problem with this approach is that the vendor’s interpretation of the relevance criteria would inevitably be somewhat different from that of TREC assessors (studies have shown poor agreement between different review teams), which could make the classification task either easier or harder for a computer.  As an extreme example, if the vendor took all documents containing the word “football” to be relevant and all other documents to be non-relevant, it would be very easy for a predictive coding system to identify that pattern and achieve good performance numbers.

Approaches (1)-(4) would all give the same results for the original TREC participants because for each stratum they would either have V[i]=0 (so v[i]=0 and v+[i]=0) or they would have V[i]=N[i] (so v[i]=n[i] and v+[i]=n+[i]).  The approaches differ in how they account for the vendor predicting that only a subset of a stratum is relevant.  None of the approaches described are great.  Is there a better approach that I missed? TREC designed their strata to make the best possible comparisons between the participants.  It’s hard to imagine how an analysis could be as accurate for a system that was not taken into account in the stratification process.  If a vendor is tempted to make such comparisons, they should at least disclose their methodology and provide confidence intervals on their results so prospective clients can determine whether the performance numbers are actually meaningful.

Comments on Rio Tinto v. Vale and Sample Size

Judge Peck recently issued an opinion in Rio Tinto PLC v. Vale SA, et al, Case 1:14-cv-03042-RMB-AJP where he spent some time reflecting on the state of court acceptance of technology-assisted review (a.k.a. predictive coding).  The quote that will surely grab headlines is on page 2: “In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”  He lists the relevant cases and talks a bit about transparency and disclosing seed sets.  It is certainly worth reading.

Both parties in Rio Tinto v. Vale have agreed to disclose all non-privileged documents, including non-responsive documents, from their control sets, seed sets, and training sets. Judge Peck accepts their protocol because they both agree to it, but hints that disclosing seed sets may not really be necessary (p. 6, “…requesting parties can insure that training and review was done appropriately by other means…”).

I find one other aspect of the protocol the litigants proposed to be worthy of comment.  They make a point of defining a “Statistically Valid Sample” on p. 11 to be one that gives +/- 2% margin of error at 95% confidence, and even provide an equation to compute the sample size in footnote 2.  Their equation gives a sample size of at most 2,395 documents, depending on prevalence.  They then use the “Statistically Valid Sample” term in contexts where it isn’t (as they’ve defined it) directly appropriate.  I don’t know if this is just sloppiness (missing details about what they actually plan to do) or a misunderstanding of statistics.

For example, section 4.a.ii on p. 13 contemplates culling before application of predictive coding, and says they will “Review a Statistically Valid Sample from the Excluded Documents.”  Kudos to them for actually measuring how many relevant documents they are culling instead of just assuming that keyword search results should be good enough without any analysis, but 2,395 documents is not the right sample size.  The more documents you are culling, the more precisely you need to know what proportion of them were relevant in order to have a reasonably precise value for the number of relevant documents culled, which is what matters for computing recall.  In other words, a +/- 2% measurement on the culled set does not mean +/- 2% for recall.  I described a similar situation in more detail in my Predictive Coding Confusion article under the heading “Beware small percentages of large numbers.”  My eRecall: No Free Lunch article also discusses similar issues.
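
To see why a +/- 2% margin on the culled set is not a +/- 2% margin on recall, here is a quick back-of-the-envelope computation.  All of the numbers are hypothetical; none come from the case.

```python
culled = 500_000             # documents excluded by keyword culling (hypothetical)
found = 10_000               # relevant documents found by the review (hypothetical)
p_hat, margin = 0.01, 0.02   # culled-set sample: 1% relevant, +/- 2% margin

low_culled_rel = max(0.0, p_hat - margin) * culled   # as few as 0 relevant culled
high_culled_rel = (p_hat + margin) * culled          # as many as 15,000

recall_high = found / (found + low_culled_rel)       # 1.0  (100% recall)
recall_low = found / (found + high_culled_rel)       # 0.4  (40% recall)
```

A +/- 2% measurement on the culled proportion has turned into a 40% to 100% range on recall, and the larger the culled set, the worse the effect.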

Section 4.b on p. 13 says that the control set will be a Statistically Valid Sample that will be used to measure prevalence.  They explain in a separate letter to Judge Peck on p. 9 that the control set will be used to track progress by estimating precision and recall. Do they intend to use 2,395 (or fewer) documents for the control set?  Suppose only one of the 2,395 documents is actually relevant.  That would give a prevalence estimate of 0.0011% to 0.2321% with 95% confidence (via this calculator), which is certainly better than the required +/- 2%, but it is useless for tracking progress because the uncertainty is huge compared to the value itself.  If they had a million documents, the estimate would tell them that somewhere between 11 and 2,321 of them are relevant.  So, if they found 11 relevant documents with their predictive coding software they would estimate that they achieved somewhere between 0.5% and 100% recall.  To look at it a little differently, if they looked at their system’s prediction for the control set they would find that it either correctly predicted that the one relevant document was relevant (100% recall) or they would find that it was predicted incorrectly (0% recall), with dumb luck being a big factor in which result they got.
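
The interval quoted above is an exact (Clopper-Pearson) binomial confidence interval, which can be reproduced in a few lines of Python.  This is my own implementation via bisection, not the calculator linked above.

```python
from math import comb

def clopper_pearson(k, n, conf=0.95):
    # Exact binomial confidence interval for k successes in n trials,
    # found by bisection on the binomial CDF (no stats library needed).
    alpha = (1 - conf) / 2

    def cdf(p, j):  # P(X <= j) for X ~ Binomial(n, p)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(j + 1))

    def solve(j, target):  # cdf(p, j) decreases in p; find where it hits target
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            if cdf(mid, j) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha)
    upper = 1.0 if k == n else solve(k, alpha)
    return lower, upper

low, high = clopper_pearson(1, 2395)  # roughly 0.0011% to 0.232%
```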

Maybe they intended that the control set contain 2,395 relevant documents, which would give a recall estimate accurate to +/- 2% with 95% confidence (more precise than really seems worthwhile for a control set) by measuring the percentage of relevant documents in the control set that are predicted correctly.  If prevalence is 10%, the control set would need to contain about 23,950 documents to have 2,395 that are relevant.  If prevalence is 1%, the control set would require about 239,500 documents.  That sure seems like a lot of documents to review just to create a control set.  The point is that it is the number of relevant documents in the control set, not the number of documents, that determines how precisely the control set can measure recall.  Their protocol does say that the requesting party will have ten business days to check the control set if it is more than 4,000 documents, so it does seem that they’ve contemplated the possibility of using more than 2,395 documents in the control set, but the details of what they are really planning to do are missing.  Of course, the control set is there to help the producing party optimize their process, so it is their loss if they get it wrong (assuming there is separate testing that would detect the problem, as described in section 4.f).

Finally, section 4.f on p. 16 talks about taking a Statistically Valid Sample from the documents that are predicted to be non-relevant to estimate the number of relevant documents that were missed by predictive coding, leading to a recall estimate.  This has the same problem as the culling in section 4.a.ii — the size of the sample that is required to achieve a desired level of uncertainty in the recall depends on the size of the set of documents being culled, whether the culling is due to keyword searching before applying predictive coding or whether the culling is due to discarding documents that the predictive coding system predicts are non-relevant.

If the goal is to arrive at a reasonably precise estimate of recall (and, I’m certainly not arguing that +/- 2% should be required), it is important to keep track of how the uncertainty from each sample propagates through to the final recall result (e.g. it may be multiplied by some large number of culled documents) when choosing an appropriate sample size.  I may be nitpicking, but it strikes me as odd to lay out a specific formula for calculating sample size and then not mention that it cannot be applied directly for the sampling that is actually being contemplated.
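
To make the propagation concrete, here is a sketch with purely hypothetical numbers (the collection size, elusion interval, and review counts below are my own illustration, not figures from the protocol).  A narrow-looking interval on the culled set's elusion becomes a wide interval on recall once it is multiplied by a large culled set:

```python
# Hypothetical illustration only -- none of these figures come from the protocol.
culled = 900_000                         # documents excluded by keyword culling
found = 10_000                           # relevant documents located in the reviewed set
elusion_lo, elusion_hi = 0.0025, 0.0090  # assumed 95% CI on the culled set's elusion

missed_lo = elusion_lo * culled          # as few as 2,250 relevant documents missed
missed_hi = elusion_hi * culled          # or as many as 8,100
recall_lo = found / (found + missed_hi)  # worst case pairs with the high elusion bound
recall_hi = found / (found + missed_lo)
print(f"recall somewhere between {recall_lo:.1%} and {recall_hi:.1%}")
```

An elusion interval well under one percentage point wide still leaves recall uncertain by roughly 26 points, which is the point of choosing the sample size from the desired recall uncertainty rather than from a fixed formula.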

Highlights from the East Coast eDiscovery & IG Retreat 2014

The East Coast eDiscovery & IG Retreat is a new addition to the series of retreats held by Chris La Cour’s company, Ing3nious.  It was held at the Chatham Bars Inn in Chatham, Massachusetts.  This is the fourth year Chris has been organizing retreats, with the number of retreats and diversity of themes increasing in recent years, but this is the first one held outside of California.  As always, the venue was beautiful (more photos here), and the conference was informative and well-organized.  My notes below only capture a small amount of the information presented.  There were often two simultaneous sessions, so I couldn’t attend everything.

Keynote: Information Governance in a Predictive World

Big data isn’t just about applying the same old analysis to more data.  It’s about real-time or near-real-time, action-oriented analysis.  It allows asking new questions.  Technology now exists to allow police to scan license plates while driving through a parking lot to check for stolen cars.  Amazon may implement predictive shipping, where it ships a product to a customer before the customer orders it, which requires predictions with high confidence.  Face recognition technology is already in use at airports to track who is entering and whether they belong there.  When the speaker tweeted a complaint about an airline, he got a personalized reply from the airline on Twitter within 30 seconds, thanks to technology.

Proactive information governance will allow problems (sexual harassment, fraud) to be detected immediately so they can be corrected, instead of finding out later when there is a lawsuit.  It is possible to predict when someone will leave a company — they may become short with people, or spend more time on LinkedIn.

Change will come to the Federal Rules of Civil Procedure in 2015.  Rule 37(e) on sanctions for spoliation will change from “willful or bad faith” to “willful and bad faith” to encourage more deletion.  Have a policy to delete as much as possible, follow it, and be prepared to prove that you follow it.

Case Study: The Swiss Army Knife Approach to eDiscovery

Predictive coding failed for a case where the documents were very homogeneous (keywords didn’t work either; the document set was large, but not too large for eyes-on review).  It also failed when there were too many issue codes.  Predictive coding had problems with Spanish documents where there were many different dialects.  Predictive coding works well for prioritizing documents when there is a quick deposition schedule.

Email domain name filtering can be used to remove junk or to detect things like Gmail accounts that may be relevant.  Also, look for gaps in dates for emails — it could be that the person was on vacation, or maybe something was removed.

Clustering or near-dupe is useful to ensure consistency of redactions.

There is a benefit to reviewing the documents yourself to understand the case.  Not a fan of producing documents you haven’t seen.

May need to do predictive coding or clustering in a foreign country to minimize the amount of data that needs to be brought to the U.S.

Talk to custodians about acronyms.  Look at word lists — watch for unusual words, or how word usage corresponds to timing of events.

Real-World Impact of Information Governance on the eDiscovery Process

I couldn’t attend this one.

Future-think: How Will eDiscovery Be Different in 5-10 Years?

Cost cutting: cull in-house, and use targeted collections.  The big obstacle is getting people to learn technology.  Some people still print email to read it.

Analysis of cases is accelerating.  Even small cases are impacted by technology.  For example, information about a car accident is recorded.  In the future, ESI will be collected and a conclusion will be reached without a trial.  Human memory is obsolete — everything will be ESI.

Personal devices may be subject to discovery.  Employment agreements should make that clear in advance.

Privacy will be a big legal field.  Information is even collected about children — what they eat at school and when they are on the bus.

Will schools be held liable for student loans if the school fails to predict that the student will fail?

There are concerns about security of the cloud.

Businesses should demand change to make things more efficient, like arbitration.

Information Governance – Teams, Litigation Holds and Best Practices

I couldn’t attend this one.

Recent Developments in Technology Assisted Review — Is TAR Gaining Traction?

Three panelists said predictive coding didn’t work for them for identifying privileged documents.  One panelist (me) said he had a case where it worked well for priv docs.  Although I didn’t mention it at the time, since I couldn’t recall the reference off the top of my head, there is a paper with a nice discussion of the issues around finding priv docs that also claims success at using predictive coding.

What level of recall is necessary?  There seemed to be consensus that 75% was acceptable for most purposes, but people sometimes aim higher, depending on the circumstances (e.g., to ward off objections from the other side).

Is it OK to use keyword search to cull down the document population before applying predictive coding?  Must be careful not to cull away too many of the relevant documents (e.g., the Biomet case).

There is a lot of concern about being required to turn over training documents (especially non-responsive ones) to the other side.  I pointed out that it is not like turning over search terms.  It is very clear whether or not a document matches a search query, but disclosing training documents does not tell what predictions a particular piece of predictive coding software will give.  In fact, some software will (hopefully, rarely) fail to produce documents that are near-dupes of relevant training documents, so one should not assume that the disclosure of training documents guarantees anything about what will be produced.  There was concern that disclosure of non-relevant training documents by some parties will set a bad precedent.

Top 5 Trends in Discovery for 2014

I couldn’t attend this one.

Recruiting the best eDiscovery Team

Cybersecurity is a concern.  It is important to vet service providers.  Many law firms are not as well protected as one would like.

When required to give depositions about e-discovery process, paralegals can do well.  IT people tend to get stressed.  Lawyers can be too argumentative.

Need a champion to encourage everyone to get things done.

Legal hold is often drafted by outside counsel but enforced by in-house counsel.

Don’t have custodians do their own self-collection (e.g., based on search terms), but may have IT do collection (less expensive than using outside consultant, but must be able to explain what they did).

Information governance and changes to FRCP will reduce costs over the next five years.

eRecall: No Free Lunch

There has been some debate recently about the value of the “eRecall” method compared to the “Direct Recall” method for estimating the recall achieved with technology-assisted review. This article shows why eRecall requires sampling and reviewing just as many documents as the direct method if you want to achieve the same level of certainty in the result.

Here is the equation:
eRecall = (TotalRelevant – RelevantDocsMissed) / TotalRelevant

Rearranging a little:
eRecall = 1 – RelevantDocsMissed / TotalRelevant
= 1 – FractionMissed * TotalDocumentsCulled / TotalRelevant

It requires estimation (via sampling) of two quantities: the total number of relevant documents, and the number of relevant documents that were culled by the TAR tool. If your approach to TAR involves using only random sampling for training, you may have a very good estimate of the prevalence of relevant documents in the full population by simply measuring it on your (potentially large) training set, so you multiply the prevalence by the total number of documents to get TotalRelevant. To estimate the number of relevant documents missed (culled by TAR), you would need to review a random sample of the culled documents to measure the percentage of them that were relevant, i.e. FractionMissed (commonly known as the false omission rate or elusion). How many?

To simplify the argument, let’s assume that the total number of relevant documents is known exactly, so there is no need to worry about the fact that the equation involves a non-linear combination of two uncertain quantities.  Also, we’ll assume that the prevalence is low, so the number of documents culled will be nearly equal to the total number of documents.  For example, if the prevalence is 1% we might end up culling about 95% to 98% of the documents.  With this approximation, we have:

eRecall = 1 – FractionMissed / Prevalence

It is the very small prevalence value in the denominator that is the killer: it amplifies the error bar on FractionMissed, which means we have to take a ton of samples when measuring FractionMissed to achieve a reasonable error bar on eRecall.

Let’s try some specific numbers.  Suppose the prevalence is 1% and the recall (that we’re trying to estimate) happens to be 75%.  Measuring FractionMissed should give a result of about 0.25% if we take a big enough sample to have a reasonably accurate result.  If we sampled 4,000 documents from the culled set and 10 of them were relevant (i.e., 0.25%), the 95% confidence interval for FractionMissed would be (using an exact confidence interval calculator to avoid getting bad results when working with extreme values, as I advocated in a previous article):

FractionMissed = 0.12% to 0.46% with 95% confidence (4,000 samples)

Plugging those values into the eRecall equation gives a recall estimate ranging from 54% to 88% with 95% confidence.  Not a very tight error bar!
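
The arithmetic behind that range is just the eRecall equation applied to each end of the FractionMissed interval:

```python
prevalence = 0.01               # the assumed 1% prevalence from the example
fm_lo, fm_hi = 0.0012, 0.0046   # 95% CI on FractionMissed from the 4,000-document sample

erecall_lo = 1 - fm_hi / prevalence  # the high elusion bound gives the low recall bound
erecall_hi = 1 - fm_lo / prevalence
print(f"eRecall between {erecall_lo:.0%} and {erecall_hi:.0%}")  # 54% and 88%
```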

If the number of samples was increased to 40,000 (with 100 being relevant, so 0.25% again), we would have:

FractionMissed = 0.20% to 0.30% with 95% confidence (40,000 samples)

Plugging that into the eRecall equation gives a recall estimate ranging from 70% to 80% with 95% confidence, so we have now reached the ±5% level that people often aim for.

For comparison, the Direct Recall method would involve pulling a sample of 40,000 documents from the whole document set to identify roughly 400 random relevant documents, and finding that roughly 300 of the 400 were correctly predicted by the TAR system (i.e., 75% recall).  Using the calculator with a sample size of 400 and 300 relevant (“relevant” for the calculator means correctly-identified for our purposes here) gives a recall range of 70.5% to 79.2%.
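
As a sanity check on those numbers, the Direct Recall interval can be approximated with a Wilson score interval, which needs nothing beyond the Python standard library.  It is a normal approximation rather than the exact method used above, so the upper bound differs slightly from the exact calculator's 79.2%:

```python
import math
from statistics import NormalDist

def wilson(k, n, conf=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    center = (k + z * z / 2) / (n + z * z)
    halfwidth = z / (n + z * z) * math.sqrt(k * (n - k) / n + z * z / 4)
    return center - halfwidth, center + halfwidth

# 300 of 400 sampled relevant documents correctly identified by the TAR system:
lo, hi = wilson(300, 400)
print(f"direct recall between {lo:.1%} and {hi:.1%}")  # ~70.5% to ~79.0%
```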

So, the number of samples required for eRecall is about the same as the Direct Recall method if you require a comparable amount of certainty in the result.  There’s no free lunch to be found here.