Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review

A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding.  This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.

The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review.  It recaps the Da Silva Moore v. Publicis Group case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.

Paragraph 33 enumerates ten reasons for approving predictive coding.  The second reason on the list is:

There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.

The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results.  If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled "H." TAR systems analyzed by Grossman and Cormack are "UW" and "H5." Error bars are 95% confidence intervals.


If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents having relevance scores above an appropriately calculated cutoff.  Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the large number of non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes).  If it is possible (through enough effort) to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results?  TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery).  Effectively, participants were told to choose their relevance score cutoffs in a way that tried to balance the desire for high recall with other concerns (high precision).  If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.
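As a minimal sketch of the cutoff idea described above (not the method used by any particular TAR system), the following Python finds the relevance score cutoff needed to hit a target recall on a small labeled sample, then reports precision, recall, and F1 at that cutoff.  The function names and the sample data are hypothetical, and in practice the relevance labels would come from a reviewed random sample rather than being known up front.

```python
def cutoff_for_recall(scored_docs, target_recall):
    """Return the lowest score that must be included to reach target_recall.

    scored_docs: list of (relevance_score, is_relevant) pairs from a labeled sample.
    """
    total_relevant = sum(1 for _, rel in scored_docs if rel)
    if total_relevant == 0:
        raise ValueError("sample contains no relevant documents")
    found = 0
    # Walk from the highest score down until enough relevant documents are covered.
    for score, rel in sorted(scored_docs, key=lambda d: d[0], reverse=True):
        if rel:
            found += 1
        if found / total_relevant >= target_recall:
            return score
    # Only reached if target_recall is above 1.0: include everything.
    return min(score for score, _ in scored_docs)


def precision_recall_f1(scored_docs, cutoff):
    """Precision, recall, and F1 if every document scoring >= cutoff is produced."""
    produced = [rel for score, rel in scored_docs if score >= cutoff]
    true_pos = sum(produced)
    total_relevant = sum(1 for _, rel in scored_docs if rel)
    precision = true_pos / len(produced) if produced else 0.0
    recall = true_pos / total_relevant if total_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labeled sample: (relevance score, reviewer says relevant?)
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.5, False), (0.4, False), (0.3, True), (0.2, False)]

for target in (0.75, 1.0):
    cutoff = cutoff_for_recall(sample, target)
    p, r, f1 = precision_recall_f1(sample, cutoff)
    print(f"target recall {target:.2f}: cutoff {cutoff}, "
          f"precision {p:.2f}, recall {r:.2f}, F1 {f1:.2f}")
```

On this toy sample, pushing recall from 75% to 100% lowers the cutoff from 0.6 to 0.3 and drops precision from 0.75 to about 0.57, which is the trade-off a cutoff chosen to maximize F1 tries to balance.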

The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen).  The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article).  Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around.  You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall.  The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.

The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor.  Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search.  If we assume the high end of the price range applies to 3 million documents, that works out to about £0.16 (roughly $0.22) per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.

TAR 3.0 Performance

This article reviews TAR 1.0, 2.0, and the new TAR 3.0 workflow.  It then compares performance on seven categorization tasks of varying prevalence and difficulty.  You may find it useful to read my article on gain curves before reading this one.

In some circumstances it may be acceptable to produce documents without reviewing all of them.  Perhaps it is expected that there are no privileged documents among the custodians involved, or maybe it is believed that potentially privileged documents will be easy to find via some mechanism like analyzing email senders and recipients.  Maybe there is little concern that trade secrets or evidence of bad acts unrelated to the litigation will be revealed if some non-relevant documents are produced.  In such situations you are faced with a dilemma when choosing a predictive coding workflow.  The TAR 1.0 workflow allows documents to be produced without review, so there is potential for substantial savings if TAR 1.0 works well for the case in question, but TAR 1.0 sometimes doesn’t work well, especially when prevalence is low.  TAR 2.0 doesn’t really support producing documents without reviewing them, but it is usually much more efficient than TAR 1.0 if all documents that are predicted to be relevant will be reviewed, especially if the task is difficult or prevalence is low.

TAR 1.0 involves a fair amount of up-front investment in reviewing control set documents and training documents before you can tell whether it is going to work well enough to produce a substantial number of documents without reviewing them.  If you find that TAR 1.0 isn’t working well enough to avoid reviewing documents that will be produced (too many non-relevant documents would slip into the production) and you resign yourself to reviewing everything that is predicted to be relevant, you’ll end up reviewing more documents with TAR 1.0 than you would have with TAR 2.0.  Switching from TAR 1.0 to TAR 2.0 midstream is less efficient than starting with TAR 2.0. Whether you choose TAR 1.0 or TAR 2.0, it is possible that you could have done less document review if you had made the opposite choice (if you know up front that you will have to review all documents that will be produced due to the circumstances of the case, TAR 2.0 is almost certainly the better choice as far as efficiency is concerned).

TAR 3.0 solves the dilemma by providing high efficiency regardless of whether or not you end up reviewing all of the documents that will be produced.  You don’t have to guess which workflow to use and suffer poor efficiency if you are wrong about whether or not producing documents without reviewing them will be feasible.  Before jumping into the performance numbers, here is a summary of the workflows (you can find some related animations and discussion in the recording of my recent webinar):

TAR 1.0 involves a training phase followed by a review phase with a control set being used to determine the optimal point when you should switch from training to review.  The system no longer learns once the training phase is completed.  The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant.  The control set documents are not used to train the system.  They are used to assess the system’s predictions so training can be terminated when the benefits of additional training no longer outweigh the cost of additional training.  Training can be with randomly selected documents, known as Simple Passive Learning (SPL), or it can involve documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).

TAR 2.0 uses an approach called Continuous Active Learning (CAL), meaning that there is no separation between training and review–the system continues to learn throughout.  While many approaches may be used to select documents for review, a significant component of CAL is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions.  Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low.  Since there is no separation between training and review, TAR 2.0 does not require a control set.  Generating a control set can involve reviewing a large (especially when prevalence is low) number of non-relevant documents, so avoiding control sets is desirable.
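The predict, review, update cycle described above can be sketched as a small simulation.  This is only an illustration under strong assumptions: a crude word-weight score stands in for a real classifier, the documents and labels are invented, and the loop stops by checking true recall, which is only possible in a simulation where all labels are known.  The name cal_review and its parameters are hypothetical.

```python
from collections import Counter


def cal_review(docs, labels, seed_index, batch_size=2, target_recall=0.75):
    """Simulate a continuous active learning loop; return indices in review order.

    docs:   list of sets of words (toy documents)
    labels: list of bools, the reviewer's decision, consulted only when a
            document is "reviewed"
    """
    total_relevant = sum(labels)  # known only because this is a simulation
    if total_relevant == 0:
        raise ValueError("no relevant documents to find")
    weights = Counter()  # toy model: word -> (relevant hits - non-relevant hits)
    reviewed = []

    def review(i):
        # "Review" document i and immediately feed the decision back into the model.
        reviewed.append(i)
        delta = 1 if labels[i] else -1
        for word in docs[i]:
            weights[word] += delta

    review(seed_index)
    while sum(labels[i] for i in reviewed) / total_relevant < target_recall:
        remaining = [i for i in range(len(docs)) if i not in reviewed]
        if not remaining:
            break
        # Predict: rank unreviewed documents by current score, highest first.
        remaining.sort(key=lambda i: sum(weights[w] for w in docs[i]), reverse=True)
        # Review the top batch, which also updates the model.
        for i in remaining[:batch_size]:
            review(i)
    return reviewed


# Invented toy population: four relevant documents out of eight.
docs = [{"merger", "price"}, {"lunch", "friday"}, {"merger", "board"},
        {"price", "board"}, {"lunch", "menu"}, {"friday", "party"},
        {"board", "vote"}, {"menu", "party"}]
labels = [True, False, True, True, False, False, True, False]

order = cal_review(docs, labels, seed_index=0)
print("reviewed in order:", order)
print("relevant found:", sum(labels[i] for i in order), "of", sum(labels))
```

Starting from a single relevant seed document, each batch pulls in the documents that look most like the relevant ones already reviewed, and every review decision changes the ranking used for the next batch.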

TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space.  It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents are reviewed.  Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population.  There is no need for a control set–the system is well-trained when no additional relevant cluster centers can be found. Analysis of the cluster centers that were reviewed provides an estimate of the prevalence and the number of non-relevant documents that would be produced if documents were produced based purely on the predictions without human review.  The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or he/she can decide to review documents that have too much risk of being non-relevant (which can be used as additional training for the system, i.e., CAL).  The key point is that the user has the info he/she needs to make a decision about how to proceed after completing review of the cluster centers that are likely to be relevant, and nothing done before that point becomes invalidated by the decision (compare to starting with TAR 1.0, reviewing a control set, finding that the predictions aren’t good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless).

The table below shows the amount of document review required to reach 75% recall for seven categorization tasks with widely varying prevalence and difficulty.  Performance differences between CAL and non-CAL approaches tend to be larger if a higher recall target is chosen.  The document population is 100,000 news articles without dupes or near-dupes.  “Min Total Review” is the number of documents requiring review (training documents and control set if applicable) if all documents predicted to be relevant will be produced without review.  “Max Total Review” is the number of documents requiring review if all documents predicted to be relevant will be reviewed before production.  None of the results include review of statistical samples used to measure recall, which would be the same for all workflows.

Task                                 1       2       3       4       5       6       7
Prevalence                        6.9%    4.1%    2.9%    1.1%   0.68%   0.52%   0.32%
TAR 1.0 SPL
  Control Set                      300     500     700   1,800   3,000   3,900   6,200
  Training (Random)              1,000     300   6,000   3,000   1,000   4,000  12,000
  Review Phase                   9,500   4,400   9,100   4,400     900   9,800   2,900
  Min Total Review               1,300     800   6,700   4,800   4,000   7,900  18,200
  Max Total Review              10,800   5,200  15,800   9,200   4,900  17,700  21,100
TAR 3.0 SAL
  Training (Cluster Centers)       400     500     600     300     200     500     300
  Review Phase                   8,000   3,000  12,000   4,200     900   8,000   7,300
  Min Total Review                 400     500     600     300     200     500     300
  Max Total Review               8,400   3,500  12,600   4,500   1,100   8,500   7,600
TAR 3.0 CAL
  Training (Cluster Centers)       400     500     600     300     200     500     300
  Training + Review              7,000   3,000   6,700   2,400     900   3,300   1,400
  Total Review                   7,400   3,500   7,300   2,700   1,100   3,800   1,700

Producing documents without review with TAR 1.0 sometimes results in much less document review than using TAR 2.0 (which requires reviewing everything that will be produced), but sometimes TAR 2.0 requires less review.
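To make the definitions of Min Total Review and Max Total Review concrete, here is a short Python check that recomputes the TAR 1.0 SPL totals from the control set, training, and review phase rows of the table.  The variable names are mine; the numbers are copied from the table.

```python
# TAR 1.0 SPL rows from the table, tasks 1 through 7.
control_set = [300, 500, 700, 1800, 3000, 3900, 6200]
training = [1000, 300, 6000, 3000, 1000, 4000, 12000]
review_phase = [9500, 4400, 9100, 4400, 900, 9800, 2900]

# Min Total Review: only the control set and training documents are reviewed,
# and everything predicted to be relevant is produced without review.
min_total = [c + t for c, t in zip(control_set, training)]

# Max Total Review: the review phase documents are reviewed as well.
max_total = [m + r for m, r in zip(min_total, review_phase)]

for task, (lo, hi) in enumerate(zip(min_total, max_total), start=1):
    print(f"Task {task}: min {lo:,}  max {hi:,}")
```

The same pattern applies to the TAR 3.0 SAL rows, except that there is no control set, so Min Total Review is just the cluster center training.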


The size of the control set for TAR 1.0 was chosen so that it would contain approximately 20 relevant documents, so low prevalence requires a large control set.  Note that the control set size was chosen based on the assumption that it would be used only to measure changes in prediction quality.  If the control set will be used for other things, such as recall estimation, it needs to be larger.
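The relationship between prevalence and control set size can be sketched with one line of arithmetic: a random sample is expected to contain about 20 relevant documents when its size is 20 divided by the prevalence.  The function below is a hypothetical helper, and the sizes in the table are only approximately equal to what it returns.

```python
def control_set_size(prevalence, relevant_needed=20):
    """Expected random sample size needed to contain relevant_needed relevant docs."""
    if prevalence <= 0:
        raise ValueError("prevalence must be positive")
    return relevant_needed / prevalence


# Prevalence values for tasks 1 through 7, taken from the table.
prevalences = [0.069, 0.041, 0.029, 0.011, 0.0068, 0.0052, 0.0032]

for task, prevalence in enumerate(prevalences, start=1):
    print(f"Task {task}: prevalence {prevalence:.2%} -> "
          f"about {control_set_size(prevalence):,.0f} documents")
```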

The number of random training documents used in TAR 1.0 was chosen to minimize the Max Total Review result (see my article on gain curves for related discussion).  This minimizes total review cost if all documents predicted to be relevant will be reviewed and if the cost of reviewing documents in the training phase and review phase are the same.  If training documents will be reviewed by an expensive subject matter expert and the review phase will be performed by less expensive reviewers, the optimal amount of training will be different.  If documents predicted to be relevant won’t be reviewed before production, the optimal amount of training will also be different (and more subjective), but I kept the training the same when computing Min Total Review values.

The optimal number of training documents for TAR 1.0 varied greatly for different tasks, ranging from 300 to 12,000.  This should make it clear that there is no magic number of training documents that is appropriate for all projects.  This is also why TAR 1.0 requires a control set–the optimal amount of training must be measured.

The results labeled TAR 3.0 SAL come from terminating learning once the review of cluster centers is complete, which is appropriate if documents will be produced without review (Min Total Review).  The Max Total Review value for TAR 3.0 SAL tells you how much review would be required if you reviewed all documents predicted to be relevant but did not allow the system to learn from that review, which is useful to compare to the TAR 3.0 CAL result where learning is allowed to continue throughout.  In some cases where the categorization task is relatively easy (tasks 2 and 5) the extra learning from CAL has no benefit unless the target recall is very high.  In other cases CAL reduces review significantly.

I have not included TAR 2.0 in the table because the efficiency of TAR 2.0 with a small seed set (a single relevant document is enough) is virtually indistinguishable from the TAR 3.0 CAL results that are shown.  Once you start turning the CAL crank the system will quickly head toward the relevant documents that are easiest for the classification algorithm to identify, and feeding those documents back in for training quickly floods out the influence of the seed set you started with.  The only way to change the efficiency of CAL, aside from changing the software’s algorithms, is to waste time reviewing a large seed set that is less effective for learning than the documents that the algorithm would have chosen itself.  The training done by TAR 3.0 with cluster centers is highly effective for learning, so there is no wasted effort in reviewing those documents.

To illustrate the dilemma I pointed out at the beginning of the article, consider task 2.  The table shows that prevalence is 4.1%, so there are 4,100 relevant documents in the population of 100,000 documents.  To achieve 75% recall, we would need to find 3,075 relevant documents.  Some of the relevant documents will be found in the control set and the training set, but most will be found in the review phase.  The review phase involves 4,400 documents.  If we produce all of them without review, most of the produced documents will be relevant (3,075 out of a little more than 4,400).  TAR 1.0 would require review of only 800 documents for the training and control sets.  By contrast, TAR 2.0 (I’ll use the Total Review value for TAR 3.0 CAL as the TAR 2.0 result) would produce 3,075 relevant documents with no non-relevant ones (assuming no mistakes by the reviewer), but it would involve reviewing 3,500 documents.  TAR 1.0 was better than TAR 2.0 in this case (if producing over a thousand non-relevant documents is acceptable).  TAR 3.0 would have been an even better choice because it required review of only 500 documents (cluster centers) and it would have produced fewer non-relevant documents since the review phase would involve only 3,000 documents.

Next, consider task 6.  If all 9,800 documents in the review phase of TAR 1.0 were produced without review, most of the production would be non-relevant documents since there are only 520 relevant documents (prevalence is 0.52%) in the entire population!  That shameful production would occur after reviewing 7,900 documents for training and the control set, assuming you didn’t recognize the impending disaster and abort before getting that far.  Had you started with TAR 2.0, you could have had a clean (no non-relevant documents) production after reviewing just 3,800 documents.  With TAR 3.0 you would realize that producing documents without review wasn’t feasible after reviewing 500 cluster center documents and you would proceed with CAL, reviewing a total of 3,800 documents to get a clean production.
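The arithmetic behind the task 2 and task 6 examples can be reproduced with a short Python sketch.  It assumes the number of relevant documents in an unreviewed production cannot exceed the number needed to hit the recall target, so the non-relevant count it reports is a lower bound (some relevant documents are found in the training and control sets instead).  The function name is hypothetical.

```python
def unreviewed_production(population, prevalence, target_recall, review_phase_docs):
    """Return (total relevant, relevant needed for target, minimum non-relevant produced)."""
    total_relevant = round(population * prevalence)
    target_relevant = round(total_relevant * target_recall)
    # The production cannot contain more relevant documents than it has documents.
    max_relevant_produced = min(target_relevant, review_phase_docs)
    min_non_relevant = review_phase_docs - max_relevant_produced
    return total_relevant, target_relevant, min_non_relevant


# TAR 1.0 review phase sizes from the table: task 2 (4,400) and task 6 (9,800).
for task, prevalence, review_docs in [(2, 0.041, 4400), (6, 0.0052, 9800)]:
    total, needed, junk = unreviewed_production(100_000, prevalence, 0.75, review_docs)
    print(f"Task {task}: {total:,} relevant in population, {needed:,} needed, "
          f"at least {junk:,} of {review_docs:,} produced would be non-relevant")
```

For task 2 the lower bound is a little over a thousand non-relevant documents, while for task 6 nearly the entire unreviewed production would be non-relevant.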

Task 5 is interesting because production without review is feasible (but not great) with respect to the number of non-relevant documents that would be produced, but TAR 1.0 is so inefficient when prevalence is low that you would be better off using TAR 2.0.  TAR 2.0 would require reviewing 1,100 documents for a clean production, whereas TAR 1.0 would require reviewing 3,000 documents for just the control set!  TAR 3.0 beats them both, requiring review of just 200 cluster centers for a somewhat dirty production.

It is worth considering how the results might change with a larger document population.  If everything else remained the same (prevalence and difficulty of the categorization task), the size of the control set required would not change, and the number of training documents required would probably not change very much, but the number of documents involved in the review phase would increase in proportion to the size of the population, so the cost savings from being able to produce documents without reviewing them would be much larger.

In summary, TAR 1.0 gives the user the option to produce documents without reviewing them, but its efficiency is poor, especially when prevalence is low.  Although the number of training documents required for TAR 1.0 when prevalence is low can be reduced by using active learning (not examined in this article) instead of documents chosen randomly for training, TAR 1.0 is still stuck with the albatross of the control set dragging down efficiency.  In some cases (tasks 5, 6, and 7) the control set by itself requires more review labor than the entire document review using CAL.  TAR 2.0 is vastly more efficient than TAR 1.0 if you plan to review all of the documents that are predicted to be relevant, but it doesn’t provide the option to produce documents without reviewing them.  TAR 3.0 borrows some of the best aspects of both TAR 1.0 and 2.0.  When all documents that are candidates for production will be reviewed, TAR 3.0 with CAL is just as efficient as TAR 2.0 and has the added benefits of providing a prevalence estimate and a diverse early view of relevant documents.  When it is permissible to produce some documents without reviewing them, TAR 3.0 provides that capability with much better efficiency than TAR 1.0 due to its efficient training and elimination of the control set.

If you like graphs, the gain curves for all seven tasks are shown below.  Documents used for training are represented by solid lines, and documents not used for training are shown as dashed lines.  Dashed lines represent documents that could be produced without review if that is appropriate for the case.  A green dot is placed at the end of the review of cluster centers–this is the point where the TAR 3.0 SAL and TAR 3.0 CAL curves diverge, but sometimes they are so close together that it is hard to distinguish them without the dot.  Note that review of documents for control sets is not reflected in the gain curves, so the TAR 1.0 results require more document review than is implied by the curves.

Task 1. Prevalence is 6.9%.

Task 2. Prevalence is 4.1%.

Task 3. Prevalence is 2.9%.

Task 4. Prevalence is 1.1%.

Task 5. Prevalence is 0.68%.

Task 6. Prevalence is 0.52%.

Task 7. Prevalence is 0.32%.


Gain Curves

You may already be familiar with the precision-recall curve, which describes the performance of a predictive coding system.  Unfortunately, the precision-recall curve doesn’t (normally) display any information about the cost of training the system, so it isn’t convenient when you want to compare the effectiveness of different training methodologies.  This article looks at the gain curve, which is better suited for that purpose.

The gain curve shows how the recall achieved depends on the number of documents reviewed (slight caveat to that at the end of the article).  Recall is the percentage of all relevant documents that have been found.  High recall is important for defensibility.  Here is an example of a gain curve:


The first 12,000 documents reviewed in this example are randomly selected documents used to train the system.  Prevalence is very low in this case (0.32%), so finding relevant documents using random selection is hard.  The system needs to be exposed to a large enough number of relevant training documents for it to learn what they look like so it can make good predictions for the relevance of the remaining documents.

After the 12,000 training documents are reviewed the system orders the remaining documents to put the ones that are most likely to be relevant (based on patterns detected during training) at the top of the list.  To distinguish the training phase from the review phase I’ve shown the training phase as a solid line and review phase as a dashed line.  Review of the remaining documents starts at the top of the sorted list.  The gain curve is very steep at the beginning of the review phase because most of the documents being reviewed are relevant, so they have a big impact on recall.  As the review progresses the gain curve becomes less steep because you end up reviewing documents that are less likely to be relevant.  Review proceeds until a desired level of recall, such as 75% (the horizontal dotted line), is achieved.  The goal is to find the system and workflow that achieves the recall target at the lowest cost (i.e., the one that crosses the dotted line farthest to the left, with some caveats below).
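A gain curve like the one described above can be computed directly from the order in which documents were reviewed.  The following is a minimal Python sketch with invented data: the first four entries stand in for a random training phase and the rest for a ranked review phase.  The function names are hypothetical.

```python
def gain_curve(review_order):
    """Recall after each reviewed document.

    review_order: list of bools (True = relevant) covering the whole population,
    in the order the documents would be reviewed.
    """
    total_relevant = sum(review_order)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the population")
    curve = []
    found = 0
    for is_relevant in review_order:
        found += is_relevant
        curve.append(found / total_relevant)
    return curve


def docs_to_reach(curve, target_recall):
    """Number of documents reviewed when the curve first reaches target_recall."""
    for reviewed, recall in enumerate(curve, start=1):
        if recall >= target_recall:
            return reviewed
    return None


# Invented example: random training phase, then review in ranked order.
training_phase = [False, False, True, False]
review_phase = [True, True, False, True, False, False, False, False]
curve = gain_curve(training_phase + review_phase)

print("recall after each document:", [round(r, 2) for r in curve])
print("documents reviewed to reach 75% recall:", docs_to_reach(curve, 0.75))
```

Comparing workflows then amounts to comparing the value returned by docs_to_reach for the same recall target, subject to the caveats about review cost discussed in the article.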

What is the impact of using the same system with a larger or smaller number of randomly selected training documents?  This figure shows the gain curves for 9,000 and 15,000 training documents in addition to the 12,000 training document curve seen earlier:


If the goal is to reach 75% recall, 12,000 is the most efficient option among the three considered because it crosses the horizontal dotted line with the least document review.  If the target was a lower level of recall, such as 70%, 9,000 training documents would be a better choice.  A larger number of training documents usually leads to better predictions (the gain curve stays steep longer during the review phase), but there is a point where the improvement in the predictions isn’t worth the cost of reviewing additional training documents.

The discussion above assumed that the cost of reviewing a document during the training phase is the same as the cost of reviewing a document during the review phase.  That will not be the case if expensive subject matter experts are used to review the training documents and low-cost contract reviewers are used for the review phase.  In that situation, the optimal result is less straightforward to identify from the gain curve.
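To see why unequal review costs change the answer, here is a small Python sketch that weights training and review documents by different per-document rates.  Every number in it is invented for illustration (the review phase counts are not measurements from the figures), and the function name and rates are hypothetical.

```python
def total_cost(training_docs, review_docs, training_rate, review_rate):
    """Total review cost when training and review phases are billed differently."""
    return training_docs * training_rate + review_docs * review_rate


# Hypothetical options: training set size -> review phase documents needed
# to reach the recall target (illustrative values only).
options = {9_000: 7_000, 12_000: 2_900, 15_000: 1_500}


def cheapest(training_rate, review_rate):
    """Training set size with the lowest total cost for the given rates."""
    return min(options, key=lambda n: total_cost(n, options[n], training_rate, review_rate))


print("same rate for both phases:", cheapest(1.0, 1.0))
print("expert training costs 5x contract review:", cheapest(5.0, 1.0))
```

With equal rates the middle option wins in this made-up example, but once training documents cost five times as much to review, the smallest training set becomes the cheapest choice even though it requires far more review phase work.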

In some situations it may be possible to produce documents without reviewing them if there is no concern about disclosing privileged documents (because there are none or because they are expected to be easy to identify by looking at things like the sender/recipient email address) or non-relevant documents (because there is no concern about them containing trade secrets or evidence of bad acts not covered by the current litigation).  When it is okay to produce documents without reviewing them, the document review associated with the dashed part of the curve can be eliminated in whole or in part.  For example, documents predicted to be relevant with high confidence may be produced without review (unless they are identified as potentially privileged), whereas documents with a lower likelihood of being relevant might be reviewed to avoid disclosing too many non-relevant documents.  Again, the gain curve would not show the optimal choice in a direct way–you would need to balance the potential harm (even if small) of producing non-relevant documents against the cost of additional training.

The predictive coding process described in this article, random training documents followed by review (with no additional learning by the algorithm), is sometimes known as Simple Passive Learning (SPL), which is one example of a TAR 1.0 workflow.  To determine the optimal point to switch from training to review with TAR 1.0, a random set of documents known as a control set is reviewed and used to monitor learning progress by comparing the predictions for the control set documents to their actual relevance tags.  Other workflows and analysis of their efficiency via gain curves will be the subject of my next article.

TAR 3.0 and Training of Predictive Coding Systems (Webinar)

A recording of the webinar described below is available here.

Bill Dimm will be giving a webinar on training of predictive coding systems and the new TAR 3.0 workflow (not to be confused with Ralph Losey’s Predictive Coding 3.0).  This webinar should be accessible to those who are new to predictive coding while also providing new insights for those who are more experienced.  Expect lots of graphs and animations.  It will be held on December 10, 2015 at 1:00pm EST, and it will be recorded.  Those who register will be able to view the video during the three days following the webinar, so please register even if the live event doesn’t fit into your schedule.



Highlights from the East Coast eDiscovery & IG Retreat 2015

This was the second year that Ing3nious has held a retreat on the east coast, with other events organized by Chris LaCour held in California going back five years.  The event was held at the Wequassett Resort in Cape Cod.  As always, the event was well-organized and the location was beautiful.  Luckily, the weather was fantastic.  My notes below only capture a small amount of the information presented. There were often two simultaneous sessions, so I couldn’t attend everything.

Keynote: Away with Words: The Myths and Misnomers of Conventional Search Strategies

Thomas Barnett started the keynote by asking the audience to suggest keyword searches to find items discussing the meaning of existence.  He then said that he had in mind “to be, or not to be” and pointed out that it contains only stop words.  He then described unsupervised (clustering) and supervised (predictive coding) machine learning.  He talked about entity extraction, meaning the identification of dates and names of people and companies in a document.  He talked about sentiment analysis and how a person might change their language when they are doing something wrong.  He also pointed out that a product may have different names in different countries, which can make it easy to miss things with keyword search.

Advancing Discovery: What if Lawyers are the Problem?

I couldn’t attend this one.

Turbulent Sea in the Safe Harbor.  Is There a Lifeboat for Transfers of EU Data to the US?

Max Schrems complained to the Irish Data Protection Commissioner 22 times about the Safe Harbor Privacy Principles failing to protect the privacy of E.U. citizens’ data when companies move the data to the U.S.  After Snowden released information on NSA data collection, Schrems complained a 23rd time.  Ultimately, a judge found the Safe Harbor to be invalid.

Companies must certify to the Department of Commerce that they will adhere to the Safe Harbor Privacy Principles.  Many e-discovery service providers were pressured to certify so they could bring data to the U.S. for discovery even though e-discovery usage of the data would involve very bad privacy violations.

Some argue that there is no other legal mechanism that could work for bringing data to the U.S. because the U.S. government can pick up everything, so no guarantees about privacy can be made.  The best option would be to get consent from the person, but it must be done in a very clear manner specifying what data and who will see it.  An employer asking an employee for consent would be seen as coercive.  It will be hard to get consent from someone if you are investigating them for criminal activity.

There is really no way to move data from Europe to the U.S. for litigation without violating the law.  Consent would be required not just from the custodian but from everyone in the emails.  Some countries (France, Germany, and Switzerland) have blocking statutes that make taking the data a criminal offense.

Ethics: eDiscovery, Social Media, and the Internet of Things

I couldn’t attend this one.

Understanding the Data Visualization Trend in Legal

I was on this panel, so I didn’t take notes.  I did mention Vischeck, which allows you to see what your graphics would look like to a color-blind person.

Information Governance – How Do You Eat an Elephant?

I couldn’t attend this one.

Email Laws, IG Policies and the “Smoking Gun”

There has been confusion over what should be considered a record.  In the past, emails that were considered to be records were printed and stored.  Now email should be considered to be a record by default.  30-day retention policies are hard to defend.  Keep deleted emails for 60 days and use analytics to identify emails that employees should not have deleted so they can be saved.  Use automated logging to show compliance.

Protecting Enterprise Data Across Partners, Providers and the Planet

I couldn’t attend this one.

Defeating Analysis Paralysis – Strategies and Success Stories for Implementing IG Policies and Using TAR / Data Analytics

Berkeley Research Group finds that most companies are still keeping everything.  The longer data is kept, the less value it has to the company and the more risk it poses (ediscovery cost and privacy issues if there is a breach).  Different departments within the company may want different retention rules.  Breaches cost the company in lawsuits and in reputation.  The E.U. requires breach notification within 24 hours.

Having employees tag documents gives low-quality tags (they aren’t lawyers), but retention based on those tags is good enough to satisfy the court.  Need employees to follow the retention policy, so keep it simple.  Some speculate that insurance providers may end up driving info governance by forcing their clients to do it.

The Coalition of Technology Resources for Lawyers found that 56% of legal departments are reporting that they use analytics.  Clustering can help with investigation and determining search terms.  Look at email domain names to cull.  Note that email journaling keeps everything.  Analytics technology has improved, so if you were disappointed in the past you might want to try it again.

How Automated Digital Discovery is Changing eDiscovery as We Know It

I couldn’t attend this one.

Creating Order Out of Chaos: Framing and Taming Data Discovery Challenges in Expedited Matters

This panel started by walking through a (hypothetical?) investigation of a head of operations who left and joined a competitor in violation of a non-compete agreement that was determined to be unenforceable.  Did he transfer company data to the competitor?

Look for evidence that USB devices were used on the company laptop.  Unfortunately, you can’t tell what was copied onto them.  Look for attempts to hide what was done, such as removal of USB insertion data from the current registry (but failing to remove from the registry snapshot).  Look at the WiFi connection history for connections to the competitor’s network.  It is very important to explain the situation to the forensics person and communicate with him/her frequently about what you each have found in order to develop a picture of what actually happened.

If you hire someone from a competitor and there is suspicion that they took data from their previous employer, ambush them and take all their devices before they have a chance to destroy anything.  This will show the judge that you were not complicit.

When investigating someone who quit on bad terms, look for deals with “special terms” or side letter deals — they may be a sign of fraud.  Be careful about any applicable European laws.  Europe says you can’t move the data to the U.S., but the SEC doesn’t care.  Can you use a review tool in the U.S. with the data in Europe?  Officially, no, but it is less bad than moving the data.  Everyone says you can’t produce the data from Europe, but everyone does.

Make sure your agreements are up to date and are written by the attorney that will litigate them.

Just Patch and Pray?

A study by Verizon found that 90% of breaches are caused by employees.  Info governance can reduce risk.  Keeping everything is risky due to e-discovery, risk of breach, and having to explain loss of old data to customers.

Email problems include bad passwords, use of the same password on multiple websites so having one hacked can allow access to others, and getting inside the network (emailed malware).  2-factor authentication is recommended.  Don’t send an email to the SEC with BCC to the client or the client might hit reply-all and say something problematic — instead, email only the SEC and forward a copy to the client later.

Mobile technology can create discovery headaches, needs to be managed/updated/wiped remotely, and can easily be lost.  Encrypt, audit, and apply anti-malware.  BYOD should be limited to enterprise-ready devices.  Avoid insecure WiFi.  Control access to enterprise data.  Secure data in transit.  Ensure that devices get updated/upgraded.

Unaware or non-compliant employees need training.  When training to spot phishing emails, services can test the employees by sending phishing emails that report who clicked on them.

Vendors and third parties that handle enterprise data can be a problem.  Regulators require vendor oversight.  Limit access to necessary systems.  Segregate sensitive data.  Beware of payroll vendors and the possibility of identity theft from the data they hold.  Make sure cybersecurity insurance policy covers vendors.

Employees want data access from anywhere.  Encrypting email is hard — better to use collaborative workspaces.  Home networks should be protected.  Don’t use the neighbor’s Internet connection.

After having a breach, 39% of companies still don’t form a response plan.  There is no federal data breach notification law, but many states have such laws.  You may need to notify employees, customers, and the attorney general in some specific time frame.  Also notify your insurance company.

Mergers & Acquisitions: Strategy and Execution Concerns

I couldn’t attend this one.

Disclosing Seed Sets and the Illusion of Transparency

There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding.  This article explains why disclosing seed sets may provide far less transparency than people think.

The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.

Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling).  Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated.  If the requesting party is a competitor, the non-relevant training documents may reveal information that helps them compete.  Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.

Other people are far more qualified to debate the legal and strategic issues than I am.  Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think.  Some reasons why it does not:

  1. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies.  A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).”  Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”?  I doubt it.
  2. You cannot tell whether the seed set is sufficient if you don’t have access to the full document population.  There could be substantial pockets of important documents that are not represented in the seed set–how would you know?  The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on number of relevant documents, not their importance) of the predictions the training set will produce.  The requesting party cannot do that–they have no way of assessing adequacy of the training set other than wild guessing.
  3. You cannot tell whether the seed set is biased just by looking at it.  Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under or over represented?  If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance.  Would you notice that all the relevant training documents happen to contain the word “Friday”?  I doubt it.
  4. Suppose you see an important document in the seed set that was correctly tagged as being relevant.  Can you rest assured that similar documents will be produced?  Maybe not.  Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document.  I described how that could happen in this article.  How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
  5. Poor training doesn’t necessarily mean that relevant documents will be missed.  If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important.  Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query.  Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced.  A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough.  If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall.  Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
  6. One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents.  Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population?  Is the requesting party going to complain that there is no smoking gun in the training set?  The producing party can simply respond that there are no smoking guns in the full population.
  7. The seed set may have virtually no impact on the final result.  To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion).  If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training.  In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article).  On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used.  With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions.  This is iterated many times.  So all documents that are reviewed are used as training documents.  The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way.  Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
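The cutoff argument in item 5 can be made concrete with a toy calculation.  This is only a sketch: the scores and relevance labels below are invented, and the function name is hypothetical.  In a real matter the labels are not known for the whole population, so recall would have to be estimated by sampling.

```python
def recall_and_review_at_cutoff(scored_docs, cutoff):
    """Return (recall, number of documents to review) for a score cutoff.

    scored_docs is a list of (relevance_score, is_relevant) pairs.
    Every document scoring at or above the cutoff is reviewed.
    """
    total_relevant = sum(1 for _, relevant in scored_docs if relevant)
    above = [(score, relevant) for score, relevant in scored_docs if score >= cutoff]
    found = sum(1 for _, relevant in above if relevant)
    recall = found / total_relevant if total_relevant else 0.0
    return recall, len(above)


# Invented example: the same 10 documents (4 relevant) scored by two systems.
well_trained = [(0.9, True), (0.8, True), (0.7, True), (0.6, True),
                (0.5, False), (0.4, False), (0.3, False), (0.2, False),
                (0.1, False), (0.05, False)]
poorly_trained = [(0.9, True), (0.8, False), (0.7, False), (0.6, True),
                  (0.5, False), (0.4, True), (0.3, False), (0.2, True),
                  (0.1, False), (0.05, False)]

# Both systems reach 75% recall, but the poorly trained one needs a lower
# cutoff and therefore twice as much review.
print(recall_and_review_at_cutoff(well_trained, 0.7))    # (0.75, 3)
print(recall_and_review_at_cutoff(poorly_trained, 0.4))  # (0.75, 6)
```

The requesting party receives the same three relevant documents either way; the difference shows up in the producing party's review bill.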

Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand.  It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly.  Recall isn’t the whole story–it measures the number of relevant documents found, not their importance.  It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found.  The point is that you should judge the production by analyzing the system’s output, not the training data that was input.

Detecting Fraud Using Benford’s Law: Mathematical Details

When people fabricate numbers for fraudulent purposes they often fail to take Benford’s Law into account, making it possible to detect the fraud.  This article is a supplement to my article “Detecting Fraud Using Benford’s Law” (if the link doesn’t take you directly to the right page, it is PDF page number 69 or printed page number 67) from the Summer 2015 issue of Criminal Justice.

Benford’s Law says that naturally occurring numbers that span several orders of magnitude (i.e., differing numbers of digits, or differing powers of 10 when written in scientific notation like 3.15 x 10^2) should start with “1” 30.1% of the time, and they should start with “9” only 4.6% of the time.  The probability of each leading digit is given in this chart:
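The same probabilities can be tabulated with a few lines of Python.  This is a minimal sketch that uses the formula derived later in this article, log10(D+1) - log10(D); the function name is my own choice.

```python
import math


def benford_first_digit(d: int) -> float:
    """Probability that the leading digit is d (1 through 9) under Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(d + 1) - math.log10(d)


# Print the expected share of numbers starting with each digit.
for d in range(1, 10):
    print(f"{d}: {benford_first_digit(d):.1%}")
# 1: 30.1%
# 2: 17.6%
# 3: 12.5%
# 4: 9.7%
# 5: 7.9%
# 6: 6.7%
# 7: 5.8%
# 8: 5.1%
# 9: 4.6%
```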


Someone who attempts to commit fraud by fabricating numbers (e.g., fake invoices or accounting entries) without knowing Benford’s Law will probably generate numbers that don’t have the expected probability distribution.  They might, for example, assume that numbers starting with “1” should have the same probability as numbers starting with any other digit, resulting in their fraudulent numbers looking very suspicious to someone who knows Benford’s Law.
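One way to quantify "looking suspicious" is a chi-square comparison of the observed leading-digit counts against the counts Benford's Law predicts.  The sketch below is my own illustration rather than a method taken from the Criminal Justice article, the function names are hypothetical, and the two data sets are invented.  A large statistic is a reason to look more closely, not proof of fraud.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (9 possible leading digits minus 1).
CRITICAL_VALUE_5_PERCENT = 15.507


def leading_digit(value) -> int:
    """Return the first nonzero digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. '3.15...e+02'.
    return int(f"{abs(value):.15e}"[0])


def chi_square_vs_benford(values) -> float:
    """Chi-square statistic comparing leading digits of values to Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    return statistic


# Powers of 2 are a standard example of numbers that follow Benford's Law.
natural_looking = [2 ** k for k in range(1, 201)]
# Invented "fabricated" amounts where every leading digit is equally common.
fabricated = [d * 100 + 7 for d in range(1, 10)] * 30

print(chi_square_vs_benford(natural_looking) > CRITICAL_VALUE_5_PERCENT)  # False
print(chi_square_vs_benford(fabricated) > CRITICAL_VALUE_5_PERCENT)       # True
```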

The Criminal Justice article details the history of Benford’s Law and explains when Benford’s Law is expected to be applicable.  What I’ll add here is more mathematical detail on how the probability of a particular leading digit, or sequence of digits, can be computed.

The key assumption behind Benford’s Law is scale invariance, meaning that things shouldn’t change if we switch to a different unit of measure.  If we convert a large set of monetary values from dollars to yen, or pesos, or any other currency (real or concocted), the percentage of values starting with a particular digit should stay (approximately) the same.  Suppose we convert from dollars to a currency that is worth half as much.  An item that costs $1 will cost 2 units of the new currency.  An item that costs $1.99 will cost 3.98 units of the new currency.  Likewise, $1000 becomes 2000 units of the new currency, and $1999 becomes 3998 units of the new currency.  So the probability of a number starting with “1” has to equal the sum of the probabilities of numbers starting with “2” or “3” if the probability of a particular digit is to remain unchanged by switching currencies.  The probabilities from the bar chart above behave as expected (30.1% = 17.6% + 12.5%).
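The currency example can be checked numerically.  The sketch below assumes, purely for illustration, that the dollar amounts are spread evenly on a logarithmic scale between $1 and $1,000,000 (a distribution that follows Benford's Law), then doubles every value and compares the leading-digit shares.  The variable and function names are my own.

```python
import random
from collections import Counter


def leading_digit(value: float) -> int:
    """Return the first nonzero digit of a nonzero number."""
    return int(f"{abs(value):.15e}"[0])


def leading_digit_shares(values):
    """Fraction of values starting with each digit 1 through 9."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


random.seed(0)  # fixed seed so the run is repeatable
# Assumed data: amounts spanning six orders of magnitude, uniform in log scale.
dollars = [10 ** random.uniform(0, 6) for _ in range(100_000)]
# A currency worth half as much: every amount doubles.
half_value_currency = [2 * v for v in dollars]

before = leading_digit_shares(dollars)
after = leading_digit_shares(half_value_currency)
for d in range(1, 10):
    print(f"{d}: {before[d]:.3f} in dollars, {after[d]:.3f} after conversion")
```

The two columns should agree to within sampling noise, with roughly 30% of the values starting with “1” in both currencies.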

To prove that scale invariance leads to the probabilities predicted by Benford’s Law, start by converting all possible numbers to scientific notation (e.g. 315 is written as 3.15 x 10^2) and realize that the power of 10 doesn’t matter when our only concern is the probability of a certain leading digit.  So all numbers map to the interval [1,10) as shown in this figure:


Next, assume there is some function, f(x), that gives the probability of each possible set of leading digits (technically a probability density function), so f(4.25) accounts for the probability of finding a value to be 0.0425, 0.425, 4.25, 42.5, 425, 4250, etc.  Our goal is to find f(x).  This graph illustrates the constraint that scale invariance puts on f(x):


The area under the f(x) curve between x=2 and x=2.5, shown in red, must equal the area between x=4 and x=5, shown in orange, because a change in scale that multiplies all values by 2 will map the values from the red region into the orange region.  Such relationships between areas under various parts of the curve must be satisfied for any change of scale, not just a factor of two.

Finally, let’s get into the gory math and prove Benford’s Law (warning: calculus!).  The probability, P(D), of a number starting with digit D is the area under the f(x) curve from D to D+1:

P(D) = \int_D^{D+1} f(x) \,dx

Assuming that scale invariance holds, the probability has to stay the same if we change scale such that all values are multiplied by β:

P(D) = \int_{\beta D}^{\beta (D+1)} f(x) \,dx

The equation above must be true for any β, so the derivative with respect to β must be zero:

\frac{\partial}{\partial \beta} P(D) = 0 \ \ \ \Rightarrow\ \ \ (D + 1) f\left(\beta(D + 1)\right) - D f(\beta D) = 0

The equation above is satisfied if f(x)=c/x, where c is a constant.  The total area under the f(x) curve must be 1 because it is the probability that a number will start with any possible set of digits, so that determines the value of c to be 1/ln(10), i.e. 1 over the natural logarithm of 10:

\int_1^{10} f(x) \,dx = 1 \ \ \ \Rightarrow\ \ \ f(x) = \frac{1}{x \ln(10)}
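For readers who want the two steps above spelled out, here is a short check.  Substituting f(x)=c/x into the derivative condition makes both terms equal to c/β, so they cancel for every β and every D, and the normalization integral then fixes c:

```latex
(D + 1) \frac{c}{\beta (D + 1)} - D \frac{c}{\beta D} = \frac{c}{\beta} - \frac{c}{\beta} = 0

\int_1^{10} \frac{c}{x} \,dx = c \left[ \ln(10) - \ln(1) \right] = c \ln(10) = 1 \ \ \ \Rightarrow\ \ \ c = \frac{1}{\ln(10)}
```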

Finally, plug f(x) into our first equation and integrate to get a result in terms of base-10 logarithms:

P(D) = \frac{\ln(D+1) - \ln(D)}{\ln(10)} = \log_{10}(D + 1) - \log_{10}(D)

Knowing f(x), we can compute the probability of finding a number with any sequence of initial digits.  To find the probability of starting with 2 we integrated from 2 to 3.  To find the probability of starting with the two digits 24, we integrate f(x) from 2.4 to 2.5.  To find the probability of starting with the three digits 247, we integrate f(x) from 2.47 to 2.48.  The general equation for two leading digits, D1D2, is:

P(D_1D_2) = \log_{10}(D_1.D_2 + 0.1) - \log_{10}(D_1.D_2)

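Multiplying both arguments by 10 adds 1 to each logarithm and leaves their difference unchanged, so this is equivalent to the following, where D1D2 is now read as a two-digit integer: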

P(D_1D_2) = \log_{10}(D_1D_2 + 1) - \log_{10}(D_1D_2)

For example, the probability of a number starting with “2” followed by “4” is log10(25)-log10(24) = 1.77%.

Similarly, the equation for three leading digits, D1D2D3, is:

P(D_1D_2D_3) = \log_{10}(D_1D_2D_3 + 1) - \log_{10}(D_1D_2D_3)
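The one-, two-, and three-digit formulas all have the same form, so a single function covers any sequence of leading digits.  This is a minimal sketch; the function name is my own, and the digit sequence is passed as an integer (so 24 means “2” followed by “4”).

```python
import math


def benford_prefix_probability(prefix: int) -> float:
    """Probability that a number's leading digits are the digits of prefix."""
    if prefix < 1:
        raise ValueError("prefix must be a positive integer such as 2, 24, or 247")
    return math.log10(prefix + 1) - math.log10(prefix)


print(f"{benford_prefix_probability(2):.2%}")   # 17.61%
print(f"{benford_prefix_probability(24):.2%}")  # 1.77%

# Consistency check: the two-digit probabilities for 20 through 29 must add
# up to the probability of starting with "2".
two_digit_total = sum(benford_prefix_probability(p) for p in range(20, 30))
print(abs(two_digit_total - benford_prefix_probability(2)) < 1e-12)  # True
```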