Monthly Archives: November 2015

TAR 3.0 and Training of Predictive Coding Systems (Webinar)

A recording of the webinar described below is available here.

Bill Dimm will be giving a webinar on training of predictive coding systems and the new TAR 3.0 workflow (not to be confused with Ralph Losey’s Predictive Coding 3.0).  This webinar should be accessible to those who are new to predictive coding while also providing new insights for those who are more experienced.  Expect lots of graphs and animations.  It will be held on December 10, 2015 at 1:00pm EST, and it will be recorded.  Those who register will be able to view the video during the three days following the webinar, so please register even if the live event doesn’t fit into your schedule.



Highlights from the East Coast eDiscovery & IG Retreat 2015

This was the second year that Ing3nious has held a retreat on the east coast, with other events organized by Chris LaCour held in California going back five years.  The event was held at the Wequassett Resort in Cape Cod.  As always, the event was well-organized and the location was beautiful.  Luckily, the weather was fantastic.  My notes below only capture a small amount of the information presented.  There were often two simultaneous sessions, so I couldn’t attend everything.

Keynote: Away with Words: The Myths and Misnomers of Conventional Search Strategies

Thomas Barnett started the keynote by asking the audience to suggest keyword searches to find items discussing the meaning of existence.  He then said that he had in mind “to be, or not to be” and pointed out that it contains only stop words.  He then described unsupervised (clustering) and supervised (predictive coding) machine learning.  He talked about entity extraction, meaning the identification of dates and names of people and companies in a document.  He talked about sentiment analysis and how a person might change their language when they are doing something wrong.  He also pointed out that a product may have different names in different countries, which can make it easy to miss things with keyword search.
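
To make the stop word point concrete, here is a minimal Python sketch of what can happen to that query.  The stop word list and the `indexable_terms` function are invented for illustration; real search engines use their own lists and tokenizers, and some do index stop words.

```python
# Hypothetical, abbreviated stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"}

def indexable_terms(query: str) -> list[str]:
    """Lowercase the query, strip punctuation, and drop stop words."""
    words = [w.strip(".,;:!?\"'") for w in query.lower().split()]
    return [w for w in words if w and w not in STOP_WORDS]

print(indexable_terms("to be, or not to be"))       # []
print(indexable_terms("the meaning of existence"))  # ['meaning', 'existence']
```

With a list like this one, nothing is left of the famous phrase to search on, while the more literal query keeps two terms.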

Advancing Discovery: What if Lawyers are the Problem?

I couldn’t attend this one.

Turbulent Sea in the Safe Harbor.  Is There a Lifeboat for Transfers of EU Data to the US?

Max Schrems complained to the Irish Data Protection Commissioner 22 times about the Safe Harbor Privacy Principles failing to protect the privacy of E.U. citizens’ data when companies move the data to the U.S.  After Snowden released information on NSA data collection, Schrems complained a 23rd time.  Ultimately, a judge found the Safe Harbor to be invalid.

Companies must certify to the Department of Commerce that they will adhere to the Safe Harbor Privacy Principles.  Many e-discovery service providers were pressured to certify so they could bring data to the U.S. for discovery even though e-discovery usage of the data would involve very bad privacy violations.

Some argue that there is no other legal mechanism that could work for bringing data to the U.S. because the U.S. government can pick up everything, so no guarantees about privacy can be made.  The best option would be to get consent from the person, but it must be done in a very clear manner specifying what data and who will see it.  An employer asking an employee for consent would be seen as coercive.  It will be hard to get consent from someone if you are investigating them for criminal activity.

There is really no way to move data from Europe to the U.S. for litigation without violating the law.  Consent would be required not just from the custodian but from everyone in the emails.  Some countries (France, Germany, and Switzerland) have blocking statutes that make taking the data a criminal offense.

Ethics: eDiscovery, Social Media, and the Internet of Things

I couldn’t attend this one.

Understanding the Data Visualization Trend in Legal

I was on this panel, so I didn’t take notes.  I did mention Vischeck, which allows you to see what your graphics would look like to a color-blind person.

Information Governance – How Do You Eat an Elephant?

I couldn’t attend this one.

Email Laws, IG Policies and the “Smoking Gun”

There has been confusion over what should be considered a record.  In the past, emails that were considered to be records were printed and stored.  Now email should be considered to be a record by default.  30-day retention policies are hard to defend.  Keep deleted emails for 60 days and use analytics to identify emails that employees should not have deleted so they can be saved.  Use automated logging to show compliance.

Protecting Enterprise Data Across Partners, Providers and the Planet

I couldn’t attend this one.

Defeating Analysis Paralysis – Strategies and Success Stories for Implementing IG Policies and Using TAR / Data Analytics

Berkeley Research Group finds that most companies are still keeping everything.  The longer data is kept, the less value it has to the company and the more risk it poses (e-discovery cost and privacy issues if there is a breach).  Different departments within the company may want different retention rules.  Breaches cost the company in lawsuits and in reputation.  The E.U. requires breach notification within 24 hours.

Having employees tag documents gives low-quality tags (they aren’t lawyers), but retention based on those tags is good enough to satisfy the court.  Need employees to follow the retention policy, so keep it simple.  Some speculate that insurance providers may end up driving info governance by forcing their clients to do it.

The Coalition of Technology Resources for Lawyers found that 56% of legal departments are reporting that they use analytics.  Clustering can help with investigation and determining search terms.  Look at email domain names to cull.  Note that email journaling keeps everything.  Analytics technology has improved, so if you were disappointed in the past you might want to try it again.

How Automated Digital Discovery is Changing eDiscovery as We Know It

I couldn’t attend this one.

Creating Order Out of Chaos: Framing and Taming Data Discovery Challenges in Expedited Matters

This panel started by walking through a (hypothetical?) investigation of a head of operations who left and joined a competitor in violation of a non-compete agreement that was determined to be unenforceable.  Did he transfer company data to the competitor?

Look for evidence that USB devices were used on the company laptop.  Unfortunately, you can’t tell what was copied onto them.  Look for attempts to hide what was done, such as removal of USB insertion data from the current registry (but failing to remove from the registry snapshot).  Look at the WiFi connection history for connections to the competitor’s network.  It is very important to explain the situation to the forensics person and communicate with him/her frequently about what you each have found in order to develop a picture of what actually happened.

If you hire someone from a competitor and there is suspicion that they took data from their previous employer, ambush them and take all their devices before they have a chance to destroy anything.  This will show the judge that you were not complicit.

When investigating someone who quit on bad terms, look for deals with “special terms” or side letter deals — they may be a sign of fraud.  Be careful about any applicable European laws.  Europe says you can’t move the data to the U.S., but the SEC doesn’t care.  Can you use a review tool in the U.S. with the data in Europe?  Officially, no, but it is less bad than moving the data.  Everyone says you can’t produce the data from Europe, but everyone does.

Make sure your agreements are up to date and are written by the attorney that will litigate them.

Just Patch and Pray?

A study by Verizon found that 90% of breaches are caused by employees.  Info governance can reduce risk.  Keeping everything is risky due to e-discovery, risk of breach, and having to explain loss of old data to customers.

Email problems include bad passwords, use of the same password on multiple websites so having one hacked can allow access to others, and getting inside the network (emailed malware).  2-factor authentication is recommended.  Don’t send an email to the SEC with BCC to the client or the client might hit reply-all and say something problematic — instead, email only the SEC and forward a copy to the client later.

Mobile technology can create discovery headaches, needs to be managed/updated/wiped remotely, and can easily be lost.  Encrypt, audit, and apply anti-malware.  BYOD should be limited to enterprise-ready devices.  Avoid insecure WiFi.  Control access to enterprise data.  Secure data in transit.  Ensure that devices get updated/upgraded.

Unaware or non-compliant employees need training.  When training to spot phishing emails, services can test the employees by sending phishing emails that report who clicked on them.

Vendors and third parties that handle enterprise data can be a problem.  Regulators require vendor oversight.  Limit access to necessary systems.  Segregate sensitive data.  Beware of payroll vendors and the possibility of identity theft from the data they hold.  Make sure your cybersecurity insurance policy covers vendors.

Employees want data access from anywhere.  Encrypting email is hard — better to use collaborative workspaces.  Home networks should be protected.  Don’t use the neighbor’s Internet connection.

After having a breach, 39% of companies still don’t form a response plan.  There is no federal data breach notification law, but many states have such laws.  You may need to notify employees, customers, and the attorney general in some specific time frame.  Also notify your insurance company.

Mergers & Acquisitions: Strategy and Execution Concerns

I couldn’t attend this one.

Disclosing Seed Sets and the Illusion of Transparency

There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding.  This article explains why disclosing seed sets may provide far less transparency than people think.

The rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.

Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling).  Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated.  If the requesting party is a competitor, the non-relevant training documents may reveal information that helps them compete.  Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.

Other people are far more qualified to debate the legal and strategic issues than I am.  Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think.  Some reasons why it does not:

  1. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies.  A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).”  Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”?  I doubt it.
  2. You cannot tell whether the seed set is sufficient if you don’t have access to the full document population.  There could be substantial pockets of important documents that are not represented in the seed set–how would you know?  The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on number of relevant documents, not their importance) of the predictions the training set will produce.  The requesting party cannot do that–they have no way of assessing adequacy of the training set other than wild guessing.
  3. You cannot tell whether the seed set is biased just by looking at it.  Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under- or over-represented?  If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance.  Would you notice that all the relevant training documents happen to contain the word “Friday”?  I doubt it.
  4. Suppose you see an important document in the seed set that was correctly tagged as being relevant.  Can you rest assured that similar documents will be produced?  Maybe not.  Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document.  I described how that could happen in this article.  How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
  5. Poor training doesn’t necessarily mean that relevant documents will be missed.  If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important.  Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query.  Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced.  A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough.  If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall.  Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
  6. One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents.  Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population?  Is the requesting party going to complain that there is no smoking gun in the training set?  The producing party can simply respond that there are no smoking guns in the full population.
  7. The seed set may have virtually no impact on the final result.  To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion).  If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training.  In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article).  On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used.  With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions.  This is iterated many times.  So all documents that are reviewed are used as training documents.  The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way.  Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).
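
The argument in point 5 above can be illustrated with a toy calculation.  The Python sketch below is only an illustration under made-up assumptions: ten documents, four of them relevant, and two invented sets of relevance scores (0 to 100) standing in for a well-trained and a poorly trained system.  The function names are hypothetical and do not come from any particular predictive coding product.

```python
def recall_at_cutoff(scores, labels, cutoff):
    """Fraction of relevant documents scoring at or above the cutoff."""
    relevant = sum(labels)
    found = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    return found / relevant

def plan_for_recall(scores, labels, target):
    """Return (cutoff, documents to review) for the highest cutoff that
    reaches the target recall."""
    for cutoff in sorted(set(scores), reverse=True):
        if recall_at_cutoff(scores, labels, cutoff) >= target:
            to_review = sum(1 for s in scores if s >= cutoff)
            return cutoff, to_review
    raise ValueError("target recall cannot be reached")

# 1 = relevant, 0 = non-relevant (invented data for illustration).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
well_trained = [95, 90, 85, 60, 70, 40, 30, 20, 10, 5]
poorly_trained = [80, 30, 55, 20, 90, 70, 60, 40, 10, 5]

print(plan_for_recall(well_trained, labels, 0.75))    # (85, 3)
print(plan_for_recall(poorly_trained, labels, 0.75))  # (30, 7)
```

Both systems reach 75% recall in this toy example.  The poorly trained one needs a lower cutoff, so the producing party reviews seven documents instead of three to find the same three relevant ones.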

Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand.  It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly.  Recall isn’t the whole story–it measures the number of relevant documents found, not their importance.  It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found.  The point is that you should judge the production by analyzing the system’s output, not the training data that was input.
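
As a rough sketch of what measuring recall can look like, the Python below estimates recall from a random sample of the culled documents.  This is one common approach and not necessarily what any given producing party does.  The function name and all of the numbers are hypothetical, and the sketch ignores the statistical uncertainty that comes with any sample.

```python
def estimate_recall(relevant_produced, culled_size, sample_size, relevant_in_sample):
    """Estimate recall from a random sample drawn from the culled documents.

    The share of relevant documents in the sample is scaled up to the whole
    culled set to estimate how many relevant documents were missed.
    """
    estimated_missed = relevant_in_sample * culled_size / sample_size
    return relevant_produced / (relevant_produced + estimated_missed)

# Hypothetical numbers: 9,000 relevant documents produced, 100,000 documents
# culled, and 10 relevant documents found in a random sample of 1,000 culled
# documents.
recall = estimate_recall(9000, 100000, 1000, 10)
print(f"{recall:.1%}")  # 90.0%
```

A small sample gives an estimate with a wide margin of error, so the sample size and how the sample was drawn are reasonable things to ask about.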