Author Archives: Bill Dimm

About Bill Dimm

Dr. Dimm is the founder and CEO of Hot Neuron LLC. He developed the algorithms for predictive coding, conceptual clustering, and near-dupe detection used in the company's Clustify software. He is writing a book that is tentatively titled "Predictive Coding: Theory & Practice." He has over two decades of experience in the development and application of sophisticated mathematical models to solve real-world problems in the fields of theoretical physics, mathematical finance, information retrieval, and e-discovery. Prior to starting Hot Neuron, he did derivative securities research at Banque Nationale de Paris. He has a Ph.D. in theoretical elementary particle physics from Cornell University.

Highlights from the SoCal eDiscovery & IG Retreat 2017

The 2017 SoCal eDiscovery & IG Retreat was held at the Pelican Hill Resort in Newport Coast, California. The format was somewhat different from other recent Ing3nious retreats, having a single session at a time instead of two sessions in parallel. My notes below provide some highlights. I’ve posted my full set of photos from the conference and nearby Crystal Cove here.

How Well Can Your Organization Protect Against Encrypted Traffic Threats?
Companies should be concerned about encrypted traffic, because they don’t know what is leaving their network. Get encryption keys for cloud services the company uses so you can check outgoing content, and block all other encrypted traffic — if something legitimate breaks, employees will let you know. It is important to distinguish personal Dropbox use from corporate use. Make sure you have a policy that says all corporate devices are subject to inspection and monitoring. The CSO should report to the CEO rather than the CIO, or too much ends up being spent on technology with too little spent on risk reduction. Security tech must be kept up to date. Some security vendors are using artificial intelligence. The board of directors needs to be educated about their fiduciary duty to provide oversight for security, as established in a 1996 case in Delaware (see this article). In what country is the backup of your cloud data stored? That could be important during litigation. The amount of unstructured data companies have can be surprising, and represents additional risk. When the CSO reports to the board, he/she should speak in terms of risk (don’t use tech speak). Build in security from the beginning when starting new projects. GDPR violations could bring penalties of up to 4% of revenue. Guidance papers on GDPR are all over 40 pages long. “Internet of Things” devices (e.g., refrigerators) are typically not secure. Use DNS to detect attempts by IoT devices to call out. IoT devices are collecting data about you to sell. The book Future Crimes by Marc Goodman was recommended.

Using Technology To Reduce eDiscovery Spend
Artificial intelligence (AI) can be used before collection to reduce data volume. Have a conversation about what’s really needed and use ECA to cull by date, topic, etc. Process data from key players first. It is important for project managers to know the data. Parse out domain names, see who is talking to whom, see which folders people really have access to, and get rid of bad file types. Image the machine of the person who will be leaving, then tell them you will be imaging the machine in the near future and see what they delete. Use sentiment analysis and see if sentiment changes over time. Use clustering to identify stuff that can be culled (e.g., stuff about the NFL). Use clustering, rather than random sampling, to see what the data looks like. Redaction of things like social security numbers can be automated.

It’s All Greek To Me: Multi-Language Document Review from Shakespeare To FCPA
Examples were given of Craigslist ads seeking temporary people for foreign language document review, showing that companies performing such reviews may not have capable people on staff. Law firms are relying on external providers to manage reviews in languages in which they are not fluent. English in Singapore is not the same as English in the U.S. (different expressions) — cultural context is important. There are 6,900 languages around the world. Law firms must do due diligence to ensure a language expert is trustworthy. Law firms don’t like being beta testers for technologies like TAR and machine translation. Communications in Asia are often not in text file format (e.g., chat applications) and can involve hundreds of thousands of non-standard emojis (how to even render them?). Facebook got a Palestinian man arrested by mistranslating his “good morning” as “attack them” (see this article). One speaker suggested Googling “fraudulent foreign language reviewers” (the top match is here). There was skepticism about the ALTA language proficiency test.

Artificial Intelligence – Facial Expression Analytics As A Competitive Advantage In Risk Mitigation
Monitoring emotional response can provide an advantage at trial.  Universal emotions: joy, sadness, surprise, fear, anger, disgust, and contempt.  The lawyer should avoid causing sadness since that is detrimental to being liked — let the witness do it.  Emotional response can depend on demographics.  For example, the contempt response depends on age, and women tend to show a larger fear response.  Software can now detect emotion from facial photos very quickly.  One panelist warned against using the iPhone X’s authentication via face recognition because Apple has software for detecting emotion and could monitor your mood.  80% of what a jury picks up on is non-verbal.  Analyze video of depositions to look for ways to improve.  Senior people at companies refuse to believe they don’t come across well, but they often show signs of contempt at questions they deem to be petty.  There is no facial expression for deception — look for a shift in expression.  Realize that software may not be making decisions in the same way as a human would.  For example, a neural network that did a good job of distinguishing wolves from dogs was actually making the decision based on the presence or absence of snow in the background.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Bridging The Gap Between Inside And Outside Counsel: Next Generation Strategies For Collaborating On Complex Litigation Matters
Communicate about what is actually needed, or outside counsel may collect everything regardless of date or custodian, resulting in high costs for hosting. Insourcing is a trend — the company keeps the data in house (reducing cost and risk) and provides outside counsel with access. This means imposing technology on the outside counsel. One benefit of insourcing is that in-house counsel learns about the data, which may help with future cases. Another trend is disaggregation, where legal tasks are split up among different law firms instead of using a single firm for everything. It is important to ensure that technologies being used are understood by all parties from the start to avoid big problems later. Paralegals can be good at keeping communication flowing between the outside attorney and the client. Tech companies that want people to adopt their products need to help outside counsel explain the benefits to clients.

Cyber And Data Security For The GC: How To Stay Out Of Headlines And Crosshairs
I couldn’t attend this panel because I had to catch my flight.


Highlights from the NorCal IG Retreat 2017

The 2017 NorCal Information Governance Retreat was held by Ing3nious at the Quail Lodge & Golf Club in Carmel Valley, California. After round table discussions, the retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend. I’ve posted additional photos here.

The intro to the round table discussions included some comments on the evolution of the Internet, the importance of searching for obscenities to find critical documents or to identify data that has been scrubbed (it is implausible that there are no emails containing obscenities for a failing project), the difficulty of searching for “IT” (meaning information technology rather than the pronoun), and the inability of many tools to search for emojis.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

How Well Can Your Organization Protect Against Encrypted Traffic Threats?
I couldn’t attend this one.

IG Analytics And Infonomics: The Future Is Now
I couldn’t attend this one.

Breaches Happen. Going On The Cyber Offense With Deception
Breach stories that were mentioned included Equifax, Target, an employee who built their own (insecure) tunnel to get data out to their home, and an employee who carried data out on a microSD card. In the RSA / Lockheed Martin breach, a Lockheed contractor was fooled by a phishing email, illustrating how hard it is to keep attackers out. Email is a very common source of breaches. A big mistake is not knowing that you’ve been breached. People put honeypots outside the firewall to detect attacks. It’s better to use deception technology, which puts decoys inside the firewall.

Social Media And Website Information Governance
There has been some regulation of social media, especially for certain industries. The SEC in 2012 required financial institutions to archive social media. The FTC has been enforcing paid endorsement disclosure guidelines (e.g., Kim Kardashian’s endorsement of a morning sickness drug). Collecting evidence from social media is tricky. A screenshot could be photoshopped, so how do you prove it is legitimate? You should collect a screenshot, source code, metadata, and a digital signature with a timestamp. Corporate policy on social media use will depend on the kind of company and the industry it is in. There should also be a policy on monitoring employees’ social media use. Companies using an internal social media system are asking for problems. How will they police/discipline improper usage? If an employee posts “Why haven’t I seen John lately?” and another replies that John has cancer, you have a problem. Does a company social media system really improve productivity? Can you find out who posted something anonymously on public social media? If they posted from Starbucks or a library, probably not (finding the IP address won’t reveal the person’s identity). This strategy worked for a bad review of a doctor that was thought to be from another doctor: 1) file in Federal court and get a court order to get the user’s IP address from the social media website, 2) go back to the judge and get a court order to get the ISP to give the identity of the person using that IP address at that time, 3) if there is a motion to quash, that confirms the right person was found (otherwise they wouldn’t bother to fight it).

Bridging The Gap Between Inside And Outside Counsel: Next Generation Strategies For Collaborating On Complex Litigation Matters
I couldn’t attend this one.

Preventing Inadvertent Disclosure In A Multi-Language World
Start by identifying teams and process.  Be aware of cultural differences.  Be aware of technological issues — there are 2 or 3 alternatives to MS Word that you might encounter for documents in Korean.  Be aware of laws against removing certain documents from the country.  There was disagreement among panel members about whether review quality of foreign documents was better in the U.S. due to reviewers better understanding U.S. law.  Viewing a document in the U.S. that is stored on a server in the E.U. is not a valid work-around for restrictions on exporting the documents.  Review in the U.S. is much cheaper than reviewing overseas (about 1/5 to 1/10 of the cost).  Violation of GDPR puts 4% of revenue at risk, but a U.S. judge may not care.  Take only what you need out of the country.  Many tools work best when they are analyzing documents in a single language, so use language identification and separate documents by language before analysis.  TAR may not work as well for non-English documents, but it does work.
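
The last point about separating languages before analysis can be made concrete. Below is a minimal sketch, assuming the Python langdetect package (any language identifier would do); the sample texts are hypothetical.

```python
# A minimal sketch: bucket documents by detected language so each
# bucket can be analyzed (or run through TAR) separately.
# Assumes the langdetect package (pip install langdetect).
from collections import defaultdict

from langdetect import detect


def split_by_language(docs):
    """Group document texts by detected ISO 639-1 language code."""
    groups = defaultdict(list)
    for doc in docs:
        try:
            groups[detect(doc)].append(doc)
        except Exception:
            # Detection can fail on empty or non-linguistic text.
            groups["unknown"].append(doc)
    return groups


if __name__ == "__main__":
    docs = ["The quarterly report is attached.", "보고서를 첨부합니다."]
    for lang, group in split_by_language(docs).items():
        print(lang, len(group))
```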

What’s Your Trust Point?
I couldn’t attend this one.

Legal Tech And AI – Inventing The Future
Humans are better than computers at handling low-probability outlier events, because there is a lack of training data to teach machines to handle such things.  It is important for the technology to be easy for the user to interact with.  Legal clients are very cost averse, so a free trial of new tech is attractive.

The Cloud, New Technologies And Other Developments In Trade Secret Theft
I couldn’t attend this one.

Are You Prepared For The Impact Of Changing EU Data Privacy On U.S. Litigation?
I couldn’t attend this one.

IG Policy Pain Points In E-Discovery
Deletion of data that is not on hold 60 days after an employee leaves the company may not get everything, since other custodians may have copies. You may find that employees have archived their emails on a local hard drive. Be clear about data ownership — wiping the phone of an employee who left the company may hit their personal data. The general counsel is often not involved in decisions like BYOD (treated as an IT decision), but they should be. Realize that having more data about employee behavior (e.g., GPS tracking) makes the company more responsible. You rarely need the employee’s phone since there is little data cached there (data is on mail servers, etc.). You should do info governance compliance testing to ensure that employees are following the procedures. Policies must be realistic — there won’t be perfect separation of work and personal activity. Flouted rules may be worse than no rules. Keep personal data separate (personal folder, personal email address, use the phone for accessing Facebook). When doing an annual cleanup, what about the data from the employee who left the company? A study showed that 85% of stored data is ROT (redundant, obsolete, or trivial). Have a checklist that you follow when an employee leaves — don’t wipe the computer without copying stuff you may need.

Highlights from DESI VII / ICAIL 2017

DESI (Discovery of Electronically Stored Information) is a one-day workshop within ICAIL (International Conference on Artificial Intelligence and Law), which is held every other year.  The conference was held in London last month.  Rumor has it that the next ICAIL will be in North America, perhaps Montreal.

I’m not going to go into the DESI talks based on papers and slides that are posted on the DESI VII website since you can read that content directly. The workshop opened with a keynote by Maura Grossman and Gordon Cormack where they talked about the history of TREC tracks that are relevant to e-discovery (Spam, Legal, and Total Recall), the limitation on the recall that can be achieved due to ambiguous relevance (reviewer disagreement) for some documents, and the need for high recall when it comes to identifying privileged documents or documents where privacy must be protected. When looking for privileged documents it is important to note that many tools don’t make use of metadata. Documents that are missed may be technically relevant but not really important — you should look at a sample to see whether they are important.

Between presentations based on submitted papers there was a lunch where people separated into four groups to discuss specific topics. The first group focused on e-discovery users. Visualizations were deemed “nice to look at” but not always useful — does the visualization help you to answer a question faster? Another group talked about how to improve e-discovery, including attorney aversion to algorithms and whether a substantial number of documents could be missed by CAL after the gain curve had plateaued. Another group discussed dreams about future technologies, like better case assessment and redacting video. The fourth group talked about GDPR and speculated that the UK would obey GDPR.

DESI ended with a panel discussion about future directions for e-discovery.  It was suggested that a government or consumer group should evaluate TAR systems.  Apparently, NIST doesn’t want to do it because it is too political.  One person pointed out that consumers aren’t really demanding it.  It’s not just a matter of optimizing recall and precision — process (quality control and workflow) matters, which makes comparisons hard.  It was claimed that defense attorneys were motivated to lobby against the federal rules encouraging the use of TAR because they don’t want incriminating things to be found.  People working in archiving are more enthusiastic about TAR.

Following DESI (and other workshops conducted in parallel on the first day), ICAIL had three more days of paper presentations followed by another day of workshops. You can find the schedule here. I only attended the first day of non-DESI presentations. There are two papers from that day that I want to point out. The first is Effectiveness Results for Popular e-Discovery Algorithms by Yang, David Grossman, Frieder, and Yurchak. They compared performance of the CAL (relevance feedback) approach to TAR for several different classification algorithms, feature types, feature weightings, and with/without LSI. They used several different performance metrics, though they missed the one I think is most relevant for e-discovery (review effort required to achieve an acceptable level of recall). Still, it is interesting to see such an exhaustive comparison of algorithms used in TAR / predictive coding. They’ve made their code available here. The second paper is Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes by Conrad and Al-Kofahi. The authors analyze a large database of jury verdicts in an effort to determine the feasibility of building a system to give strategic litigation advice (e.g., potential award size, trial duration, and suggested claims) based on a data-driven analysis of the case.

Highlights from the Northeast IG Retreat 2017

The 2017 Northeast Information Governance Retreat was held at the Salamander Resort & Spa in Middleburg, Virginia. After round table discussions, the retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend.

Enhancing eDiscovery With Next Generation Litigation Management Software
I couldn’t attend this one.

Legal Tech and AI – Inventing The Future
Machines are currently only good at routine tasks. Interactions with machines should allow humans and machines to do what they each do best. Some areas where AI can aid lawyers: determining how long litigation will take, suggesting cases you should reference, telling how often the opposition has won in the past, determining appropriate prices for fixed-fee arrangements, recruiting, and determining which industry to focus on. AI promises to help with managing data (e.g., targeted deletion), not just e-discovery. Facial recognition may replace plane tickets someday.

Zen & The Art Of Multi-Language Discovery: Risks, Review & Translation
I couldn’t attend this one.

NexLP Demo
The NexLP tool emphasizes feature extraction and use of domain knowledge from external sources to figure out the story behind the data. It can generate alerts based on changes in employee behavior over time. Companies should have a policy allowing the scanning of emails to detect bad behavior. It was claimed that using AI on emails is better for privacy than having a human review random emails since it keeps human eyes away from emails that are not relevant.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Are Managed Services Manageable?
I couldn’t attend this one.

Cyber And Data Security For The GC: How To Stay Out Of Headlines And Crosshairs
I couldn’t attend this one.

The Office Is Out: Preservation And Collection In The Merry Old Land Of Office 365
Enterprise 5 (E5) has advanced analytics from Equivio. E3 and E1 can do legal hold but don’t have advanced analytics. There are options available that are not on the website, and there are different builds — people are not all using the same thing. Search functionality works on limited file types (e.g., Microsoft products). Email attachments are OK if they are from Microsoft products. It will not OCR PDFs that lack embedded text. What about emails attached to emails? Previously, it only went one layer deep on attachments. The latest versions say they are “relaxing” that, but it is unclear what that means (how deep?). Users control sync — are we really searching everything? Make sure you involve IT, privacy, info governance, etc. if considering a transition to 365. Be aware of data that is already on hold if you migrate to 365. Start by migrating a small group of people who are not often subject to litigation. Test each data type after conversion.

How To Make Sense Of Information Governance Rules For Contractors When The Government Itself Can’t?
I couldn’t attend this one.

Judges, The Law And Guidance: Does ‘Reasonableness’ Provide Clarity?
This was primarily about the impact of the new Federal Rules of Civil Procedure. Clients are finally giving up on putting everything on hold. Tie document retention to business needs and you shouldn’t have to worry about sanctions. Document everything (e.g., why you chose specific custodians to hold). Accidentally missing one custodian out of a hundred is now OK. Some judges acknowledge the new rules but then ignore them. Boilerplate objections to discovery requests need to stop — keep notes on why you made each objection.

Beyond The Firewall: Cybersecurity & The Human Factor
I couldn’t attend this one.

The Theory of Relativity: Is There A Black Hole In Electronic Discovery?
The good about Relativity: everyone knows it, it has plug-ins, and moving from document to document is fast compared to previous tools.  The bad: TAR 1.0 (federal judiciary prefers CAL).  An audience member expressed concern that as Relativity gets close to having a monopoly we should expect high prices and a lack of innovation.  Relativity One puts kCura in competition with service providers.

The day ended with a wine social.

Substantial Reduction in Review Effort Required to Demonstrate Adequate Recall

Measuring the recall achieved to within +/- 5% to demonstrate that a production is defensible can require reviewing a substantial number of random documents.  For a case of modest size, the amount of review required to measure recall can be larger than the amount of review required to actually find the responsive documents with predictive coding.  This article describes a new method requiring much less document review to demonstrate that adequate recall has been achieved.  This is a brief overview of a more detailed paper I’ll be presenting at the DESI VII Workshop on June 12th (slides available here).

The proportion of a population having some property can be estimated to within +/- 5% by measuring the proportion on a random sample of 400 documents (you’ll also see the number 385 being used, but using 400 will make it easier to follow the examples). To measure recall we need to know what proportion of responsive documents are produced, so we need a sample of 400 random responsive documents. Since we don’t know which documents in the population are responsive, we have to select documents randomly and review them until 400 responsive ones are found. If prevalence is 10% (10% of the population is responsive), that means reviewing roughly 4,000 documents to find 400 that are responsive so that recall can be estimated. If prevalence is 1%, it means reviewing roughly 40,000 random documents to measure recall. This can be quite a burden.
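
For readers who want to check the arithmetic, here is a minimal sketch (in Python) of the worst-case 95% margin of error for a sample of 400, and the expected review effort at the two prevalence levels mentioned above:

```python
# Worst-case 95% margin of error for a proportion estimated from a
# sample of n = 400, and the expected number of random documents that
# must be reviewed to find 400 responsive ones at a given prevalence.
import math

n = 400
margin = 1.96 * math.sqrt(0.5 * 0.5 / n)  # p = 0.5 maximizes the margin
print(f"95% margin of error: +/-{margin:.1%}")  # roughly +/-4.9%, i.e. ~5%

for prevalence in (0.10, 0.01):
    expected_review = n / prevalence
    print(f"prevalence {prevalence:.0%}: review ~{expected_review:,.0f} documents")
```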

Once recall is measured, a decision must be made about whether it is high enough. Suppose you decide that if at least 300 of the 400 random responsive documents were produced (75%) the production is acceptable. For any actual level of recall, the probability of accepting the production can be computed (see figure to right). The probability of accepting a production where the actual recall is less than 70% will be very low, and the probability of rejecting a production where the actual recall is greater than 80% will also be low — this comes from the fact that a sample of 400 responsive documents is sufficient to measure recall to within +/- 5%.
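
The curve in that figure is just a binomial tail probability. Here is a minimal sketch of the computation, assuming SciPy is available; the recall values in the loop are arbitrary examples:

```python
# Probability of accepting the production under the single-stage test:
# accept if at least 300 of 400 sampled responsive documents were
# produced.  The produced count is Binomial(400, actual_recall).
from scipy.stats import binom

def acceptance_probability(actual_recall, n=400, accept_at=300):
    # P(X >= accept_at) is the survival function evaluated at accept_at - 1
    return binom.sf(accept_at - 1, n, actual_recall)

for recall in (0.65, 0.70, 0.75, 0.80, 0.85):
    print(f"actual recall {recall:.0%}: P(accept) = {acceptance_probability(recall):.3f}")
```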

The idea behind the new method is to achieve the same probability profile for accepting/rejecting a production using a multi-stage acceptance test. The multi-stage test gives the possibility of stopping the process and declaring the production accepted/rejected long before reviewing 400 random responsive documents. The procedure is shown in the flowchart to the right (click to enlarge). A decision may be reached after reviewing enough documents to find just 25 random documents that are responsive. If a decision isn’t made after reviewing 25 responsive documents, review continues until 50 responsive documents are found and another test is applied. At worst, documents will be reviewed until 400 responsive documents are found (the same as the traditional direct recall estimation method).
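
To make the mechanics concrete, here is a small simulation of a multi-stage test of this shape. The intermediate checkpoint cutoffs below are hypothetical placeholders, not the values from the flowchart (those come from the tables in the full paper); only the final 300-of-400 stage matches the example above. The simulation also estimates the average review effort, which is the subject of the figures that follow.

```python
# Simulate a multi-stage acceptance test.  Each random responsive
# document is "produced" with probability equal to the actual recall.
# NOTE: the intermediate boundaries below are hypothetical placeholders;
# the real cutoffs come from the tables in the full DESI VII paper.
import random

# {responsive docs reviewed: (reject if produced <=, accept if produced >=)}
HYPOTHETICAL_BOUNDARIES = {
    25: (13, 24),
    50: (30, 45),
    100: (65, 86),
    200: (137, 165),
    400: (299, 300),  # final stage matches the 300-of-400 example
}

def multistage_test(actual_recall, boundaries=HYPOTHETICAL_BOUNDARIES):
    """Return ('accept' or 'reject', responsive documents reviewed)."""
    produced = reviewed = 0
    for checkpoint in sorted(boundaries):
        while reviewed < checkpoint:
            produced += random.random() < actual_recall
            reviewed += 1
        reject_at, accept_at = boundaries[checkpoint]
        if produced >= accept_at:
            return "accept", reviewed
        if produced <= reject_at:
            return "reject", reviewed
    return "reject", reviewed  # unreachable: final boundaries are adjacent

# Monte Carlo estimate of the acceptance rate and average review effort
# (counted in responsive documents) when the actual recall is 85%.
trials = [multistage_test(0.85) for _ in range(10_000)]
accepts = sum(decision == "accept" for decision, _ in trials)
effort = sum(n for _, n in trials) / len(trials)
print(f"accept rate: {accepts / len(trials):.3f}")
print(f"average responsive documents reviewed: {effort:.0f}")
```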

The figure to the right shows six examples of the multi-stage acceptance test being applied when the actual recall is 85%. Since 85% is well above the 80% upper bound of the 75% +/- 5% range, we expect this production to virtually always be accepted. The figure shows that acceptance can occur long before reviewing a full 400 random responsive documents. The number of random responsive documents reviewed is shown on the vertical axis. Toward the bottom of the graph the sample is very small and the percentage of the sample that has been produced may deviate greatly from the right answer of 85%. As you go up the sample gets larger and the proportion of the sample that is produced is expected to get closer to 85%. When a green decision boundary is touched, causing the production to be accepted as having sufficiently high recall, the color of the remainder of the path is changed to yellow — the yellow part represents the document review that is avoided by using the multi-stage acceptance method (since the traditional direct recall measurement would involve going all the way to 400 responsive documents). As you can see, when the actual recall is 85% the number of random responsive documents that must be reviewed is often 50 or 100, not 400.

The figure to the right shows the average number of documents that must be reviewed using the multi-stage acceptance procedure from the earlier flowchart. The amount of review required can be much less than 400 random responsive documents. In fact, the further above/below the 75% target (called the “splitting recall” in the paper) the actual recall is, the less document review is required (on average) to come to a conclusion about whether the production’s recall is high enough. This creates an incentive for the producing party to aim for recall that is well above the minimum acceptable level since it will be rewarded with a reduced amount of document review to confirm the result is adequate.

It is important to note that the multi-stage procedure provides an accept/reject result, not a recall estimate.  If you follow the procedure until an accept/reject boundary is hit and then use the proportion of the sample that was produced as a recall estimate, that estimate will be biased (the use of “unbiased” in the paper title refers to the sampling being done on the full population, not on a subset [such as the discard set] that would cause a bias due to inconsistency in review of different subsets).

You may want to use a splitting recall other than 75% for the accept/reject decision — the full paper provides tables of values necessary for doing that.

Highlights from the South Central IG Retreat 2017

The 2017 South Central Information Governance Retreat, held at the La Cantera Resort & Spa, was the first retreat in the Ing3nious series to take place in Texas. The retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend.

The day started with roundtable discussions that were kicked off by a speaker who talked about the early days of the Internet. He made the point that new lawyers may know less about how computers actually work, even though they were born in an era when computers are more pervasive. He mentioned that one of the first keyword searches he performs when he receives a production is for “f*ck.” If a company was having problems with a product and there isn’t a single email using that word, something was surely withheld from the production. He also made the point that expert systems that are intended to replace lawyers must be based on how the experts (lawyers) actually think. How do you identify the 50 documents that will actually be used at trial?

Borrowing Agile Development Concepts To Jump-Start Your Information Governance Program
I couldn’t attend this one.

Your Duty To Preserve: Avoiding Traps In Troubled Times
When storing data in the cloud, what is actually retained?  How can you get the data out?  Google Vault only indexes newly added emails, not old ones.  The company may not have the right to access employee data in the cloud.  One panelist commented that collection is preferred to preservation in place.

Enhancing eDiscovery With Next Generation Litigation Management Software
I couldn’t attend this one.

Leveraging The Cloud & Technology To Accelerate Your eDiscovery Process
Cloud computing seems to have reached an inflection point.  A company cannot put the resources into security and data protection that Amazon can.  The ability to scale up/down is good for litigation that comes and goes.  Employees can jump into cloud services without the preparation that was required for doing things on site.  Getting data out can be hard.  Office 365 download speed can be a problem (2-3 GB/hr) — reduce data as much as possible.
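
A rough back-of-the-envelope calculation shows why that download speed matters; the 1 TB collection size below is a hypothetical example, not a figure from the session:

```python
# Rough time to export data from Office 365 at the quoted 2-3 GB/hr.
collection_gb = 1000  # hypothetical 1 TB collection
for rate_gb_per_hr in (2, 3):
    hours = collection_gb / rate_gb_per_hr
    print(f"{rate_gb_per_hr} GB/hr: {hours:.0f} hours (~{hours / 24:.0f} days)")
```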

Strategies For Effectively Managing Your eDiscovery Spend
I couldn’t attend this one.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Achieving GDPR Compliance For Unstructured Content
I couldn’t attend this one.

Zen & The Art Of Multi-Language Discovery: Risks, Review & Translation
The translation company should be brought in when the team is formed (it often isn’t done until later). Help may be needed from a translator or localization expert to come up with search terms. For example, there are 20 ways to say “CEO” in Korean. Translation must be done by an expert to be certified. When using TAR, do review in the native language and translate the result before presenting it to the legal team. Translation is much slower than review. Machine translation has improved over the last 2 years, but it’s not good enough to rely on for anything important. A translator leaked Toyota’s data to the press — keep the risk in mind and make sure you are informed about the environment where the work is being done (screenshots should be prohibited).

Beyond The Firewall: Cybersecurity & The Human Factor
I couldn’t attend this one.

Ethical Obligations Relating To Metadata
Nineteen states have enacted ethical rules on metadata. Sometimes, metadata is enough to tell the whole story. John McAfee was found and arrested because of GPS coordinates embedded in a photo of him. Metadata showed that a terminated whistleblower’s employee review was written 3 months after termination. Forensic collection is important so as not to spoil the metadata. Ethical obligations of attorneys are broader than attorney-client privilege. Should attorneys be encrypting email? Make the client aware of metadata and how it can be viewed. The attorney must understand metadata and scrub it as necessary (e.g., change tracking in Word). In e-discovery, metadata is treated like other ESI. Think about metadata when creating a protective order. What are the ethical restrictions on viewing and mining metadata received through discovery? Whether you need to disclose receipt of confidential or privileged metadata depends on the jurisdiction.

Legal Risks Associated With Failing To Have A Cyber Incident Response Plan
I couldn’t attend this one.

“Defensible Deletion” Is The Wrong Frame
Defensible deletion started with an IBM survey that found that, on average, 69% of corporate data has no value, 6% is subject to litigation hold, and 25% is useful. IBM started offering to remove 45% of data without doing any harm to a company (otherwise, you don’t have to pay). Purging requires effort, so make deletion the default. Statistical sampling can be used to confirm that retention rules won’t cause harm, as in the sketch below. After a company said that requested data wasn’t available because it had been deleted in accordance with the retention policy, an employee who was being deposed said he had copied everything to 35 CDs — it can be hard to ensure that everything is gone even if you have the right policy.
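
As a sketch of what that confirmation sampling might look like (the sample size and review counts are hypothetical, not from the session):

```python
# Estimate, with a 95% confidence interval, the fraction of documents
# slated for deletion that a reviewer would actually judge useful.
import math

sampled = 400      # documents randomly sampled from the deletion set
found_useful = 8   # hypothetical count a reviewer judged worth keeping

p = found_useful / sampled
margin = 1.96 * math.sqrt(p * (1 - p) / sampled)
print(f"useful fraction: {p:.1%} +/- {margin:.1%}")
```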


Highlights from Ipro Innovations 2017

The 16th annual Ipro Innovations conference was held at the Talking Stick Resort. It was a well-organized conference with over 500 attendees, lots of good food and swag, and over two days’ worth of content. Sometimes, everyone attended the same presentation in a large hall. Other times, there were seven simultaneous breakout sessions. My notes below cover only the small subset of the presentations that I was able to attend. I visited the Ipro office on the final day. It’s an impressive, modern office with lots of character. If you are wondering whether the Ipro people have a sense of humor, you need look no farther than the signs for the restrooms.

The conference started with a summary of recent changes to the Ipro software line-up, how it enables a much smaller team to manage large projects, and stats on the growing customer base. They announced that Clustify will soon replace Content Analyst as their analytics engine. In the first phase, both engines will be available and will be implemented similarly, so the user can choose which one to use. Later phases will make more of Clustify’s unique functionality available. They announced an investment by ParkerGale Capital. Operations will largely remain unchanged, but there may be some acquisitions. The first evening ended with a party at Top Golf.

Ari Kaplan gave a presentation entitled “The Opportunity Maker,” where he told numerous entertaining stories about business problems and how to find opportunities. He explained that doing things that nobody else does can create opportunities. He contacts strangers from his law school on LinkedIn and asks them to meet for coffee when he travels to their town — many accept because “nobody does that.” He sends postcards to his clients when traveling, and they actually keep them. To illustrate the value of putting yourself into the path of opportunity, he described how he got to see the Mets in the World Series. He mentioned HelpAReporter.com as a way to get exposure for yourself as an expert.

One of the tracks during the breakout sessions was run by The Sedona Conference and offered CLE credits. One of the TSC presentations was “Understanding the Science & Math Behind TAR” by Maura Grossman. She covered the basics like TAR 1.0 vs. 2.0, human review achieving roughly 70% recall due to mistakes, and how TAR performs compared to keyword search. She mentioned that control sets can become stale because the reviewer’s concept of relevance may shift during the review. People tend to get pickier about relevance as the review progresses, so an estimate of the number of relevant docs taken on a control set at the beginning may be too high. She also warned that making multiple measurements against the control set can give a biased estimate about when a certain level of performance is achieved (sidenote: this is because people watch for a measure like F1 to cross a threshold to determine training completeness, which is not the best way to use a control set). She mentioned that she and Cormack have a new paper coming out that compares human review to TAR using better-reviewed data (Tim Kaine’s emails) that addresses some criticisms of their earlier JOLT study.
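
That warning about repeated measurements is a form of sequential-testing bias. The toy simulation below (all numbers hypothetical, not from her talk) shows how watching a noisy control-set estimate for a threshold crossing tends to stop training before true performance actually reaches the threshold:

```python
# Toy simulation of repeated-measurement bias on a fixed control set:
# noisy estimates cross the threshold early, so training often stops
# before true performance reaches it.  The control-set size, threshold,
# and learning curve are all hypothetical.
import random

def noisy_estimate(true_value, control_size):
    """Binomial-style noise from measuring on a finite control set."""
    hits = sum(random.random() < true_value for _ in range(control_size))
    return hits / control_size

threshold, control_size, runs = 0.75, 100, 10_000
stopped_early = 0
for _ in range(runs):
    for step in range(31):
        true_perf = 0.50 + 0.01 * step  # hypothetical learning curve
        if noisy_estimate(true_perf, control_size) >= threshold:
            stopped_early += true_perf < threshold
            break
print("fraction of runs stopping before true performance hit 75%:",
      stopped_early / runs)
```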

There were also breakout sessions where attendees could use the Ipro software with guidance from the staff in a room full of computers.  I attended a session on ECA/EDA.  One interesting feature that was demonstrated was checking the number of documents matching a keyword search that did not match any of the other searches performed — if the number is large, it may not be a very good search query.
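
Conceptually, that check is just a set difference. Here is a minimal sketch with hypothetical searches and document IDs (in the demo this was a built-in platform feature, not user code):

```python
# Count documents that match one search but none of the others -- a
# large unique-hit count may mean the query is too broad.
def unique_hits(name, searches):
    """IDs matching the named search but no other search."""
    others = set().union(*(hits for n, hits in searches.items() if n != name))
    return searches[name] - others

searches = {
    "fraud": {1, 2, 3, 9, 12},
    "kickback": {2, 3},
    "write-off": {3, 4, 9},
}
for name in searches:
    print(name, "unique hits:", len(unique_hits(name, searches)))
```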

Another TSC session I attended was by Brady, Grossman, and Shonka on responding to government and internal investigations. Often (maybe 20% of the time) the government is inquiring because you are a source of information, not the target of the investigation, so it may be unwise to raise suspicion by resisting the request. There is nothing similar to the Federal Rules of Civil Procedure for investigations. The scope of an investigation can be much broader than civil discovery. There is nothing like Rule 502 (protecting privilege) for investigations. The federal government is pretty open to the use of TAR (they don’t want to receive a document dump), though the DOJ may want transparency. There may be questions about how some data types (like text messages) were handled. State agencies can be more difficult.

The last session I attended was the analytics roundtable, where Ipro employees asked the audience questions about how they were using the software and solicited suggestions for how it could be improved. The day ended with the Salsa Challenge (as in food, not dancing) and dinner. I wasn’t able to attend the presentations on the final day, but the schedule looked interesting.