Tag Archives: TAR

Highlights from the South Central IG Retreat 2017

The 2017 South Central Information Governance Retreat was the first retreat in the Ing3nious series held in Texas at the La Cantera Resort & Spa.  The retreat featured two simultaneous sessions throughout the day.  My notes below provide some highlights from the sessions I was able to attend.

The day started with roundtable discussions that were kicked off by a speaker who talked about the early days of the Internet.  He made the point that new lawyers may know less about how computers actually work even though they were born in an era when computers are more pervasive.  He mentioned that one of the first keyword searches he performs when he receives a production is for “f*ck.”  If a company was having problems with a product and there isn’t a single email using that word, something was surely withheld from the production.  He made the point that expert systems that are intended to replace lawyers must be based on how the experts (lawyers) actually think.  How do you identify the 50 documents that will actually be used at trial?

Borrowing Agile Development Concepts To Jump-Start Your Information Governance Program
I couldn’t attend this one.

Your Duty To Preserve: Avoiding Traps In Troubled Times
When storing data in the cloud, what is actually retained?  How can you get the data out?  Google Vault only indexes newly added emails, not old ones.  The company may not have the right to access employee data in the cloud.  One panelist commented that collection is preferred to preservation in place.

Enhancing eDiscovery With Next Generation Litigation Management Software
I couldn’t attend this one.

Leveraging The Cloud & Technology To Accelerate Your eDiscovery Process
Cloud computing seems to have reached an inflection point.  A company cannot put the resources into security and data protection that Amazon can.  The ability to scale up/down is good for litigation that comes and goes.  Employees can jump into cloud services without the preparation that was required for doing things on site.  Getting data out can be hard.  Office 365 download speed can be a problem (2-3 GB/hr) — reduce data as much as possible.
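
To put those download speeds in perspective, here is a quick back-of-the-envelope calculation of export times at the quoted throughput; the collection sizes are made-up examples, not figures from the panel:

```python
# Rough export-time estimates at the 2-3 GB/hr Office 365 download speeds
# mentioned above; the collection sizes are illustrative assumptions.
for total_gb in (100, 500, 2000):
    for rate_gb_per_hr in (2, 3):
        hours = total_gb / rate_gb_per_hr
        print(f"{total_gb:>5} GB at {rate_gb_per_hr} GB/hr: {hours:6.0f} hours (~{hours / 24:.1f} days)")
```

Even a modest collection takes days to pull down at those speeds, which is why culling before export matters.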

Strategies For Effectively Managing Your eDiscovery Spend
I couldn’t attend this one.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Achieving GDPR Compliance For Unstructured Content
I couldn’t attend this one.

Zen & The Art Of Multi-Language Discovery: Risks, Review & Translation
The translation company should be brought in when the team is formed (it often isn’t done until later).  Help may be needed from a translator or localization expert to come up with search terms.  For example, there are 20 ways to say “CEO” in Korean.  Translation must be done by an expert to be certified.  When using TAR, do review in the native language and translate the result before presenting to the legal team.  Translation is much slower than review.  Machine translation has improved over the last 2 years, but it’s not good enough to rely on for anything important.  A translator leaked Toyota’s data to the press — keep the risk in mind and make sure you are informed about the environment where the work is being done (screenshots should be prohibited).

Beyond The Firewall: Cybersecurity & The Human Factor
I couldn’t attend this one.

Ethical Obligations Relating To Metadata
Nineteen states have enacted ethical rules on metadata.  Sometimes, metadata is enough to tell the whole story.  John McAfee was found and arrested because of GPS coordinates embedded in a photo of him.  Metadata showed that a terminated whistleblower’s employee review was written 3 months after termination.  Forensic collection is important to avoid spoiling the metadata.  Ethical obligations of attorneys are broader than attorney-client privilege.  Should attorneys be encrypting email?  Make the client aware of metadata and how it can be viewed.  The attorney must understand metadata and scrub it as necessary (e.g., change tracking in Word).  In e-discovery, metadata is treated like other ESI.  Think about metadata when creating a protective order.  What are the ethical restrictions on viewing and mining metadata received through discovery?  Whether you need to disclose receipt of confidential or privileged metadata depends on the jurisdiction.
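
As a concrete illustration of how much a single file’s metadata can reveal (and how it can be scrubbed), here is a minimal sketch of reading EXIF data, including embedded GPS coordinates, from a photo.  It assumes the Pillow library (a recent version, for get_ifd) and uses hypothetical filenames:

```python
# Minimal sketch: inspect and strip EXIF metadata from a photo (assumes Pillow).
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

img = Image.open("photo.jpg")               # hypothetical input file
exif = img.getexif()
for tag_id, value in exif.items():
    print(TAGS.get(tag_id, tag_id), value)  # camera model, timestamps, software, etc.

gps = exif.get_ifd(0x8825)                  # GPSInfo IFD, if present
for tag_id, value in gps.items():
    print(GPSTAGS.get(tag_id, tag_id), value)

# Scrubbing: re-save only the pixel data so the EXIF block is not carried along.
clean = Image.new(img.mode, img.size)
clean.putdata(list(img.getdata()))
clean.save("photo_scrubbed.jpg")
```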

Legal Risks Associated With Failing To Have A Cyber Incident Response Plan
I couldn’t attend this one.

“Defensible Deletion” Is The Wrong Frame
Defensible deletion started with an IBM survey that found that on average 69% of corporate data has no value, 6% is subject to litigation hold, and 25% is useful.  IBM started offering to remove 45% of data without doing any harm to a company (otherwise, you don’t have to pay).  Purging requires effort, so make deletion the default.  Statistical sampling can be used to confirm that retention rules won’t cause harm.  After a company said that requested data wasn’t available because it had been deleted in accordance with the retention policy, an employee who was being deposed said he had copied everything to 35 CDs — it can be hard to ensure that everything is gone even if you have the right policy.

 

Highlights from Ipro Innovations 2017

The 16th annual Ipro Innovations conference was held at the Talking Stick Resort.  It was a well-organized conference with over 500 attendees, lots of good food and swag, and over two days’ worth of content.  Sometimes, everyone attended the same presentation in a large hall.  Other times, there were seven simultaneous breakout sessions.  My notes below cover only the small subset of the presentations that I was able to attend.  I visited the Ipro office on the final day.  It’s an impressive, modern office with lots of character.  If you are wondering whether the Ipro people have a sense of humor, you need look no farther than the signs for the restrooms.

The conference started with a summary of recent changes to the Ipro software line-up, how it enables a much smaller team to manage large projects, and stats on the growing customer base.  They announced that Clustify will soon replace Content Analyst as their analytics engine.  In the first phase, both engines will be available and will be implemented similarly, so the user can choose which one to use.  Later phases will make more of Clustify’s unique functionality available.  They announced an investment by ParkerGale Capital.  Operations will largely remain unchanged, but there may be some acquisitions.  The first evening ended with a party at Top Golf.

Ari Kaplan gave a presentation entitled “The Opportunity Maker,” where he told numerous entertaining stories about business problems and how to find opportunities.  He explained that doing things that nobody else does can create opportunities.  He contacts strangers from his law school on LinkedIn and asks them to meet for coffee when he travels to their town — many accept because “nobody does that.”  He sends postcards to his clients when traveling, and they actually keep them.  To illustrate the value of putting yourself into the path of opportunity, he described how he got to see the Mets in the World Series.  He mentioned HelpAReporter.com as a way to get exposure for yourself as an expert.

One of the tracks during the breakout sessions was run by The Sedona Conference and offered CLE credits.  One of the TSC presentations was “Understanding the Science & Math Behind TAR” by Maura Grossman.  She covered the basics like TAR 1.0 vs. 2.0, human review achieving roughly 70% recall due to mistakes, and how TAR performs compared to keyword search.  She mentioned that control sets can become stale because the reviewer’s concept of relevance may shift during the review.  People tend to get pickier about relevance as the review progresses, so an estimate of the number of relevant docs taken on a control set at the beginning may be too high.  She also warned that making multiple measurements against the control set can give a biased estimate about when a certain level of performance is achieved (sidenote: this is because people watch for a measure like F1 to cross a threshold to determine training completeness, which is not the best way to use a control set).  She mentioned that she and Cormack have a new paper coming out that compares human review to TAR using better-reviewed data (Tim Kaine’s emails) that addresses some criticisms of their earlier JOLT study.

There were also breakout sessions where attendees could use the Ipro software with guidance from the staff in a room full of computers.  I attended a session on ECA/EDA.  One interesting feature that was demonstrated was checking the number of documents matching a keyword search that did not match any of the other searches performed — if the number is large, it may not be a very good search query.
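
A rough sketch of that check, using made-up document ID sets rather than anything from the Ipro demo:

```python
# For each search term, count hits that no other search term matched.
# Per the session, a term with a large number of such unique hits may not be
# a very good search query and deserves a closer look.
hits = {
    "term_a": {1, 2, 3, 4},
    "term_b": {3, 4, 5},
    "term_c": {100, 101, 102, 103, 104},   # mostly unmatched by the other searches
}
for term, docs in hits.items():
    others = set().union(*(d for t, d in hits.items() if t != term))
    unique = docs - others
    print(f"{term}: {len(unique)} of {len(docs)} hits matched by no other search")
```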

Another TSC session I attended was by Brady, Grossman, and Shonka on responding to government and internal investigations.  Often (maybe 20% of the time) the government is inquiring because you are a source of information, not the target of the investigation, so it may be unwise to raise suspicion by resisting the request.  There is nothing similar to the Federal Rules of Civil Procedure for investigations.  The scope of an investigation can be much broader than civil discovery.  There is nothing like Rule 502 (protecting privilege) for investigations.  The federal government is pretty open to the use of TAR (don’t want to receive a document dump), though the DOJ may want transparency.  There may be questions about how some data types (like text messages) were handled.  State agencies can be more difficult.

The last session I attended was the analytics roundtable, where Ipro employees asked the audience questions about how they were using the software and solicited suggestions for how it could be improved.  The day ended with the Salsa Challenge (as in food, not dancing) and dinner.  I wasn’t able to attend the presentations on the final day, but the schedule looked interesting.

Webinar: 10 Years Forward and Back: Automation in eDiscovery

George Socha, Doug Austin, David Horrigan, Bill Dimm, and Bill Speros will give presentations in this webinar on the history and future of ediscovery, moderated by Mary Mack, on December 1, 2016.  Bill Dimm will talk about the evolution of predictive coding technologies and our understanding of best practices, including recall estimation, the evil F1 score, research efforts, pre-culling, and the TAR 1.0, 2.0, and 3.0 workflows.  CLICK HERE FOR RECORDING OF WEBINAR, SLIDES, AND LINKS TO RELATED RESOURCES.

Highlights from the Northeast eDiscovery & IG Retreat 2016

The 2016 Northeast eDiscovery & IG Retreat was held at the Ocean Edge Resort & Golf Club.  It was the third annual Ing3nious retreat held in Cape Cod.  The retreat featured two simultaneous sessions throughout the day in a beautiful location.  My notes below provide some highlights from the sessions I was able to attend.  You can find additional photos here.

Peer-to-Peer Roundtables
The retreat started with peer-to-peer round tables where each table was tasked with answering the question: Why does e-discovery suck (gripes, pet peeves, issues, etc.) and how can it be improved?  Responses included:

  • How to drive innovation?  New technologies need to be intuitive and simple to get client adoption.
  • Why are e-discovery tools only for e-discovery?  Should be using predictive coding for records management.
  • Need alignment between legal and IT.  Need ongoing collaboration.
  • Handling costs.  Cost models and comparing service providers are complicated.
  • Info governance plans for defensible destruction.
  • Failure to plan and strategize e-discovery.
  • Communication and strategy.  It is important to get the right people together.
  • Why not more cooperation at meet-and-confer?  Attorneys that are not comfortable with technology are reluctant to talk about it.  Asymmetric knowledge about e-discovery causes problems–people that don’t know what they are doing ask for crazy things.

Catching Up on the Implementation of the Amended Federal Rules
I couldn’t attend this one.

Predictive Coding and Other Document Review Technologies–Where Are We Now?
It is important to validate the process as you go along, for any technology.  It is important to understand the client’s documents.  Pandora is more like TAR 2.0 than TAR 1.0, because it starts giving recommendations based on your feedback right away.  The 2012 RAND study found this e-discovery cost breakdown: 73% document review, 8% collection, and 19% processing.  A question from the audience about pre-culling with keyword search before applying predictive coding spurred some debate.  Although it wasn’t mentioned during the panel, I’ll point out William Webber’s analysis of the Biomet case, which shows pre-culling discarded roughly 40% of the relevant documents before predictive coding was applied.  There are many different ways of charging for predictive coding: amount of data, number of users, hose (total data flowing through) or bucket (max amount of data allowed at one time).  Another barrier to use of predictive coding is lack of senior attorney time (e.g., to review documents for training).  Factors that will aid in overcoming barriers: improving technologies, Sherpas to guide lawyers through the process, court rulings, influence from general counsel.  Need to admit that predictive coding doesn’t work for everything, e.g., calendar entries.  New technologies include anonymization tools and technology to reduce the size of collections.  Existing technologies that are useful: entity extraction, email threading, facial recognition, and audio to text.  Predictive coding is used in maybe less than 1% of cases, but email threading is used in 99%.

It’s All Greek To Me: Multi-Language Discovery Best Practices
Native speakers are important.  An understanding of relevant industry terminology is important, too.  The ALTA fluency test is poor–the test is written in English and then translated to other languages, so it’s not great for testing ability to comprehend text that originated in another language.  Hot documents may be translated for presentation.  This is done with a secure platform that prohibits the translator from downloading the documents.  Privacy laws make it best to review in-country if possible.  There are only 5 really good legal translation companies–check with large firms to see who they use.  Throughput can be an issue.  Most can do 20,000 words in 3 days.  What if you need to do 200,000 in 3 days?  Companies do share translators, but there’s no reason for good translators to work for low-tier companies–good translators are in high demand.  QC foreign review to identify bad reviewers (need proficient managers).  May need to use machine translation (MT) if there are millions of documents.  QC the MT result and make sure it is actually useful–in 85% of cases it is not good enough.  For CJK (Chinese, Japanese, Korean), MT is terrible.  The translation industry is $40 billion.  Google invested a lot in MT but it didn’t help much.  One technology that is useful is translation memory, where repeated chunks of text are translated just once.  People performing review in Japanese must understand the subtlety of the American legal system.

Top Trends in Discovery for 2016
I couldn’t attend this one.

Measure Twice, Discover Once
Why measure in e-discovery?  So you can explain what happened and why, for defensibility.  Also important for cost management.  The board of directors may want reports.  When asked for more custodians you can show the cost and expected number of relevant documents that will be added by analyzing the number of keyword search hits.  Everything gets an ID number for tracking and analysis (USB drives, batches of documents, etc.).  Types of metrics ordered from most helpful to most harmful: useful, no metric, not useful, and misleading.  A simple metric used often in document review is documents per hour per reviewer.  What about document complexity, content complexity, number and type of issue codes, review complexity, risk tolerance instructions, number of “defect opportunities,” and number coded correctly?  Many 6-sigma ideas from manufacturing are not applicable due to the subjectivity that is present in document review.

Information Governance and Data Privacy: A World of Risk
I couldn’t attend this one.

The Importance of a Litigation Hold Policy
I couldn’t attend this one.

Alone Together: Where Have All The Model TAR Protocols Gone?
If you are disclosing details, there are two types: inputs (search terms used to train, shared review of training docs) and outputs (target recall or disclosure of recall).  Don’t agree to a specific level of recall before looking at the data–if prevalence is low it may be hard.  Plaintiff might argue for TAR as a way to overcome cost objections from the defendant.  There is concern about lack of sophistication from judges–there is “stunning” variation in expertise among federal judges.  An attorney involved with the Rio Tinto case recommends against agreeing on seed sets because it is painful and focuses on the wrong thing.  Sometimes there isn’t time to put eyes on all documents that will be produced.  Does the TAR protocol need to address dupes, near-dupes, email threading, etc.?

Information Governance: Who Owns the Information, the Risk and the Responsibility?
I couldn’t attend this one.

Bringing eDiscovery In-House — Savings and Advantages
I was on this panel, so I didn’t take notes.

Webinar: How Automation is Revolutionizing eDiscovery

Doug Austin, Bill Dimm, and Bill Speros will give presentations in this webinar moderated by Mary Mack on August 10, 2016.  In addition to broad topics on automation in e-discovery, expect a fair amount on technology-assisted review, including a description of TAR 1.0, 2.0, and 3.0, comparison to human review, and controversial thoughts on judicial acceptance.  CLICK HERE FOR RECORDED WEBINAR

Highlights from the Masters Conference in NYC 2016

The 2016 Masters Conference in NYC was a one-day e-discovery conference held at the New Yorker.  There were two simultaneous sessions throughout the day, so I couldn’t attend everything.  Here are my notes:

Faster, Better, Cheaper: How Automation is Revolutionizing eDiscovery
I was on this panel, so I didn’t take notes.

Five Forces Changing Corporate eDiscovery
68% of corporations are using some type of SaaS/cloud service.  Employees want to use things like Dropbox and Slack, but it is a challenge to deal with them in ediscovery–the legal department is often the roadblock to the cloud.  Consumer products don’t have compliance built-in.  Ask the vendor for corporate references to check on ediscovery issues.  72% of corporations have concerns about the security of distributing ediscovery data to law firms and vendors.  80% rarely or never audit the technical competence of law firms and vendors (the panel members were surprised by this).  Audits need to be refreshed from time to time.  Corporate data disposition is the next frontier due to changes in the Federal Rules and cybersecurity concerns.  Keeping old data will cause problems later if there is a lawsuit or the company is hacked. Need to make sure all copies are deleted.  96% of corporations use metrics and reporting on their legal departments.  Only 28% think they have enough insight into the discovery process of outside counsel (the panel members were surprised by this since they collaborate heavily with outside counsel).  What is tracked:

65% Data Managed
57% eDiscovery Spend
52% eDiscovery Spend per GB
48% Review Staffing
48% Total Review Spend
39% Technologies Used
30% Review Efficiency

28% of the litigation budget is dedicated to ediscovery. 44% of litigation strategies are affected by ediscovery costs.  92% would use analytics more often if cost was not an issue.  The panelists did not like extra per-GB fees for analytics–they prefer an all-inclusive price (sidenote: If you assume the vendor is collecting money from you somehow in order to pay for development of analytics software, including analytics in the all-inclusive price makes the price higher than it would need to be if analytics were excluded, so your non-analytics cases are subsidizing the cases where analytics are used).

Benefits and Challenges in Creating an Information Governance (IG) Program
I couldn’t attend this one.

Connected Digital Discovery: Can We Get There?
There is an increasing push for BYOD, but 48% of BYOD employees disable security.  Digital investigation, unlike ediscovery, involves “silent holds” where documents are collected without employee awareness.  When investigating an executive, must also investigate or do a hold on the executive’s assistant.  The info security department has a different tool stack than ediscovery (e.g., network monitoring tools), so it can be useful to talk to them.

How to Handle Cross-Border Data Transfers in the Aftermath of the Schrems Case
I couldn’t attend this one.

TAR in litigation and government investigation: Possible Uses and Problems
Tracy Greer said the DOJ wants to know the TAR process used.  Surprisingly, it is often found to deviate from the vendor’s recommended best practices.  They also require disclosure of a random sample (less than 5,000 documents) from the documents that were predicted to be non-relevant (referred to as the “null set” in the talk, though I hate that name).  Short of finding a confession of a felony, they wouldn’t use the documents from the sample against the company–they use the sample to identify problems as early as possible (e.g., misunderstandings about what must be turned over) and really want people to feel that disclosing the sample is safe.  Documents from second requests are not subject to FOIA.  They are surprised that more people don’t seem to do email domain filtering.  Doing keyword search well (sampling and constructing good queries) is hard.  TAR is not always useful.  For example, when looking for price fixing of ebooks by Apple and publishers it is more useful to analyze volume of communications.  TAR is also not useful for analyzing database systems like Peoplesoft and payroll systems.  Recommendations:

  • Keyword search before TAR: No
  • Initial review by SME: Yes
  • Initial review by large team: No
  • De-dupe first: Yes
  • Consolidate threads: No

The “overturn rate” is the rate at which another reviewer disagrees with the relevance determination of the initial reviewer. A high overturn rate could signal a problem. The overturn rate is expected to decrease over time. The DOJ expects the overturn rate to be reported, which puts the producing party on notice that they must monitor quality. The DOJ doesn’t have a specific recall expectation–they ask that sampling be done and may accept a smaller recall if it makes sense.  Judge Hedges speculated that TAR will be challenged someday and it will be expensive.
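
For clarity, here is a small sketch of how an overturn rate might be computed from a QC sample; the labels are invented for illustration:

```python
# Overturn rate: fraction of re-reviewed documents where the second reviewer
# disagrees with the first reviewer's relevance call (illustrative data only).
first_pass  = {"doc1": True, "doc2": False, "doc3": True, "doc4": False}
second_pass = {"doc1": True, "doc2": True,  "doc3": True, "doc4": False}
overturned = sum(1 for d in first_pass if first_pass[d] != second_pass[d])
print(f"Overturn rate: {overturned / len(first_pass):.0%}")   # 25% here
```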

The Internet of Things (IoT) Creates a Thousand Points of (Evidentiary) Light.  Can You See It?
I couldn’t attend this one.

The Social Media (R)Evolution: How Social Media Content Impacts e-Discovery Risks and Costs
Social media is another avenue of attack by hackers.  They can hijack an account and use it to send harmful links to contacts.  Hackers like to attack law firms doing M&A due to the information they have.  Once hacked, reliability of all data is now in question–it may have been altered.  Don’t allow employees to install software or apps.  Making threats on social media, even in jest, can bring the FBI to your doorstep in hours, and they won’t just talk to you–they’ll talk to your boss and others.

From Case Management to Case Intelligence: Surfacing Legal Business Intelligence
I couldn’t attend this one.

Early Returns from the Federal Rules of Civil Procedure Changes
New rule 26(b)(1) removes “reasonably calculated to lead to the discovery of admissible evidence.”  Information must be relevant to be discoverable.  Should no longer be citing Oppenheimer.  Courts are still quoting the removed language.  Courts have picked up on the “proportional to the needs of the case” change.  Judge Scheindlin said she was concerned there would be a lot of motion practice and a weakening of discovery with the new rules, but so far the courts aren’t changing much.  Changes were made to 37(e) because parties were over-preserving.  Sanctions were taken out, though there are penalties if there was an intent to deprive the other party of information.  Otherwise, the cure for loss of ESI may be no greater than necessary to cure prejudice.  Only applies to electronic information that should have been preserved, only applies if there was a failure to take reasonable steps, and only applies if the information cannot be restored/replaced via additional discovery.  What are “reasonable steps,” though?  Rule 1 requires cooperation, but that puts lawyers in an odd position because clients are interested in winning, not justice.  This is not a sanctions rule, but the court can send you back.  Judge Scheindlin said judges are paying attention to this.  Rule 4(m) reduces the number of days to serve a summons from 120 to 90.  16(b)(2) reduces days to issue a scheduling order after defendant is served from 120 to 90, or from 90 to 60 after defendant appears.  26(c)(1)(B) allows the judge to allocate expenses (cost shifting).  34(b)(2)(B) and 34(b)(2)(C) require greater specificity when objecting to production (no boilerplate) and the objection must state if responsive material was withheld due to the objection.  The 50 states are not all going along with the changes–they don’t like some parts.

Better eDiscovery: Leveraging Technology to its Fullest
When there are no holds in place, consider what you can get rid of.  Before discarding the discovery set, analyze it to see how many of the documents violated the retention policy–did those documents hurt your case?  TAR can help resolve the case faster.  Use TAR on incoming documents to see trends.  Could use TAR to help with finding privileged documents (though the panelist admitted not having tried it).  Use TAR to prioritize documents for review even if you plan to review everything.  Clustering helps with efficiency because all documents of a particular type can be assigned to the same lawyer.  Find gaps in the production early–the judge will be skeptical if you wait for months.  Can use clustering on custodian level to see topics involved.  Analyze email domains.

Vendor Selection: Is Cost the Only Consideration?
I couldn’t attend this one.

The conference ended with a reception at the top of the Marriott.  The conference also promoted a fundraiser for the victims of the shooting in Orlando.

Highlights from the Southeast eDiscovery & IG Retreat 2016

This retreat was the first one held by Ing3nious in the Southeast.  It was at the Chateau Elan Winery & Resort in Braselton, Georgia.  Like all of the e-discovery retreats organized by Chris LaCour, it featured informative panels in a beautiful setting.  My notes below offer a few highlights from the sessions I attended.  There were often two sessions occurring simultaneously, so I couldn’t attend everything.

Peer-to-Peer Roundtables
My table discussed challenges people were facing.  These included NSF files (Lotus Notes), weird native file formats, and 40-year-old documents that had to be scanned and OCRed.  Companies having a “retain everything” culture are problematic (e.g., 25,000 backup tapes).  One company had a policy of giving each employee a DVD containing all of their emails when they left the company.  When they got sued they had to hunt down those DVDs to retrieve emails they no longer had.  If a problem (information governance) is too big, nothing will be done at all.  In Canada there are virtually never sanctions, so there is always a fight about handing anything over.

Proactive Steps to Cut E-Discovery Costs
I couldn’t attend this one.

The Intersection of Legal and Technical Issues in Litigation Readiness Planning
It is important to establish who you should go to.  Many companies don’t have a plan (figure it out as you go), but it is a growing trend to have one due to data security and litigation risk.  Having an IT / legal liaison is becoming more common.  For litigation readiness, have providers selected in advance.  To get people on board with IG, emphasize cost (dollars) vs. benefit (risk).  Should have an IG policy about mobile devices, but they are still challenging.  Worry about data disposition by a third party provider when the case is over.  Educate people about company policies.

Examining Your Tools & Leveraging Them for Proactive Information Governance Strategy
I couldn’t attend this one.

Got Data? Analytics to the Rescue
Only 56% of in-house counsel use analytics, but 93% think it would be useful.  Use foreign language identification at start to know what you are dealing with.  Be careful about coded language (e.g., language about fantasy sports that really means something else) — don’t cull it!  Graph who is speaking to whom.  Who are emails being forwarded to?  Use clustering to find themes.  Use assisted redaction of PII, but humans should validate the result (this approach gives a 33% reduction in time).  Re-OCR after redaction to make sure it is really gone.  Alex Ponce de Leon from Google said they apply predictive coding immediately as early-case assessment and know the situation and critical documents before hiring outside counsel (many corporate attorneys in the audience turned green with envy).  Predictive coding is also useful when you are the requesting party.  Use email threading to identify related emails.  The requesting party may agree to receive just the last email in the thread.  Use analytics and sampling to show the judge the burden of adding custodians and the number of relevant documents expected — this is much better than just throwing around cost numbers.  Use analytics for QC and reviewer analysis.  Is someone reviewing too slow/fast (keep in mind that document type matters, e.g. spreadsheets) or marking too many docs as privileged?

The Power of Analytics: Strategies for Investigations and Beyond
Focus on the story (fact development), not just producing documents.  Context is very important for analyzing instant messages.  Keywords often don’t work for IMs due to misspellings.  Analytics can show patterns and help detect coded language.  Communicate about how emails are being handled — are you producing threads or everything, and are you logging threads or everything (producing and logging may be different).  Regarding transparency, are the seed set and workflow work product?  When working with the DOJ, showed them results for different bands of predictive coding results and they were satisfied with that.  Nobody likes the idea of doing a clawback agreement and skipping privilege review.

Freedom of Speech Isn’t Free…of Consequences
The 1st Amendment prohibits Congress from passing laws restricting speech, but that doesn’t keep companies from putting restrictions on employees.  With social media, cameras everywhere, and the ability of things to go viral (the grape lady was mentioned), companies are concerned about how their reputations could be damaged by employees’ actions, even outside the workplace.  A doctor and a Taco Bell executive were fired due to videos of them attacking Uber drivers.  Employers creating policies curbing employee behavior must be careful about Sec. 8 of the National Labor Relations Act, which prohibits employers from interfering with employees’ Sec. 7 rights to self-organize or join/form a labor organization.  Taken broadly, employers cannot prohibit employees from complaining about working conditions since that could be seen as a step toward organizing.  Employers have to be careful about social media policies or prohibiting employees from talking to the media because of this.  Even a statement in the employee handbook saying employees should be respectful could be problematic because requiring them to be respectful toward their boss could be a violation.  The BYOD policy should not prohibit accessing Facebook (even during work) because Facebook could be used to organize.  On the other hand, employers could face charges of negligent retention/hiring if they don’t police social media.

Generating a Competitive Advantage Through Information Governance: Lessons from the Field
I couldn’t attend this one.

Destruction Zone
The government is getting more sophisticated in its investigations — it is important to give them good productions and avoid losing important data.  Check to see if there is a legal hold before discarding old computer systems and when employees leave the company.  It is important to know who the experts are in the company and ensure communication across functions.  Information governance is about maximizing value of information while minimizing risks.  The government is starting to ask for text messages.  Things you might have to preserve in the future include text messages, social media, videos, and virtual reality.  It’s important to note the difference between preserving the text messages by conversation and by custodian (where things would have to be stitched back together to make any sense of the conversation).  Many companies don’t turn on recording of IMs, viewing them as conversational.

Managing E-Discovery as a Small Firm or Solo Practitioner
I couldn’t attend this one.

Overcoming the Objections to Utilizing TAR
I was on this panel, so I didn’t take notes.

Max Schrems, Edward Snowden and the Apple iPhone: Cross-Border Discovery and Information Management Times Are A-Changing
I couldn’t attend this one.

Highlights from the ACEDS 2016 E-Discovery Conference

The conference was held at the Grand Hyatt in New York City this year.  There were two full days of talks, often with several simultaneous sessions.  My notes below provide only a few highlights from the subset of the sessions that I was able to attend.

Future Forward Stewardship of Privacy and Security
David Shonka, Acting General Counsel for the FTC, discussed several privacy concerns, such as being photographed in public and having that photo end up online.  Court proceedings are made public–should you have to give up your privacy to prosecute a claim or defend against a frivolous claim?  BYOD and working from home on a personal computer present problems for possession, custody, and control of company data.  If there is a lawsuit, what if the person won’t hand over the device/computer?  What about the privacy rights of other people having data on that computer?  Data brokers have 3,000 data points on each household.  Privacy laws are very different in Europe.  Info governance is necessary for security–you must know what you have in order to protect it.

The Art & Science of Computer Forensics: Why Hillary Clinton’s Email & Tom Brady’s Cell Phone Matter
Email headers can be faked–who really sent/received the email?  Cracking an iPhone may fail Daubert depending on how it is done.  SQLite files created by apps may contain deleted info.  IT is not forensics, though some very large companies do have specialists on staff.  When trying to get accepted by the court as an expert, do they help explain reliable principles and methods?  If they made their own software, that could hurt.  They need to be understandable to other experts.  Certifications and relevant training and experience are helpful.  Have they testified before (state, federal)?  Could be bad if they’ve testified for the same client before–seen as biased.  Reports should avoid facts that don’t contribute to the conclusion.  Include screenshots and write clearly.  With BYOD, what happens when the employee leaves and wipes the phone?  Companies might consider limiting website access (no gmail).

The Secrets No One Tells You: Taking Control of Your Time, Projects, Meetings, and Other Workplace Time-Stealers
I couldn’t attend this one.

Ethics Rules for the Tech Attorney
I couldn’t attend this one.

Hiring & Retaining E-Discovery Leaders
I couldn’t attend this one.

Piecing the Puzzle Together: Understanding How Associations can Enhance Your Career
I couldn’t attend this one.

Tracking Terrorism in the Digital Age & Its Lessons for EDiscovery – A Technical Approach
I couldn’t attend this one.

E-Discovery Project Management: Ask Forgiveness, Not Permission
I couldn’t attend this one.

The Limits of Proportionality
I couldn’t attend this one.

What Your Data Governance Team Can Do For You
I couldn’t attend this one.

Financial Industry Roundtable
I couldn’t attend this one.

Using Analytics & Visualizations to Gain Better Insight into Your Data
I couldn’t attend this one.

Defending and Defeating TAR
Rule 5.3 of the Rules of Professional Conduct says a lawyer must supervise non-lawyers.  The judge doesn’t want to get involved in arguments over e-discovery–work it out yourselves.  After agreeing on an approach like TAR, it is difficult to change course if it turns out to be more expensive than anticipated.  Make sure you understand what can be accomplished.  Every case is different.  Text-rich documents are good for TAR.  Excel files may not work as well.  If a vendor claims success with TAR, ask what kind of case, how big it was, and how they trained the system.  Tolerance for transparency depends on who the other side is.  Exchanging seed sets is “almost common practice,” but you can make an argument against disclosing non-relevant documents.  One might be more reluctant to disclose non-relevant documents to a private party (compared to disclosing to the government, where they “won’t go anywhere”).  Recipient of seed documents doesn’t have any way to know if something important was missing from the seed set (see this article for more thoughts on seed set disclosure).  Regulators don’t like culling before TAR is applied.  In the Biomet case, culling was done before TAR and the court did not require the producing party to redo it (in spite of approximately 40% of the relevant documents being lost in the culling).

Training was often done by a subject matter expert in the past.  More and more, contract reviewers are being used.  How to handle foreign language documents?  Should translations be reviewed?  Should the translator work with the reviewer?  Consider excluding training documents having questionable relevance.  When choosing the relevance score threshold that will determine which documents will be reviewed, you can tell how much document review will be required to reach a certain level of recall, so proportionality can be addressed.  “Relevance rank” is a misnomer–it is measuring how similar (in a sense) the document is to relevant documents from the training set.

Judge Peck has argued that Daubert doesn’t apply to TAR, whereas Judge Waxse has argued that it does apply (neither of them were present).  Judge Hedges thinks Waxse is right.  TAR is not well defined–definitions vary and some are very broad.  If some level of recall is reached, like 80%, the 20% that was missed could contain something critical.  It is important to ensure that metrics are measuring the right thing.  The lawyer overseeing e-discovery should QC the results and should know what the document population looks like.

Managing Your Project Manager’s Project Manager: Who’s On First?
I couldn’t attend this one.

E-Discovery & Compliance
I couldn’t attend this one.

Solving the Privilege Problem
I couldn’t attend this one.

What Your Data Governance Team Can Do For You
I couldn’t attend this one.

The Anatomy of a Tweet
Interpreting content from social media is challenging.  Emojis could be important, though they are now passé (post a photo of yourself making the face instead).  You can usually only collect public info unless it is your client’s account.  Social media can be used to show that someone wasn’t working at a particular time.  A smartphone may contain additional info about social media use that is not available from the website (number of tweets can reveal existence of private tweets or tweets that were posted and deleted).  Some tools for collecting from social media are X1, Nextpoint, BIA, and Hanzo.  They are all different and are suitable for different purposes.  You may want to collect metadata for analysis, but may want to present a screenshot of the webpage in court because it will be more familiar.  Does the account really belong to the person you think it does?

The Essence of E-Discovery Education
I couldn’t attend this one.

Women to Know: What’s Your Pitch / What’s Your Story
I couldn’t attend this one.

Establishing the Parameters of Ethical Conduct in the Legal Technology Industry – LTPI Working Session
I couldn’t attend this one.

Tracking Terrorism in the Digital Age & Its Lessons for EDiscovery – Judicial Perspectives
Judges Francis, Hedges, Rodriguez, and Sciarrino discussed legal issues around Apple not wanting to crack the iPhone the San Bernardino killers used and other issues around corporate obligations to aid an investigation.  The All Writs Act can compel aid if it is not too burdensome.  The Communications Assistance for Law Enforcement Act of 1994 (CALEA) may be a factor.  Apple has claimed two burdens: 1) the engineering labor required, and 2) its business would be put at a competitive disadvantage if it cracked the phone because of damage to its reputation (though nobody would have known if they hadn’t taken it to court).  They dropped (1) eventually.  The government ultimately dropped the case because they cracked the phone without Apple’s help.  Questions that should be asked are how much content could be gotten without cracking the phone (e.g., from cloud backup, though the FBI messed that up by changing a password), and what do you think you will find that is new?  Microsoft is suing to be allowed to tell a target that their info has been requested by the government.  What is Microsoft’s motivation for this?  An audience member suggested it may be to improve their image in privacy-conscious Germany.  Congress should clarify companies’ obligations.

EDna Challenge Part 2
The EDna challenge attempted to find low-cost options for handling e-discovery for a small case.  The challenge was recently revisited with updated parameters and requirements.  SaaS options from CSDisco, Logikcull, and Lexbe were too opaque about pricing to evaluate.  The SaaS offering from CloudNine came in at $4,660, which includes training.  The SaaS offering from Everlaw came in at $2,205.  Options for local applications included Prooffinder by Nuix at $600 (which goes to charity) and Intella by Vound at $4,780.  Digital WarRoom has apparently dropped their express version, so they came in above the allowed price limit at $8,970.  FreeEed.org is an open source option that is free aside from AWS cloud hosting costs.  Some questioned the security of using a solution like FreeEed in the cloud.  Compared to the original EDna challenge, it is now possible to accomplish the goal with purpose-built products instead of cobbling together tools like Adobe Acrobat.  An article by Greg Buckles says one of the biggest challenges is high monthly hosting charges.

“Bring it In” House
I couldn’t attend this one.

The Living Dead of E-Discovery
I couldn’t attend this one.

The Crystal “Ball”: A Look Into the Future of E-Discovery
Craig Ball pointed out that data is growing at 40% per year.  It is important to be aware of all of the potential sources of evidence.  For example, you cannot disable a phone’s geolocation capability because it is needed for 911 calls.  You may be able to establish someone’s location from their phone pinging WiFi.  The average person uses Facebook 14 times per day, so that provides a record of their activity.  We may be recorded by police body cameras, Google Glass, and maybe someday by drones that follow us around.  Car infotainment systems store a lot of information.  NFC passive tags may be found in the soles of your new shoes.  These things aren’t documents–you can’t print them out.  Why are lawyers so afraid of such data when it can lead to the truth?  Here are some things that will change in the future.

  • Changing of the guard: Judge Facciola retired and Judge Scheindlin will retire soon.
  • Retraction of e-discovery before it explodes: new rules create safe harbors–need to prove the producing party failed on purpose.
  • Analytics “baked into” the IT infrastructure: Microsoft’s purchase of Equivio.
  • Lawyers may someday be able to look at a safe part of the source instead of making a copy to preserve the data.
  • Collection from devices will diminish due to data being held in the cloud.
  • Discovery from automobiles will be an emerging challenge.

Traditional approaches to digital forensics will falter.  Deleted files may be recovered from hard disk drives if the sectors are not overwritten with new data, but recovering data from an SSD (solid-state drive) will be much harder or impossible.  I’ll inject my own explanation here:  A data page on an SSD must have old data cleared off before new data can be written to it, which can be time consuming.  To make writing of data faster, newer drives and operating systems support something called TRIM, which allows the operating system to tell the drive to clear off the content of a deleted file immediately, so there will be no slowness introduced by clearing it later when new data must be written.  So an SSD with TRIM will erase the file content shortly after the file is deleted, whereas a hard drive will leave it on the disk and simply overwrite it later if the space is needed to hold new data.  For more on forensics with SSDs see this article.
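
If you want to check whether TRIM is actually in play on a machine used for collection, here is a hedged sketch; it assumes a Linux system with the util-linux lsblk tool available (other platforms have their own equivalents):

```python
# Ask lsblk whether block devices advertise discard (TRIM) support; non-zero
# DISC-GRAN / DISC-MAX values indicate TRIM-capable devices (Linux only).
import subprocess

out = subprocess.run(["lsblk", "--discard"], capture_output=True, text=True, check=True)
print(out.stdout)
```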

Encryption will become a formidable barrier.  Lawyers will miss the shift from words to data (e.g., fail to account for the importance of emoticons when analyzing communications).  Privacy will impact the scope of discovery.

Metrics That Matter
I couldn’t attend this one.

Avoiding Sanctions in 2016
I couldn’t attend this one.

Master Class: Interviewing in eDiscovery
I couldn’t attend this one.

E-Discovery & Pro-Bono Workshop
I couldn’t attend this one.

 

Comments on Pyrrho Investments v. MWB Property and TAR vs. Manual Review

A recent decision by Master Matthews in Pyrrho Investments v. MWB Property seems to be the first judgment by a UK court allowing the use of predictive coding.  This article comments on a few aspects of the decision, especially the conclusion about how predictive coding (or TAR) performs compared to manual review.

The decision argues that predictive coding is not prohibited by English law and that it is reasonable based on proportionality, the details of the case, and expected accuracy compared to manual review.  It recaps the Da Silva Moore v. Publicis Groupe case from the US starting at paragraph 26, and the Irish Bank Resolution Corporation v. Quinn case from Ireland starting at paragraph 31.

Paragraph 33 enumerates ten reasons for approving predictive coding.  The second reason on the list is:

There is no evidence to show that the use of predictive coding software leads to less accurate disclosure being given than, say, manual review alone or keyword searches and manual review combined, and indeed there is some evidence (referred to in the US and Irish cases to which I referred above) to the contrary.

The evidence referenced includes the famous Grossman & Cormack JOLT study, but that study only analyzed the TAR systems from TREC 2009 that had the best results.  If you look at all of the TAR results from TREC 2009, as I did in Appendix A of my book, many of the TAR systems found fewer relevant documents (albeit at much lower cost) than humans performing manual review. This figure shows the number of relevant documents found:

Number of relevant documents found for five categorization tasks. The vertical scale always starts at zero. Manual review by humans is labeled “H.” TAR systems analyzed by Grossman and Cormack are “UW” and “H5.” Error bars are 95% confidence intervals.

If a TAR system generates relevance scores rather than binary yes/no relevance predictions, any desired recall can be achieved by producing all documents having relevance scores above an appropriately calculated cutoff.  Aiming for high recall with a system that is not working well may mean producing a lot of non-relevant documents or performing a lot of human review on the documents predicted to be relevant (i.e., documents above the relevance score cutoff) to filter out the large number of non-relevant documents that the system failed to separate from the relevant ones (possibly losing some relevant documents in the process due to reviewer mistakes).  If it is possible (through enough effort) to achieve high recall with a system that is performing poorly, why were so many TAR results far below the manual review results?  TREC 2009 participants were told they should aim to maximize their F1 scores (F1 is not a good choice for e-discovery).  Effectively, participants were told to choose their relevance score cutoffs in a way that tried to balance the desire for high recall with other concerns (high precision).  If a system wasn’t performing well, maximizing F1 meant either accepting low recall or reviewing a huge number of documents to achieve high recall without allowing too many non-relevant documents to slip into the production.

The key point is that the number of relevant documents found depends on how the system is used (e.g., how the relevance score cutoff is chosen).  The amount of effort required (amount of human document review) to achieve a desired level of recall depends on how well the system and training methodology work, which can vary quite a bit (see this article).  Achieving results that are better than manual review (in terms of the number of relevant documents found) does not happen automatically just because you wave the word “TAR” around.  You either need a system that works well for the task at hand, or you need to be willing to push a poor system far enough (low relevance score cutoff and lots of document review) to achieve good recall.  The figure above should make it clear that it is possible for TAR to give results that fall far short of manual review if it is not pushed hard enough.
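
To make the cutoff trade-off concrete, here is a small sketch showing how recall, precision, and F1 move as the relevance score cutoff is lowered; the scores and relevance labels are invented for illustration:

```python
# Toy example: lowering the cutoff raises recall but lets in more non-relevant
# documents (lower precision), which someone then has to review or produce.
docs = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
        (0.60, False), (0.40, True), (0.20, False), (0.10, False)]
total_relevant = sum(rel for _, rel in docs)
for cutoff in (0.85, 0.65, 0.35):
    produced = [rel for score, rel in docs if score >= cutoff]
    tp = sum(produced)
    recall = tp / total_relevant
    precision = tp / len(produced)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"cutoff={cutoff}: recall={recall:.2f} precision={precision:.2f} F1={f1:.2f}")
```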

The discussion above focuses on the quality of the result, but the cost of achieving the result is obviously a significant factor.  Page 14 of the decision says the case involves over 3 million documents and the cost of the predictive coding software is estimated to be between £181,988 and £469,049 (plus hosting costs) depending on factors like the number of documents culled via keyword search.  If we assume the high end of the price range applies to 3 million documents, that works out to $0.22 per document, which is about ten times what it could be if they shopped around, but still much cheaper than human review.
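
The per-document figure is simple arithmetic; the GBP-to-USD exchange rate below is an assumption (roughly the 2016 rate):

```python
# High-end software estimate from the decision spread over the document count.
high_estimate_gbp = 469_049
documents = 3_000_000
gbp_to_usd = 1.4                      # assumed ~2016 exchange rate
per_doc_usd = high_estimate_gbp / documents * gbp_to_usd
print(f"${per_doc_usd:.2f} per document")   # about $0.22
```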

TAR 3.0 Performance

This article reviews TAR 1.0, 2.0, and the new TAR 3.0 workflow.  It then compares performance on seven categorization tasks of varying prevalence and difficulty.  You may find it useful to read my article on gain curves before reading this one.

In some circumstances it may be acceptable to produce documents without reviewing all of them.  Perhaps it is expected that there are no privileged documents among the custodians involved, or maybe it is believed that potentially privileged documents will be easy to find via some mechanism like analyzing email senders and recipients.  Maybe there is little concern that trade secrets or evidence of bad acts unrelated to the litigation will be revealed if some non-relevant documents are produced.  In such situations you are faced with a dilemma when choosing a predictive coding workflow.  The TAR 1.0 workflow allows documents to be produced without review, so there is potential for substantial savings if TAR 1.0 works well for the case in question, but TAR 1.0 sometimes doesn’t work well, especially when prevalence is low.  TAR 2.0 doesn’t really support producing documents without reviewing them, but it is usually much more efficient than TAR 1.0 if all documents that are predicted to be relevant will be reviewed, especially if the task is difficult or prevalence is low.

TAR 1.0 involves a fair amount of up-front investment in reviewing control set documents and training documents before you can tell whether it is going to work well enough to produce a substantial number of documents without reviewing them.  If you find that TAR 1.0 isn’t working well enough to avoid reviewing documents that will be produced (too many non-relevant documents would slip into the production) and you resign yourself to reviewing everything that is predicted to be relevant, you’ll end up reviewing more documents with TAR 1.0 than you would have with TAR 2.0.  Switching from TAR 1.0 to TAR 2.0 midstream is less efficient than starting with TAR 2.0. Whether you choose TAR 1.0 or TAR 2.0, it is possible that you could have done less document review if you had made the opposite choice (if you know up front that you will have to review all documents that will be produced due to the circumstances of the case, TAR 2.0 is almost certainly the better choice as far as efficiency is concerned).

TAR 3.0 solves the dilemma by providing high efficiency regardless of whether or not you end up reviewing all of the documents that will be produced.  You don’t have to guess which workflow to use and suffer poor efficiency if you are wrong about whether or not producing documents without reviewing them will be feasible.  Before jumping into the performance numbers, here is a summary of the workflows (you can find some related animations and discussion in the recording of my recent webinar):

TAR 1.0 involves a training phase followed by a review phase with a control set being used to determine the optimal point when you should switch from training to review.  The system no longer learns once the training phase is completed.  The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant.  The control set documents are not used to train the system.  They are used to assess the system’s predictions so training can be terminated when the benefits of additional training no longer outweigh the cost of additional training.  Training can be with randomly selected documents, known as Simple Passive Learning (SPL), or it can involve documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).

TAR 2.0 uses an approach called Continuous Active Learning (CAL), meaning that there is no separation between training and review–the system continues to learn throughout.  While many approaches may be used to select documents for review, a significant component of CAL is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions.  Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low.  Since there is no separation between training and review, TAR 2.0 does not require a control set.  Generating a control set can involve reviewing a large (especially when prevalence is low) number of non-relevant documents, so avoiding control sets is desirable.

TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space.  It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents are reviewed.  Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population.  There is no need for a control set–the system is well-trained when no additional relevant cluster centers can be found. Analysis of the cluster centers that were reviewed provides an estimate of the prevalence and the number of non-relevant documents that would be produced if documents were produced based purely on the predictions without human review.  The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or he/she can decide to review documents that have too much risk of being non-relevant (which can be used as additional training for the system, i.e., CAL).  The key point is that the user has the info he/she needs to make a decision about how to proceed after completing review of the cluster centers that are likely to be relevant, and nothing done before that point becomes invalidated by the decision (compare to starting with TAR 1.0, reviewing a control set, finding that the predictions aren’t good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless).
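
As a rough illustration of the CAL loop that TAR 2.0 (and the later stages of TAR 3.0) relies on, here is a toy, self-contained sketch.  The “model” is just word counts and the documents and labels are invented, so it shows the shape of the loop rather than any real TAR engine:

```python
# Toy CAL loop: retrain after each judgment and always review the document
# currently predicted most likely to be relevant.
from collections import Counter

docs = {
    1: "contract breach damages", 2: "lunch menu friday",
    3: "breach of contract terms", 4: "fantasy football picks",
    5: "damages settlement draft", 6: "holiday party rsvp",
}
truth = {1: True, 2: False, 3: True, 4: False, 5: True, 6: False}  # stands in for human review

def score(text, rel_words, nonrel_words):
    return sum(rel_words[w] - nonrel_words[w] for w in text.split())

labels = {1: True, 2: False}                      # seed judgments
while True:
    rel_words = Counter(w for d, r in labels.items() if r for w in docs[d].split())
    nonrel_words = Counter(w for d, r in labels.items() if not r for w in docs[d].split())
    remaining = [d for d in docs if d not in labels]
    if not remaining:
        break
    best = max(remaining, key=lambda d: score(docs[d], rel_words, nonrel_words))
    labels[best] = truth[best]                    # a reviewer would supply this label
    # A real workflow would stop after several consecutive non-relevant batches.
print(labels)
```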

The table below shows the amount of document review required to reach 75% recall for seven categorization tasks with widely varying prevalence and difficulty.  Performance differences between CAL and non-CAL approaches tend to be larger if a higher recall target is chosen.  The document population is 100,000 news articles without dupes or near-dupes.  “Min Total Review” is the number of documents requiring review (training documents and control set if applicable) if all documents predicted to be relevant will be produced without review.  “Max Total Review” is the number of documents requiring review if all documents predicted to be relevant will be reviewed before production.  None of the results include review of statistical samples used to measure recall, which would be the same for all workflows.

                               Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7
Prevalence                       6.9%     4.1%     2.9%     1.1%    0.68%    0.52%    0.32%

TAR 1.0 SPL
  Control Set                     300      500      700    1,800    3,000    3,900    6,200
  Training (Random)             1,000      300    6,000    3,000    1,000    4,000   12,000
  Review Phase                  9,500    4,400    9,100    4,400      900    9,800    2,900
  Min Total Review              1,300      800    6,700    4,800    4,000    7,900   18,200
  Max Total Review             10,800    5,200   15,800    9,200    4,900   17,700   21,100

TAR 3.0 SAL
  Training (Cluster Centers)      400      500      600      300      200      500      300
  Review Phase                  8,000    3,000   12,000    4,200      900    8,000    7,300
  Min Total Review                400      500      600      300      200      500      300
  Max Total Review              8,400    3,500   12,600    4,500    1,100    8,500    7,600

TAR 3.0 CAL
  Training (Cluster Centers)      400      500      600      300      200      500      300
  Training + Review             7,000    3,000    6,700    2,400      900    3,300    1,400
  Total Review                  7,400    3,500    7,300    2,700    1,100    3,800    1,700
[Chart: Min Total Review for each task]
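
To see how the totals in the table are built up, here is Task 1's column computed from its components (all numbers are taken straight from the table above):

```python
# Task 1, TAR 1.0 SPL
control_set, random_training, review_phase = 300, 1_000, 9_500
tar1_min = control_set + random_training         # 1,300: produce the review phase unreviewed
tar1_max = tar1_min + review_phase               # 10,800: review everything before production

# Task 1, TAR 3.0 (SAL and CAL variants)
cluster_centers, sal_review_phase, cal_training_plus_review = 400, 8_000, 7_000
tar3_sal_min = cluster_centers                               # 400: no control set needed
tar3_sal_max = cluster_centers + sal_review_phase            # 8,400
tar3_cal_total = cluster_centers + cal_training_plus_review  # 7,400

print(tar1_min, tar1_max, tar3_sal_min, tar3_sal_max, tar3_cal_total)
```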

Producing documents without review with TAR 1.0 sometimes results in much less document review than using TAR 2.0 (which requires reviewing everything that will be produced), but sometimes TAR 2.0 requires less review.

[Chart: Max Total Review for each task]

The size of the control set for TAR 1.0 was chosen so that it would contain approximately 20 relevant documents, so low prevalence requires a large control set.  Note that the control set size was chosen based on the assumption that it would be used only to measure changes in prediction quality.  If the control set will be used for other things, such as recall estimation, it needs to be larger.
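
As a quick check, the control-set sizes in the table track the 20-relevant-document target divided by prevalence (the target is from the text; the rounding is mine):

```python
prevalence = {1: 0.069, 2: 0.041, 3: 0.029, 4: 0.011, 5: 0.0068, 6: 0.0052, 7: 0.0032}
for task, p in prevalence.items():
    # Roughly 20 relevant documents expected in a random sample of 20 / p documents.
    print(f"Task {task}: ~{20 / p:,.0f} control-set documents")
# Prints 290, 488, 690, 1,818, 2,941, 3,846, 6,250 -- in line with the
# 300 to 6,200 control-set sizes shown in the table.
```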

The number of random training documents used in TAR 1.0 was chosen to minimize the Max Total Review result (see my article on gain curves for related discussion).  This minimizes total review cost if all documents predicted to be relevant will be reviewed and if the cost of reviewing documents in the training phase and review phase are the same.  If training documents will be reviewed by an expensive subject matter expert and the review phase will be performed by less expensive reviewers, the optimal amount of training will be different.  If documents predicted to be relevant won’t be reviewed before production, the optimal amount of training will also be different (and more subjective), but I kept the training the same when computing Min Total Review values.

The optimal number of training documents for TAR 1.0 varied greatly for different tasks, ranging from 300 to 12,000.  This should make it clear that there is no magic number of training documents that is appropriate for all projects.  This is also why TAR 1.0 requires a control set–the optimal amount of training must be measured.

The results labeled TAR 3.0 SAL come from terminating learning once the review of cluster centers is complete, which is appropriate if documents will be produced without review (Min Total Review).  The Max Total Review value for TAR 3.0 SAL tells you how much review would be required if you reviewed all documents predicted to be relevant but did not allow the system to learn from that review, which is useful to compare to the TAR 3.0 CAL result where learning is allowed to continue throughout.  In some cases where the categorization task is relatively easy (tasks 2 and 5) the extra learning from CAL has no benefit unless the target recall is very high.  In other cases CAL reduces review significantly.

I have not included TAR 2.0 in the table because the efficiency of TAR 2.0 with a small seed set (a single relevant document is enough) is virtually indistinguishable from the TAR 3.0 CAL results that are shown.  Once you start turning the CAL crank the system will quickly head toward the relevant documents that are easiest for the classification algorithm to identify, and feeding those documents back in for training quickly floods out the influence of the seed set you started with.  The only way to change the efficiency of CAL, aside from changing the software’s algorithms, is to waste time reviewing a large seed set that is less effective for learning than the documents that the algorithm would have chosen itself.  The training done by TAR 3.0 with cluster centers is highly effective for learning, so there is no wasted effort in reviewing those documents.

To illustrate the dilemma I pointed out at the beginning of the article, consider task 2.  The table shows that prevalence is 4.1%, so there are 4,100 relevant documents in the population of 100,000 documents.  To achieve 75% recall, we would need to find 3,075 relevant documents.  Some of the relevant documents will be found in the control set and the training set, but most will be found in the review phase.  The review phase involves 4,400 documents.  If we produce all of them without review, most of the produced documents will be relevant (3,075 out of a little more than 4,400).  TAR 1.0 would require review of only 800 documents for the training and control sets.  By contrast, TAR 2.0 (I’ll use the Total Review value for TAR 3 CAL as the TAR 2.0 result) would produce 3,075 relevant documents with no non-relevant ones (assuming no mistakes by the reviewer), but it would involve reviewing 3,500 documents.  TAR 1.0 was better than TAR 2.0 in this case (if producing over a thousand non-relevant documents is acceptable).  TAR 3.0 would have been an even better choice because it required review of only 500 documents (cluster centers) and it would have produced fewer non-relevant documents since the review phase would involve only 3,000 documents.

Next, consider task 6.  If all 9,800 documents in the review phase of TAR 1.0 were produced without review, most of the production would be non-relevant documents since there are only 520 relevant documents (prevalence is 0.52%) in the entire population!  That shameful production would occur after reviewing 7,900 documents for training and the control set, assuming you didn’t recognize the impending disaster and abort before getting that far.  Had you started with TAR 2.0, you could have had a clean (no non-relevant documents) production after reviewing just 3,800 documents.  With TAR 3.0 you would realize that producing documents without review wasn’t feasible after reviewing 500 cluster center documents and you would proceed with CAL, reviewing a total of 3,800 documents to get a clean production.

Task 5 is interesting because production without review is feasible (but not great) with respect to the number of non-relevant documents that would be produced, but TAR 1.0 is so inefficient when prevalence is low that you would be better off using TAR 2.0.  TAR 2.0 would require reviewing 1,100 documents for a clean production, whereas TAR 1.0 would require reviewing 3,000 documents for just the control set!  TAR 3.0 beats them both, requiring review of just 200 cluster centers for a somewhat dirty production.

It is worth considering how the results might change with a larger document population.  If everything else remained the same (prevalence and difficulty of the categorization task), the size of the control set required would not change, and the number of training documents required would probably not change very much, but the number of documents involved in the review phase would increase in proportion to the size of the population, so the cost savings from being able to produce documents without reviewing them would be much larger.
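
As a back-of-the-envelope illustration of that scaling argument, here is Task 2's TAR 1.0 column rescaled to a hypothetical 1,000,000-document population.  The 10x factor is my example; the assumption that only the review phase grows with the population comes from the paragraph above.

```python
scale = 1_000_000 / 100_000                     # hypothetical 10x larger population
control_set, training, review_phase = 500, 300, 4_400 * scale

min_total = control_set + training              # produce the review phase without review
max_total = min_total + review_phase            # review everything before production
print(f"Min {min_total:,.0f} vs. max {max_total:,.0f} documents reviewed")
# The gap (the saving from producing without review) grows from 4,400 to ~44,000.
```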

In summary, TAR 1.0 gives the user the option to produce documents without reviewing them, but its efficiency is poor, especially when prevalence is low.  Although active learning (not examined in this article) can reduce the number of training documents TAR 1.0 requires when prevalence is low, compared to randomly selected training documents, TAR 1.0 is still stuck with the albatross of the control set dragging down efficiency.  In some cases (tasks 5, 6, and 7) the control set by itself requires more review labor than the entire document review using CAL.  TAR 2.0 is vastly more efficient than TAR 1.0 if you plan to review all of the documents that are predicted to be relevant, but it doesn't provide the option to produce documents without reviewing them.  TAR 3.0 borrows some of the best aspects of both TAR 1.0 and TAR 2.0.  When all documents that are candidates for production will be reviewed, TAR 3.0 with CAL is just as efficient as TAR 2.0 and has the added benefits of providing a prevalence estimate and a diverse early view of relevant documents.  When it is permissible to produce some documents without reviewing them, TAR 3.0 provides that capability with much better efficiency than TAR 1.0 due to its efficient training and elimination of the control set.

If you like graphs, the gain curves for all seven tasks are shown below.  Documents used for training are represented by solid lines, and documents not used for training are shown as dashed lines.  Dashed lines represent documents that could be produced without review if that is appropriate for the case.  A green dot is placed at the end of the review of cluster centers–this is the point where the TAR 3.0 SAL and TAR 3.0 CAL curves diverge, but sometimes they are so close together that it is hard to distinguish them without the dot.  Note that review of documents for control sets is not reflected in the gain curves, so the TAR 1.0 results require more document review than is implied by the curves.

Task 1. Prevalence is 6.9%.

Task 2. Prevalence is 4.1%.

Task 3. Prevalence is 2.9%.

Task 4. Prevalence is 1.1%.

Task 5. Prevalence is 0.68%.

Task 6. Prevalence is 0.52%.

Task 7. Prevalence is 0.32%.