# Highlights from the Northeast IG Retreat 2017

The 2017 Northeast Information Governance Retreat was held at the Salamander Resort & Spa in Middleburg, Virginia.  After round table discussions, the retreat featured two simultaneous sessions throughout the day. My notes below provide some highlights from the sessions I was able to attend.

Enhancing eDiscovery With Next Generation Litigation Management Software
I couldn’t attend this.

Legal Tech and AI – Inventing The Future
Machines are currently only good at routine tasks.  Interactions with machines should allow humans and machines to do what they do best.  Some areas where AI can aid lawyers: determining how long litigation will take, suggesting cases you should reference, telling how often the opposition has won in the past, determining appropriate prices for fixed fee arrangements, recruiting, or determining which industry to focus on.  AI promises to help with managing data (e.g., targeted deletion), not just e-discovery.  Facial recognition may replace plane tickets someday.

Zen & The Art Of Multi-Language Discovery: Risks, Review & Translation
I couldn’t attend this.

NexLP Demo
The NexLP tool emphasizes feature extraction and use of domain knowledge from external sources to figure out the story behind the data.  It can generate alerts based on changes in employee behavior over time.  Companies should have a policy allowing the scanning of emails to detect bad behavior.  It was claimed that using AI on emails is better for privacy than having a human review random emails since it keeps human eyes away from emails that are not relevant.

TAR: What Have We Learned?
I moderated this panel, so I didn’t take notes.

Are Managed Services Manageable?
I couldn’t attend this.

Cyber And Data Security For The GC: How To Stay Out Of Headlines And Crosshairs
I couldn’t attend this.

The Office Is Out: Preservation And Collection In The Merry Old Land Of Office 365
Enterprise 5 (E5) has advanced analytics from Equivio.  E3 and E1 can do legal hold but don’t have advanced analytics.  There are options available that are not on the website, and there are different builds — people are not all using the same thing.  Search functionality works on limited file types (e.g., Microsoft products).  Email attachments are OK if they are from Microsoft products.  It will not OCR PDFs that lack embedded text.  What about emails attached to emails?  Previously, it only went one layer deep on attachments.  Latest versions say they are “relaxing” that, but it is unclear what that means (how deep?).  User controls sync — are we really searching everything?  Make sure you involve IT, privacy, info governance, etc. if considering transition to 365.  Be aware of data that is already on hold if you migrate to 365.  Start by migrating a small group of people that are not often subject to litigation.  Test each data type after conversion.

How To Make Sense Of Information Governance Rules For Contractors When The Government Itself Can’t?
I couldn’t attend this.

Judges, The Law And Guidance: Does ‘Reasonableness’ Provide Clarity?
This was primarily about the impact of the new Federal Rules of Civil Procedure.  Clients are finally giving up on putting everything on hold.  Tie document retention to business needs, and you shouldn’t have to worry about sanctions.  Document everything (e.g., why you chose specific custodians to hold).  Accidentally missing one custodian out of a hundred is now OK.  Some judges acknowledge the new rules but then ignore them.  Boilerplate objections to discovery requests need to stop; keep notes on why you made each objection.

Beyond The Firewall: Cybersecurity & The Human Factor
I couldn’t attend this.

The Theory of Relativity: Is There A Black Hole In Electronic Discovery?
The good about Relativity: everyone knows it, it has plug-ins, and moving from document to document is fast compared to previous tools.  The bad: TAR 1.0 (federal judiciary prefers CAL).  An audience member expressed concern that as Relativity gets close to having a monopoly we should expect high prices and a lack of innovation.  Relativity One puts kCura in competition with service providers.

The day ended with a wine social.

# Highlights from Ipro Innovations 2017

The 16th annual Ipro Innovations conference was held at the Talking Stick Resort.  It was a well-organized conference with over 500 attendees, lots of good food and swag, and over two days worth of content.  Sometimes, everyone attended the same presentation in a large hall.  Other times, there were seven simultaneous breakout sessions.  My notes below cover only the small subset of the presentations that I was able to attend.  I visited the Ipro office on the final day.  It’s an impressive, modern office with lots of character.  If you are wondering whether the Ipro people have a sense of humor, you need look no farther than the signs for the restrooms.

The conference started with a summary of recent changes to the Ipro software line-up, how it enables a much smaller team to manage large projects, and stats on the growing customer base.  They announced that Clustify will soon replace Content Analyst as their analytics engine.  In the first phase, both engines will be available and will be implemented similarly, so the user can choose which one to use.  Later phases will make more of Clustify’s unique functionality available.  They announced an investment by ParkerGale Capital.  Operations will largely remain unchanged, but there may be some acquisitions.  The first evening ended with a party at Top Golf.

Ari Kaplan gave a presentation entitled “The Opportunity Maker,” where he told numerous entertaining stories about business problems and how to find opportunities.  He explained that doing things that nobody else does can create opportunities.  He contacts strangers from his law school on LinkedIn and asks them to meet for coffee when he travels to their town, and many accept because “nobody does that.”  He sends postcards to his clients when traveling, and they actually keep them.  To illustrate the value of putting yourself into the path of opportunity, he described how he got to see the Mets in the World Series.  He mentioned HelpAReporter.com as a way to get exposure for yourself as an expert.

One of the tracks during the breakout sessions was run by The Sedona Conference and offered CLE credits.  One of the TSC presentations was “Understanding the Science & Math Behind TAR” by Maura Grossman.  She covered the basics like TAR 1.0 vs. 2.0, human review achieving roughly 70% recall due to mistakes, and how TAR performs compared to keyword search.  She mentioned that control sets can become stale because the reviewer’s concept of relevance may shift during the review.  People tend to get pickier about relevance as the review progresses, so an estimate of the number of relevant docs taken on a control set at the beginning may be too high.  She also warned that making multiple measurements against the control set can give a biased estimate about when a certain level of performance is achieved (sidenote: this is because people watch for a measure like F1 to cross a threshold to determine training completeness, which is not the best way to use a control set).  She mentioned that she and Cormack have a new paper coming out that compares human review to TAR using better-reviewed data (Tim Kaine’s emails) that addresses some criticisms of their earlier JOLT study.

There were also breakout sessions where attendees could use the Ipro software with guidance from the staff in a room full of computers.  I attended a session on ECA/EDA.  One interesting feature that was demonstrated was checking the number of documents matching a keyword search that did not match any of the other searches performed — if the number is large, it may not be a very good search query.
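The unique-hit check described above can be illustrated with plain set arithmetic. This is only a minimal sketch of the idea, not how the Ipro software implements it; the function name `unique_hit_counts`, the search names, and the document IDs are all hypothetical.

```python
def unique_hit_counts(search_hits):
    """Count, for each search, the documents that no other search matched.

    search_hits: dict mapping a search name to the set of matching document IDs.
    """
    counts = {}
    for name, hits in search_hits.items():
        # Union of the documents matched by every *other* search.
        others = set().union(
            *(h for other, h in search_hits.items() if other != name)
        )
        counts[name] = len(hits - others)
    return counts


# Hypothetical searches and document IDs, for illustration only.
search_hits = {
    "merger": {1, 2, 3, 4},
    "acquisition": {3, 4, 5},
    "lunch": {6, 7, 8, 9, 10, 11},
}
unique_counts = unique_hit_counts(search_hits)
print(unique_counts)
```

In this made-up data the "lunch" search has by far the most unique hits, which is the kind of signal that would prompt a closer look at that query.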

Another TSC session I attended was by Brady, Grossman, and Shonka on responding to government and internal investigations.  Often (maybe 20% of the time) the government is inquiring because you are a source of information, not the target of the investigation, so it may be unwise to raise suspicion by resisting the request.  There is nothing similar to the Federal Rules of Civil Procedure for investigations.  The scope of an investigation can be much broader than civil discovery.  There is nothing like rule 502 (protecting privilege) for investigations.  The federal government is pretty open to the use of TAR (don’t want to receive a document dump), though the DOJ may want transparency.  There may be questions about how some data types (like text messages) were handled.  State agencies can be more difficult.

The last session I attended was the analytics roundtable, where Ipro employees asked the audience questions about how they were using the software and solicited suggestions for how it could be improved.  The day ended with the Salsa Challenge (as in food, not dancing) and dinner.  I wasn’t able to attend the presentations on the final day, but the schedule looked interesting.

# Highlights from the East Coast eDiscovery & IG Retreat 2015

This was the second year that Ing3nious has held a retreat on the east coast, with other events organized by Chris LaCour held in California going back five years.  The event was held at the Wequassett Resort in Cape Cod.  As always, the event was well-organized and the location was beautiful.  Luckily, the weather was fantastic.  My notes below only capture a small amount of the information presented. There were often two simultaneous sessions, so I couldn’t attend everything.

Keynote: Away with Words: The Myths and Misnomers of Conventional Search Strategies

Thomas Barnett started the keynote by asking the audience to suggest keyword searches to find items discussing the meaning of existence.  He then said that he had in mind “to be, or not to be” and pointed out that it contains only stop words.  He then described unsupervised (clustering) and supervised (predictive coding) machine learning.  He talked about entity extraction, meaning the identification of dates and names of people and companies in a document.  He talked about sentiment analysis and how a person might change their language when they are doing something wrong.  He also pointed out that a product may have different names in different countries, which can make it easy to miss things with keyword search.

Advancing Discovery: What if Lawyers are the Problem?

I couldn’t attend this one.

Turbulent Sea in the Safe Harbor.  Is There a Lifeboat for Transfers of EU Data to the US?

Max Schrems complained to the Irish Data Protection Commissioner 22 times about the Safe Harbor Privacy Principles failing to protect the privacy of E.U. citizens’ data when companies move the data to the U.S.  After Snowden released information on NSA data collection, Schrems complained a 23rd time.  Ultimately, a judge found the Safe Harbor to be invalid.

Companies must certify to the Department of Commerce that they will adhere to the Safe Harbor Privacy Principles.  Many e-discovery service providers were pressured to certify so they could bring data to the U.S. for discovery even though e-discovery usage of the data would involve very bad privacy violations.

Some argue that there is no other legal mechanism that could work for bringing data to the U.S. because the U.S. government can pick up everything, so no guarantees about privacy can be made.  The best option would be to get consent from the person, but it must be done in a very clear manner specifying what data and who will see it.  An employer asking an employee for consent would be seen as coercive.  It will be hard to get consent from someone if you are investigating them for criminal activity.

There is really no way to move data from Europe to the U.S. for litigation without violating the law.  Consent would be required not just from the custodian but from everyone in the emails.  Some countries (France, Germany, and Switzerland) have blocking statutes that make taking the data a criminal offense.

Ethics: eDiscovery, Social Media, and the Internet of Things

I couldn’t attend this one.

Understanding the Data Visualization Trend in Legal

I was on this panel, so I didn’t take notes.  I did mention Vischeck, which allows you to see what your graphics would look like to a color-blind person.

Information Governance – How Do You Eat an Elephant?

I couldn’t attend this one.

Email Laws, IG Policies and the “Smoking Gun”

There has been confusion over what should be considered a record.  In the past, emails that were considered to be records were printed and stored.  Now email should be considered to be a record by default.  30-day retention policies are hard to defend.  Keep deleted emails for 60 days and use analytics to identify emails that employees should not have deleted so they can be saved.  Use automated logging to show compliance.

Protecting Enterprise Data Across Partners, Providers and the Planet

I couldn’t attend this one.

Defeating Analysis Paralysis – Strategies and Success Stories for Implementing IG Policies and Using TAR / Data Analytics

Berkeley Research Group finds that most companies are still keeping everything.  The longer data is kept, the less value it has to the company and the more risk it poses (ediscovery cost and privacy issues if there is a breach).  Different departments within the company may want different retention rules.  Breaches cost the company in lawsuits and in reputation.  The E.U. requires breach notification within 24 hours.

Having employees tag documents gives low-quality tags (they aren’t lawyers), but retention based on those tags is good enough to satisfy the court.  Need employees to follow the retention policy, so keep it simple.  Some speculate that insurance providers may end up driving info governance by forcing their clients to do it.

The Coalition of Technology Resources for Lawyers found that 56% of legal departments are reporting that they use analytics.  Clustering can help with investigation and determining search terms.  Look at email domain names (e.g., nytimes.com) to cull.  Note that email journaling keeps everything.  Analytics technology has improved, so if you were disappointed in the past you might want to try it again.

How Automated Digital Discovery is Changing eDiscovery as We Know It

I couldn’t attend this one.

Creating Order Out of Chaos: Framing and Taming Data Discovery Challenges in Expedited Matters

This panel started by walking through a (hypothetical?) investigation of a head of operations who left and joined a competitor in violation of a non-compete agreement that was determined to be unenforceable.  Did he transfer company data to the competitor?

Look for evidence that USB devices were used on the company laptop.  Unfortunately, you can’t tell what was copied onto them.  Look for attempts to hide what was done, such as removal of USB insertion data from the current registry (but failing to remove from the registry snapshot).  Look at the WiFi connection history for connections to the competitor’s network.  It is very important to explain the situation to the forensics person and communicate with him/her frequently about what you each have found in order to develop a picture of what actually happened.

If you hire someone from a competitor and there is suspicion that they took data from their previous employer, ambush them and take all their devices before they have a chance to destroy anything.  This will show the judge that you were not complicit.

When investigating someone who quit on bad terms, look for deals with “special terms” or side letter deals — they may be a sign of fraud.  Be careful about any applicable European laws.  Europe says you can’t move the data to the U.S., but the SEC doesn’t care.  Can you use a review tool in the U.S. with the data in Europe?  Officially, no, but it is less bad than moving the data.  Everyone says you can’t produce the data from Europe, but everyone does.

Make sure your agreements are up to date and are written by the attorney that will litigate them.

Just Patch and Pray?

A study by Verizon found that 90% of breaches are caused by employees.  Info governance can reduce risk.  Keeping everything is risky due to e-discovery, risk of breach, and having to explain loss of old data to customers.

Email problems include bad passwords, use of the same password on multiple websites so having one hacked can allow access to others, and getting inside the network (emailed malware).  2-factor authentication is recommended.  Don’t send an email to the SEC with BCC to the client or the client might hit reply-all and say something problematic — instead, email only the SEC and forward a copy to the client later.

Mobile technology can create discovery headaches, needs to be managed/updated/wiped remotely, and can easily be lost.  Encrypt, audit, and apply anti-malware.  BYOD should be limited to enterprise-ready devices.  Avoid insecure WiFi.  Control access to enterprise data.  Secure data in transit.  Ensure that devices get updated/upgraded.

Unaware or non-compliant employees need training.  When training to spot phishing emails, services can test the employees by sending phishing emails that report who clicked on them.

Vendors and third parties that handle enterprise data can be a problem.  Regulators require vendor oversight.  Limit access to necessary systems.  Segregate sensitive data.  Beware of payroll vendors and the possibility of identity theft from the data they hold.  Make sure cybersecurity insurance policy covers vendors.

Employees want data access from anywhere.  Encrypting email is hard — better to use collaborative workspaces.  Home networks should be protected.  Don’t use the neighbor’s Internet connection.

After having a breach, 39% of companies still don’t form a response plan.  There is no federal data breach notification law, but many states have such laws.  You may need to notify employees, customers, and the attorney general in some specific time frame.  Also notify your insurance company.

Mergers & Acquisitions: Strategy and Execution Concerns

I couldn’t attend this one.

# ISO approves eDiscovery standards development

According to the Infosecurity article “ISO approves eDiscovery standards development,” an international standard is being developed that “addresses terminology, provides an overview of eDiscovery and ESI, and then addresses a range of technological and process challenges.”  The first working draft is due in early July, with comments on the draft due in early September.  A linked article in Enterprise Communications elaborates that “The standard will also look to clarify any issues that aren’t directly dealt with in the Federal Rules of Civil Procedure.”

This is certainly an interesting development.  Since it is an international standard, it will be challenging to make it compatible with many different bodies of law.  Are they really doing something different from what EDRM already does?  Will it get much buy-in from the courts and practitioners in the field?

# Predictive Coding Performance and the Silly F1 Score

This article describes how to measure the performance of predictive coding algorithms for categorizing documents.  It describes the precision and recall metrics, and explains why the F1 score (also known as the F-measure or F-score) is virtually worthless.

Predictive coding algorithms start with a training set of example documents that have been tagged as either relevant or not relevant, and identify words or features that are useful for predicting whether or not other documents are relevant.  “Relevant” will usually mean responsive to a discovery request in litigation, or having a particular issue code, or maybe privileged (although predictive coding may not be well-suited for identifying privileged documents).  Most predictive coding algorithms will generate a relevance score or rank for each document, so you can order the documents with the ones most likely to be relevant (according to the algorithm) coming first and the ones most likely to not be relevant coming toward the end of the list.  If you apply several different algorithms to the same set of documents and generate several ordered lists of documents, what quantities should you compute to assess which algorithm made the best predictions for this document set?

You could select some number of documents, n, from the top of each list and count how many of the documents truly are relevant.  Divide the number of relevant documents by n and you have the precision, i.e. the fraction of selected documents that are relevant.  High precision is good since it means that the algorithm has done a good job of moving the relevant documents to the top of the list.  The other useful thing to know is the recall, which is the fraction of all relevant documents in the document set that were included in the algorithm’s top n documents.  Have we found 80% of the relevant documents, or only 10%?  If the answer is 10%, we probably need to increase n, i.e. select a larger set of top documents, if we are going to argue to a judge that we’re making an honest effort at finding relevant documents.  As we increase n, the recall will increase each time we encounter another document that is truly relevant.  The precision will typically decrease as we increase n because we are including more and more documents that the algorithm is increasingly pessimistic about.  We can measure precision and recall for many different values of n to generate a graph of precision as a function of recall (n is not shown explicitly, but higher recall corresponds to higher n values; the relationship is monotonic but not linear).

The graph shows hypothetical results for three different algorithms.  Focus first on the blue curve representing the first algorithm.  At 10% recall it shows a precision of 69%.  So, if we work our way down from the top of the document list generated by algorithm 1 and review documents until we’ve found 10% of the documents that are truly relevant, we’ll find that 69% of the documents we encounter are truly relevant while 31% are not relevant.  If we continue to work our way down the document list, reviewing documents that the algorithm thinks are less and less likely to be relevant, and eventually get to the point where we’ve encountered 70% of the truly relevant documents (70% recall), 42% of the documents we review along the way will be truly relevant (42% precision) and 58% will not be relevant.

Turn now to the second algorithm, which is shown in green.  For all values of recall it has a lower precision than the first algorithm.  For this document set it is simply inferior (glossing over subtleties like result diversity) to the first algorithm — it returns more irrelevant documents for each truly relevant document it finds, so a human reviewer will need to wade through more junk to attain a desired level of recall.  Of course, algorithm 2 might triumph on a different document set where the features that distinguish a relevant document are different.

The third algorithm, shown in orange, is more of a mixed bag.  For low recall (left side of graph) it has higher precision than any of the other algorithms.  For high recall (right side of graph) it has the worst precision of the three algorithms.  If we were designing a web search engine to compete with Google, algorithm 3 might be pretty attractive because the precision at low recall is far more important than the precision at high recall since most people will only look at the first page or two of search results, not the 1000th page.  E-Discovery is very different from web search in that regard — you need to find most of the relevant documents, not just the 10 or 20 best ones.  Precision at high recall is critical for e-discovery, and that is where algorithm 3 falls flat on its face.  Still, there is some value in having high precision at low recall since it may help you decide early in the review that the evidence against your side is bad enough to warrant settling immediately instead of continuing the review.

You may have noticed that all three algorithms have 15% precision at 100% recall.  Don’t take that to mean that they are in any sense equally good at high recall — they are actually all completely worthless at 100% recall.  In this example, the prevalence of relevant documents is 15%, meaning that 15% of the documents in the entire document set are relevant.  If your algorithm for finding relevant documents was to simply choose documents randomly, you would achieve 15% precision for all recall values.  What makes algorithm 3 a disaster at high recall is the fact that it drops close to 15% precision long before reaching 100% recall, losing all ability to differentiate between documents that are relevant and those that are not relevant.
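To make the precision and recall definitions concrete, here is a minimal sketch that computes both for the top n documents of a ranked list. The function name `precision_recall_at` and the 20-document ranking are made up for illustration; the toy ranking has 3 relevant documents so that its prevalence matches the 15% used in the example.

```python
def precision_recall_at(ranked_labels, n):
    """Precision and recall for the top n documents of a ranked list.

    ranked_labels: list of booleans ordered from most to least likely
    relevant (according to the algorithm); True means truly relevant.
    """
    total_relevant = sum(ranked_labels)
    found = sum(ranked_labels[:n])
    precision = found / n              # fraction of selected docs that are relevant
    recall = found / total_relevant    # fraction of all relevant docs selected
    return precision, recall


# Hypothetical ranking: 20 documents, 3 relevant (15% prevalence).
ranked = [True, False, True, False, False, False, True] + [False] * 13

for n in (1, 3, 7, 20):
    p, r = precision_recall_at(ranked, n)
    print(f"n={n:2d}  precision={p:.2f}  recall={r:.2f}")
```

Once n covers the entire document set, the precision equals the prevalence (0.15 here) no matter how the documents were ordered, which is why precision at the far right of the graph says nothing about algorithm quality.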

As alluded to earlier, high precision is desirable to reduce the amount of manual document review.  Let’s make that idea more precise.  Suppose you are the producing party in a case.  You need to produce a large percentage of the responsive documents to satisfy your duty to the court.  You use predictive coding to order the documents based on the algorithm’s prediction of which documents are most likely to be responsive.  You plan to manually review any documents that will be produced to the other side (e.g., to verify responsiveness, check for privilege, perform redactions, or just be aware of the evidence you’ll need to counter in court), so how many documents will you need to review, including non-responsive documents that the algorithm thought were responsive, to reach a reasonable recall?  Here is the formula (excluding training and validation sets):

$\text{fraction\_of\_document\_set\_to\_review} = \frac{\text{prevalence} \times \text{recall}}{\text{precision}}$

The recall is the desired level you want to reach, and the precision is measured at that recall level.  The prevalence is a property of the document set, so the only quantity in the equation that depends on the predictive coding algorithm is the precision at the desired recall.  Here is a graph based on the precision vs. recall relationships from earlier:

If your goal is to find at least 70% of the responsive documents (70% recall), you’ll need to review at least 25% of the documents ranked most highly by algorithm 1.  Keep in mind that only 15% of the whole document set is responsive in our example (i.e. 15% prevalence), so aiming to find 70% of the responsive documents by reviewing 25% of the document set means reviewing 10.5% of the document set that is responsive (70% of 15%) and 14.5% of the document set that is not responsive, which is consistent with our precision-recall graph showing 42% precision at 70% recall (10.5/25 = 0.42) for algorithm 1.  If you had the misfortune of using algorithm 3, you would need to review 50% of the entire document set just to find 70% of the responsive documents.  To achieve 70% recall you would need to review twice as many documents with algorithm 3 compared to algorithm 1 because the precision of algorithm 3 at 70% recall is half the precision of algorithm 1.
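The arithmetic above can be checked with a few lines of code. This is just a sketch of the formula; the function name `fraction_to_review` is an illustrative choice, and the 21% precision for algorithm 3 is inferred from the statement that it is half the precision of algorithm 1 at 70% recall.

```python
def fraction_to_review(prevalence, recall, precision):
    """Fraction of the document set to review to reach the target recall.

    precision must be the precision measured at that recall level.
    Training and validation sets are excluded, as in the formula.
    """
    return prevalence * recall / precision


alg1 = fraction_to_review(prevalence=0.15, recall=0.70, precision=0.42)
alg3 = fraction_to_review(prevalence=0.15, recall=0.70, precision=0.21)
print(f"algorithm 1: {alg1:.2f}  algorithm 3: {alg3:.2f}")
```

Halving the precision at the target recall doubles the review effort, since the prevalence and the recall target are the same for both algorithms.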

Notice how the graph slopes upward more and more rapidly as you aim for higher recall because it becomes harder and harder to find a relevant document as more and more of the low hanging fruit gets picked.  So, what recall should you aim for in an actual case?  This is where you need to discuss the issue of proportionality with the court.  Each additional responsive document is, on average, more expensive than the last one, so a balance must be struck between cost and the desire to find “everything.”  The appropriate balance will depend on the matter being litigated.

We’ve seen that recall is important to demonstrate to the court that you’ve found a substantial percentage of the responsive documents, and we’ve seen that precision determines the number of documents that must be reviewed (hence, the cost) to achieve a desired recall.  People often quote another metric, the F1 score (also known as the F-measure or F-score), which is the harmonic mean of the recall and the precision:

$F_1 = \frac{1}{\frac{1}{2}(\frac{1}{\text{recall}}+\frac{1}{\text{precision}})} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The F1 score lies between the value of the recall and the value of the precision, and tends to lie closer to the smaller of the two, so high values for the F1 score are only possible if both the precision and recall are large.

Before explaining why the F1 score is pointless for measuring predictive coding performance, let’s consider a case where it makes a little bit of sense.  Suppose we send the same set of patients to two different doctors who will each screen them for breast cancer using the palpation method (feeling for lumps).  The first doctor concludes that 50 of them need further testing, but the additional testing shows that only 3 of them actually have cancer, giving a precision of 6.0% (these numbers are entirely made up and are not necessarily realistic).  The second doctor concludes that 70 of the patients need further testing, but additional testing shows that only 4 of them have cancer, giving a precision of 5.7%.  Which doctor is better at identifying patients in need of additional testing?  The first doctor has higher precision, but that precision is achieved at a lower level of recall (only found 3 cancers instead of 4).  We know that precision tends to decline with increasing recall, so the fact that the second doctor has lower precision does not immediately lead to the conclusion that he/she is less capable.  Since the F1 score combines precision and recall such that increases in one offset (to some degree) decreases in the other, we could compute F1 scores for the two doctors.  To compute F1 we need to compute the recall, which means that we need to know how many of the patients actually have cancer.  If 5 have cancer, the F1 scores for the doctors will be 0.1091 and 0.1067 respectively, so the first doctor scores higher.  If 15 have cancer, the F1 scores will be 0.0923 and 0.0941 respectively, so the second doctor scores higher.  Increasing the number of cancers from 5 to 15 decreases the recall values, bringing them closer to the precision values, which causes the recall to have more impact (relative to the precision) on the F1 score.
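The F1 values quoted for the two doctors can be reproduced from the raw counts. The sketch below uses a hypothetical helper, `f1_from_counts`, and the same made-up numbers as the example.

```python
def f1_from_counts(flagged, true_positives, actual_positives):
    """F1 score from the number flagged, the number of those that were
    truly positive, and the total number of actual positives."""
    precision = true_positives / flagged
    recall = true_positives / actual_positives
    return 2 * precision * recall / (precision + recall)


results = {}
for actual_cancers in (5, 15):
    doctor1 = f1_from_counts(flagged=50, true_positives=3, actual_positives=actual_cancers)
    doctor2 = f1_from_counts(flagged=70, true_positives=4, actual_positives=actual_cancers)
    results[actual_cancers] = (round(doctor1, 4), round(doctor2, 4))
    print(actual_cancers, results[actual_cancers])
```

The ranking of the two doctors flips depending only on how many patients actually have cancer, even though neither doctor's decisions changed.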

The harmonic mean is commonly used to combine rates.  For example, you should be able to convince yourself that the appropriate way to compute the average MPG fuel efficiency rating for a fleet of cars is to take the harmonic mean (not the arithmetic mean) of the MPG values of the individual cars.  But, the F1 score is the harmonic mean of two rates having different meanings, not the same rate measured for two different objects.  It’s like adding the length of your foot to the length of your arm.  They are both lengths, but does the result from adding them really make any sense?  A 10% change in the length of your arm would have much more impact than a 10% change in the length of your foot, so maybe you should add two times the length of your foot to your arm.  Or, maybe add three times the length of your foot to your arm.  The relative weighting of your foot and arm lengths is rather arbitrary since the sum you are calculating doesn’t have any specific use that could nail down the appropriate weighting.  The weighting of precision vs. recall in the F1 score is similarly arbitrary.  If you want to weight the recall more heavily, there is a metric called F2 that does that.  If you want to weight the precision more heavily, F0.5 does that.  In fact, there is a whole spectrum of F measures offering any weighting you want — you can find the formula in Wikipedia.  In our example of doctors screening for cancer, what is the right weighting to appropriately balance the potential loss of life by missing a cancer (low recall) against the cost and suffering of doing additional testing on many more patients that don’t have cancer (higher recall obtained at the expense of lower precision)?  I don’t know the answer, but it is almost certainly not F1.  Likewise, what is the appropriate weighting for predictive coding?  Probably not F1.
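As a rough illustration of how arbitrary the weighting is, the sketch below applies the standard F-beta formula, (1 + β²) × precision × recall / (β² × precision + recall), to the two doctors from the earlier example, assuming 5 patients actually have cancer. The function name `f_beta` is an illustrative choice.

```python
def f_beta(precision, recall, beta):
    """General F measure; beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily, beta = 1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Doctors from the screening example, assuming 5 patients have cancer.
doctor1 = {"precision": 3 / 50, "recall": 3 / 5}
doctor2 = {"precision": 4 / 70, "recall": 4 / 5}

winners = {}
for beta in (0.5, 1, 2):
    s1 = f_beta(doctor1["precision"], doctor1["recall"], beta)
    s2 = f_beta(doctor2["precision"], doctor2["recall"], beta)
    winners[beta] = "doctor 1" if s1 > s2 else "doctor 2"
    print(f"beta={beta}: doctor 1 = {s1:.4f}, doctor 2 = {s2:.4f}")
```

With these numbers the first doctor scores higher under F0.5 and F1 while the second scores higher under F2, so the choice of weighting alone decides which doctor looks better.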

Why did we turn to the F1 score when comparing doctors doing cancer screenings?  We did it because we had two different recall values for the doctors, so we couldn’t compare precision values directly.  We used the F1 score to adjust for the tradeoff between precision and recall, but we did so with a weighting that was arbitrary (sometimes pronounced “wrong”).  Why were we stuck with two different recall values for the two doctors?  Unlike a predictive coding algorithm, we can’t ask a doctor to rank a group of patients based on how likely he/she thinks it is that each of them has cancer.  The doctor either feels a lump, or he/she doesn’t.  We might expand the doctor’s options to include “maybe” in addition to “yes” and “no,” but we can’t expect the doctor to say that one patient scores 85.39 for cancer while another scores 79.82, which is what we would need to get a definite ordering.  We don’t have that problem (normally) when we want to compare predictive coding algorithms: we can choose whatever recall level we are interested in and measure the precision of all algorithms at that recall, so we can compare apples to apples instead of apples to oranges.
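Measuring precision at a chosen recall level is straightforward when an algorithm produces a ranking.  Here is a minimal sketch, assuming we have the true relevance label for every document in ranked order.  The function `precision_at_recall` and the two tiny rankings `algo_a` and `algo_b` are hypothetical, invented purely for illustration; they are not the algorithms discussed in this article.

```python
import math

def precision_at_recall(ranked_labels, target_recall):
    """Precision after reviewing just enough top-ranked documents to hit target_recall.

    ranked_labels: 1 (relevant) or 0 (not relevant) for each document, in the
    order the algorithm ranked them (most likely relevant first).
    """
    total_relevant = sum(ranked_labels)
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for reviewed, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return found / reviewed
    raise ValueError("target recall cannot be reached")

# Hypothetical rankings of the same 10 documents (4 relevant) by two algorithms
algo_a = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
algo_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

for name, ranking in (("A", algo_a), ("B", algo_b)):
    print(f"algorithm {name}: precision at 75% recall = "
          f"{precision_at_recall(ranking, 0.75):.2f}")
```

In this made-up data, algorithm A reaches 75% recall after 4 documents (precision 0.75), while algorithm B needs 6 documents (precision 0.50).  Both are measured at the same recall, so the comparison is direct.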

Furthermore, a doctor’s ability to choose an appropriate threshold for sending people for additional testing is part of the measure of his/her ability.  We should therefore allow him/her to decide how many people to send for additional testing, not just which people, and measure whether his/her choice strikes the right balance to achieve the best outcomes.  That necessitates comparing different levels of recall for different doctors.  In predictive coding it is not the algorithm’s job to decide when we should stop looking for additional relevant documents; that is dictated by proportionality.  If the litigation is over a relatively small amount of money, a modest target recall may be accepted to keep review costs reasonable relative to the amount of money at stake.  If a great deal of money is at stake, pushing for a high recall that will require reviewing a lot of irrelevant documents may be warranted.  The point is that the appropriate tradeoff between low recall with high precision and high recall with lower precision depends on the economics of the case, so it cannot be captured by a statistic with a fixed (arbitrary) weighting like the F1 score.

Here is a graph of the F1 score for the three algorithms we’ve been looking at:

Remember that the F1 score can only be large if both the recall and precision are large.  At the left edge of the chart the recall is low, so the F1 score is small.  At the right edge the recall is high but the precision is typically low, so the F1 score is small.  Note that algorithm 1 has its maximum F1 score of 0.264 at 62% recall, while algorithm 3 has its maximum F1 score of 0.242 at 44% recall.  Comparing maximum F1 scores to identify the best algorithm is really an apples to oranges comparison (comparing values at different recall levels), and in this case it would lead you to conclude that algorithm 3 is the second best algorithm when we know that it is by far the worst algorithm at high recall.  Of course, you might retort that algorithms should be compared by comparing F1 scores at the same recall level instead of comparing maximum F1 scores, but the F1 score would really serve no purpose in that case — we could just compare precision values.
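The same effect can be reproduced with toy data.  The sketch below traces the F1 score down a ranked list and reports where it peaks, then compares that with the precision at 100% recall.  The function `f1_curve` and the rankings `front_loaded` and `steady` are hypothetical and chosen only to illustrate the point; they are not the three algorithms in the graph.

```python
def f1_curve(ranked_labels):
    """Return (recall, precision, f1) after each document in the ranking."""
    total_relevant = sum(ranked_labels)
    points = []
    found = 0
    for reviewed, label in enumerate(ranked_labels, start=1):
        found += label
        precision = found / reviewed
        recall = found / total_relevant
        # 2*TP / (reviewed + total_relevant) is algebraically equal to 2PR/(P+R)
        f1 = 2 * found / (reviewed + total_relevant)
        points.append((recall, precision, f1))
    return points

# Hypothetical rankings of 10 documents, 4 of which are relevant
front_loaded = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
steady = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

for name, ranking in (("front_loaded", front_loaded), ("steady", steady)):
    curve = f1_curve(ranking)
    peak_recall, _, peak_f1 = max(curve, key=lambda point: point[2])
    # Precision at the first point where every relevant document has been found
    full_recall_precision = next(p for r, p, _ in curve if r == 1.0)
    print(f"{name}: max F1={peak_f1:.3f} at recall={peak_recall:.2f}, "
          f"precision at 100% recall={full_recall_precision:.3f}")
```

Here `front_loaded` has the higher maximum F1 (0.857 at 75% recall vs. 0.800 at 100% recall), yet at 100% recall its precision is only 0.400 compared to 0.667 for `steady`.  Ranking by maximum F1 would pick the algorithm that performs worse when high recall is required.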

In summary, recall and precision are metrics that relate very directly to important aspects of document review: the need to identify a substantial portion of the relevant documents (recall), and the need to keep costs down by avoiding review of irrelevant documents (precision).  A predictive coding algorithm orders the document list to put the documents that are expected to have the best chance of being relevant at the top.  As the reviewer works his/her way down the document list, recall will increase (more relevant documents found), but precision will typically decrease (an increasing percentage of the documents reviewed are not relevant).  The F1 score attempts to combine the precision and recall in a way that allows comparisons at different levels of recall by balancing increasing recall against decreasing precision, but it does so with a weighting between the two quantities that is arbitrary rather than reflecting the economics of the case.  It is better to compare algorithm performance by comparing precision at the same level of recall, with the recall chosen to be reasonable for the case.

Note:  You can read more about performance measures here, and there is an article on an alternative to F1 that is more appropriate for e-discovery here.