Highlights from Text Analytics Forum 2019

Text Analytics Forum is part of the KMWorld conference. It was held on November 6-7 at the JW Marriott in D.C. Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day. There was a technical track and an applications track. Most of the slides are available here. My photos, including photos of some slides that caught my attention or were not available on the website, are available here. Since most slides are available online, I have only a few brief highlights below.

Automatic summarization comes in two forms: extractive and generative.  Generative summarization doesn’t work very well, and some products are dropping the feature.  Enron emails containing lies tend to be shorter.  When a customer threatens to cancel a service, the language they use may indicate they are really looking to bargain.  Deep learning works well with data, but not with concepts.  For good results, make use of all document structure (titles, boldface, etc.) — search engines often ignore such details.  Keywords assigned to a document by a human are often unreliable or inconsistent.  Having the document’s author write a summary may be more useful.  Rules work better when there is little content (machine learning prefers more content).  Knowledge graphs, which were a major topic at the conference, are better for discovery than for search.

DBpedia provides structured data from Wikipedia for knowledge graphs.  SPARQL is a standardized query language for graph databases, similar to SQL for relational databases.  When using knowledge graphs, the more connections away the answer is, the more likely it is to be wrong.  Knowledge graphs should always start with a good taxonomy or ontology.
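
For anyone who wants to experiment, here is a minimal Python sketch (my own illustration, not something from the talks) that queries the public DBpedia SPARQL endpoint using the SPARQLWrapper package; the particular query is just an example:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask the public DBpedia endpoint for a few scientists and their birthplaces.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?scientist ?birthPlace WHERE {
            ?scientist a dbo:Scientist ;
                       dbo:birthPlace ?birthPlace .
        } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["scientist"]["value"], row["birthPlace"]["value"])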

Social media text (e.g., tweets) contains a lot of noise.  Some software handles both social media and normal text, but some only really works with one or the other.  Sentiment analysis can be tripped up when it only looks at keywords.  For example, compare “product worked terribly” to “I’m terribly happy with the product.”  Humans are only 60-80% accurate at sentiment analysis.
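
As a toy illustration of the keyword problem (my own example, not from any talk), a scorer that simply counts sentiment words mishandles the intensifier “terribly”:

    # Naive keyword-based sentiment: counts positive and negative words with no context.
    NEGATIVE = {"terrible", "terribly", "awful", "broken"}
    POSITIVE = {"happy", "great", "love", "works"}

    def keyword_sentiment(text):
        words = text.lower().replace(".", "").split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(keyword_sentiment("product worked terribly"))              # negative (correct)
    print(keyword_sentiment("I'm terribly happy with the product"))  # neutral: "terribly" wrongly cancels "happy"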

Highlights from Relativity Fest 2019

Relativity Fest celebrated its tenth anniversary at the Hilton in Chicago.  It featured as many as sixteen simultaneous sessions and was attended by about 2,000 people.  You can find my full set of photos here.

The show was well-organized and there were always plenty of friendly staff around to help.  The keynote introduced the company’s new CEO, Mike Gamson.  Various staff members talked about new functionality that is planned for Relativity.  A live demo of the coming Aero UI highlighted its ability to display very large (dozens of MB) documents quickly.

I mostly attended the developer sessions.  During the first full day, the sessions I attended were packed and there were people standing in the back.  It thinned out a bit during the remaining days.  The on-premises version of Relativity will be switching from quarterly releases to annual releases because most people don’t want to upgrade so often.  Relativity One will have updates quarterly or faster.  There seems to be a major push to make APIs more uniform and better documented.  There was also a lot of emphasis on reducing breakage of third-party tools with new releases.

Highlights from IG3 Mid-Atlantic 2019

The first Mid-Atlantic IG3 was held at the Watergate Hotel in Washington, D.C. It was a day and a half long, with a keynote followed by two concurrent sets of sessions.  I’ve provided some notes below from the sessions I was able to attend.  You can find my full set of photos here.

Big Foot, Aliens, or a Culture of Governance: Are Any of Them Real?
In 2012, 12% of companies had a chief data officer, but now 63.4% do.  Better data management can give insight into the business.  It may also be possible to monetize the data.  Cigna has used Watson, but you do have to put work into teaching it.  Remember the days before GPS, when you had to keep driving directions in your head or use printed maps.  Data is now more available.

Practical Applications of AI and Analytics: Gain Insights to Augment Your Review or End It Early
Opposing counsel may not even agree to threading, so getting approval for AI can be a problem.  If the requesting party is the government, they want everything and they don’t care about the cost to you.  TAR 2.0 allows you to jump into review right away with no delay for training by an expert, and it is becoming much more common.  TAR 1.0 is still used for second requests [presumably to produce documents without review].  With TAR 1.0 you know how much review you’ll have to do if you are going to review the docs that will potentially be produced, whereas you don’t with TAR 2.0 [though you could get a rough estimate with additional sampling].  Employees may utilize code words, and some people such as traders use unique lingo — will this cause problems for TAR?  It is useful to use unsupervised learning (clustering) to identify issues and keywords.  Negotiation over TAR use can sometimes be more work than doing the review without TAR.  It is hard to know the size of the benefit that TAR will provide for a project in advance, which can make it hard to convince people to use it.  Do you have to disclose the use of TAR to the other side?  If you are using it to cull, rather than just to prioritize the review, probably.  Courts will soon require or encourage the use of TAR.  There is a proportionality argument that it is unreasonable to not use it.  Data volumes are skyrocketing.  90% of the data in the world was created in the last 2 years.

Is There Room for Governance in Digital Transformation?
I wasn’t able to attend this one.

Investigative Analytics and Machine Learning; The Right Mindset, Tools, and Approach can Make all the Difference
You can use e-discovery AI tools to get the investigation going.  Some people still use paper, and the meta data from the label on the box containing the documents may be all you have.  While keyword search may not be very effective, the query may be a starting point for communicating what the person is looking for so you can figure out how to find it.  Use clustering to look for outliers.  Pushing people to use tech just makes them hate you.  Teach them in a way that is relatable.  Listen to the people that are trying to learn and see what they need.  Admit that tech doesn’t always work.  Don’t start filtering the data down too early — you need to understand it first.  It is important to be able to predict things such as cost.  Figure out which people to look at first (tiering).  Convince people to try analytics by pointing out how it can save time so they can spend more time with their kids.  Tech vendors need to be honest about what their products can do (users need to be skeptical).

CCPA and New US Privacy Laws Readiness
I wasn’t able to attend this one.

Ick, Math! Ensuring Production Quality
I moderated this panel, so I didn’t take notes.

Effective Data Mapping Policies and Avoiding Pitfalls in GDPR and Data Transfers for Cross-Border Litigations and Investigations
I wasn’t able to attend this one.

Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
I wasn’t able to attend this one.

Selecting eDiscovery Platforms and Vendors
People often pick services offered by their friends rather than doing an unbiased analysis.  Often do an RFI, then RFP, then POC to see what you really get out of the system.  Does the vendor have experience in your industry?  What is billable vs non-billable?  Are you paying for peer QC?  What does data in/out mean for billing?  Do a test run with the vendor before making any decisions for the long term.  Some vendors charge by the user, instead of, or in addition to, charging based on data volume.  What does “unlimited” really mean?  Government agencies tend to demand a particular way of pricing, and projects are usually 3-5 years.  Charging a lot for a large number of users working on a small database really annoys the customer.  Per-user fees are really a Relativity thing, and other platforms should not attempt it.  Firms will bring data in house to avoid user fees unless the data is too big (e.g., 10GB).  How do dupes impact billing?  Are they charging to extract a dupe?  Concurrent user licenses were annoying, so many moved to named user licenses (typically 4 or 5 to one).  Concurrent licenses may have a burst option to address surges in usage, perhaps setting to the new level.  Some people use TAR on all cases while others in the firm/company never use it, so keep that in mind when licensing it.  Forcing people to use an unfamiliar platform to save money can be a mistake since there may be a lot of effort required to learn it.

eDiscovery Support and Pricing Model — Do we have it all Wrong?
Various pricing models: data in/out + hosting + reviewers, based on number of custodians, or bulk rate (flat monthly fee).  Redaction, foreign language, and privilege logs used to be separate charges, but there is now pressure to include them in the base fee.  Some make processing free but compensate by raising the rate for review.  RFP / procurement is a terrible approach for ediscovery because you work with and need to like the vendor/team.  Ask others about their experience with the vendor, though there is now less variability in quality between the vendors.  Encourage the vendor to make suggestions and not just be an order-taker.  Law firms often blame the vendor when a privileged document is produced, and the lack of transparency about what really happened is frustrating.  The client needs good communication with both the law firm and the vendor.  Law firms shouldn’t offer ediscovery services unless they can do it as well as the vendors (law firms have a fiduciary duty).

Still Looking for the Data
I wasn’t able to attend this one.

Recycling Your eDiscovery Data: How Managing Data Across Your Portfolio can Help to Reduce Wasteful Spending
I wasn’t able to attend this one.

Ready, Fire, Aim!  Negotiating Discovery Protocols
The Mandatory Initial Discovery Pilot Program in the Northern District of Illinois and the District of Arizona requires production within 70 days from filing in order to motivate both sides to get going and cooperate.  One complaint about this is that people want a motion to dismiss to be heard before getting into ediscovery.  Can’t get away with saying “give us everything” under the pilot program since there is not enough time for that to be possible.  Nobody wants to be the unreasonable party under such a tight deadline.  The Commercial Division of the NY Supreme Court encourages categorical privilege logs.  You describe the category, say why it is privileged, and specify how many documents were redacted vs being withheld in their entirety.  Make a list of third parties that received the privileged documents (not a full list of all from/to).  It can be a pain to come up with a set of categories when there is a huge number of documents.  When it comes to TAR protocols, one might disclose the tool used or whether only the inclusive email was produced.  Should the seed set size or elusion set size be disclosed?  Why is the producing party disclosing any of this instead of just claiming that their only responsibility is to produce the documents?  Disclosing may reduce the risk of having a fight over sufficiency.  Government regulators will just tell you to give them everything exactly the way they want it.  When responding to a criminal antitrust investigation you can get in trouble if you standardize the timezone in the data.  Don’t do threading without consent.  A second request may require you to provide a list of all keywords in the collection and their frequencies.  Be careful about orders requiring you to produce the full family — this will compel you to produce non-responsive attachments.

Document Review Pricing Reset
A common approach is hourly pricing for everything (except hosting).  This may be attractive to the customer because other approaches require the vendor to take on risk that the labor will be more than expected and they will build that into the price.  If the customer doesn’t need predictable cost, they won’t want to pay (implicitly) for insurance against a cost overrun.  It is a choice between predictability of cost and lowest cost.  Occasionally review is priced on a per-document basis, but it is hard to estimate what the fair price is since data can vary.  Per-document pricing puts some pressure on the review team to better manage the process for efficiency.  Some clients are asking for a fixed price to handle everything for the next three years.  A hybrid model has a fixed monthly fee with a lower hourly rate for review, and the lower hourly rate makes paying for extra QC review less painful.  Using separate vendors and review companies can have a downside if reviewers sit idle while the tech is not ready.  On the other hand, if there are problems with the reviewers it is nice to have the option to swap them out for another review team.

Finding Common Ground: Legal & IT Working Together
I wasn’t able to attend this one.

Highlights from EDRM Workshop 2019

The annual EDRM Workshop was held at Duke Law School, starting on the evening of May 15th and ending at lunch time on the 17th.  It consisted of a mixture of panels, presentations, working group reports, and working sessions focused on various aspects of e-discovery.  I’ve provided some highlights below.  You can find my full set of photos here.

Herb Roitblat presented a paper on fear of missing out (FOMO).  If 80% recall is achieved, is it legitimate for the requesting party to be concerned about what may have been missed in the 20% of the responsive documents that weren’t produced, or are the facts in that 20% duplicative of the facts found in the 80% that was produced?

A panel discussed the issues faced by in-house counsel.  Employees want to use the latest tools, but then you have to worry about how to collect the data (e.g., Skype video recordings).  How do you preserve an iPhone?  What if the phone gets lost or stolen?  When doing TAR, can the classifier/model be moved between cases/clients?  New vendors need to be able to explain how they are unique, they need to get established (nobody wants to be on the cutting edge, and it’s hard to get a pilot going), and they should realize that it can take a year to get approval.  There are security/privacy problems with how law firms handle email.  ROI tracking is important.  Analytics is used heavily in investigations, and often in litigation, but they currently only use TAR for prioritization and QC, not to cull the population before review.  Some law firms are averse to putting data in the cloud, but cloud providers may have better security than law firms.

The GDPR team is working on educating U.S. judges about GDPR and developing a code of conduct.  The EDRM reference will be made easier to update.  The AI group is focused on AI in legal (e.g., estimating recidivism, billing, etc.), not implications of AI for the law.  The TAR group’s paper is out.  The Privilege Logs group wants to avoid duplicating Sedona’s effort (sidenote: lawyers need to learn that an email is not priv just because a lawyer was CC’ed on it).  The Stop Words team is trying to educate people about things such as regular expressions, and warned about cases where you want to search for a single letter or a term such as “AN” (for ammonium nitrate).  The Proportionality group talked about the possibility of having a standard set of documents that should be produced for certain types of cases and providing guidelines for making proportionality arguments to the court.

A panel of judges said that cybersecurity is currently a big issue.  Each court has its own approach to security.  Rule 16 conferences need to be taken seriously.  Judges don’t hire e-discovery vendors, so they don’t know costs.  How do you collect a proprietary database?  Lawyers can usually work it out without the judge.  There is good cooperation when the situations of the parties aren’t too asymmetric.  Attorneys need to be more specific in document requests and objections (no boilerplate).  Attorneys should know the case better than the judge, and educate the judge in a way that makes the judge look good.  Know the client’s IT systems and be aware of any data migration efforts.  Stay up on technology (e.g., Slack and text messages).  Have a 502(d) order (some people object because they fear the judge will assume priv review is not needed, but the judges didn’t believe that would happen).  Protect confidential information that is exchanged (what if there is a breach?).  When filing under seal, “attorney’s eyes only” should be used very sparingly, and “confidential” is overused.

TAR vs. Keyword Search Challenge, Round 6 (Instant Feedback)

This was by far the most significant iteration of the ongoing exercise where I challenge an audience to produce a keyword search that works better than technology-assisted review (also known as predictive coding or supervised machine learning).  There were far more participants than in previous rounds, and a structural change in the challenge allowed participants to get immediate feedback on the performance of their queries so they could iteratively improve them.  A total of 1,924 queries were submitted by 42 participants (an average of 45.8 queries per person), and higher recall levels were achieved than in any prior version of the challenge, but the audience still couldn’t beat TAR.

In previous versions of the experiment, the audience submitted search queries on paper or through a web form using their phones, and I evaluated a few of them live on stage to see whether the audience was able to achieve higher recall than TAR.  Because the number of live evaluations was so small, the audience had very little opportunity to use the results to improve their queries.  In the latest iteration, participants each had their own computer in the lab at the 2019 Ipro Tech Show, and the web form evaluated each query and gave the user immediate feedback on the recall achieved.  Furthermore, it displayed the relevance and important keywords for each of the top 100 documents matching the query, so participants could quickly discover useful new search terms to tweak their queries.  This gave participants a significant advantage over a normal e-discovery scenario, since they could try an unlimited number of queries without incurring any cost to make relevance determinations on the retrieved documents in order to decide which keywords would improve the queries.  The number of participants was significantly larger than in any of the previous iterations, and they had a full 20 minutes to try as many queries as they wanted.  It was the best chance an audience has ever had of beating TAR.  They failed.

To do a fair comparison between TAR and the keyword search results, recall values were compared for equal amounts of document review effort.  In other words, for a specified amount of human labor, which approach gave the best production?  For the search queries, the top 3,000 documents matching the query were evaluated to determine the number that were relevant so recall could be computed (the full population was reviewed in advance, so the relevance of all documents was known). That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with review of 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.  Participants could choose to submit queries for any of three topics: biology, medical industry, or law.
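
For concreteness, here is a rough sketch (my own, not the actual evaluation code) of the recall-at-fixed-review-effort calculation described above, which works because the relevance of every document in the population is already known:

    def recall_at_review_budget(ranked_doc_ids, is_relevant, budget):
        """Recall achieved if only the top `budget` documents in a ranking are reviewed.

        ranked_doc_ids: document ids sorted by search score or TAR relevance score
        is_relevant:    dict mapping document id -> True/False (known for the full population)
        budget:         number of documents to be reviewed (e.g., 3,000 or 6,000)
        """
        total_relevant = sum(is_relevant.values())
        found = sum(is_relevant[doc_id] for doc_id in ranked_doc_ids[:budget])
        return found / total_relevant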

The results below labeled “Avg Participant” are computed by finding the highest recall achieved by each participant and averaging those values together.  These values are surely somewhat inflated, since in practice one would probably not go through so many iterations of honing a query (especially because evaluating the efficacy of a query would normally involve considerable labor instead of being free and instantaneous).  Still, I wanted to give the participants as much advantage as I could, and including all of the queries instead of just the best ones would have biased the results too low due to people making mistakes or experimenting with bad queries just to explore the documents.  The results labeled “Best Participant” show the highest recall achieved by any participant (computed separately for Top 3,000 and Top 6,000, so they may be different queries).

Biology Recall
                  Top 3,000   Top 6,000
Avg Participant      54.5%       69.5%
Best Participant     66.0%       83.2%
TAR 3.0 SAL          72.5%       91.0%
TAR 3.0 CAL          75.5%       93.0%

Medical Recall
                  Top 3,000   Top 6,000
Avg Participant      38.5%       51.8%
Best Participant     46.8%       64.0%
TAR 3.0 SAL          67.3%       83.7%
TAR 3.0 CAL          80.7%       88.5%

Law Recall
                  Top 3,000   Top 6,000
Avg Participant      43.1%       59.3%
Best Participant     60.5%       77.8%
TAR 3.0 SAL          63.5%       82.3%
TAR 3.0 CAL          77.8%       87.8%

As you can see from the tables above, the best result for any participant never beat TAR (SAL or CAL) when an equal amount of document review was performed.  Furthermore, the average participant result for Top 6,000 never beat the TAR results for Top 3,000, though the best participant result sometimes did, so TAR typically gives a better result even with half as much review effort expended.  The graphs below show the best results for each participant compared to TAR in blue.  The numbers in the legend are the ID numbers of the participants (the color for a particular participant is not consistent across topics).

[Graph: best recall by participant vs. TAR, biology topic]

[Graph: best recall by participant vs. TAR, medical industry topic]

[Graph: best recall by participant vs. TAR, law topic]

The large number of people attempting the biology topic was probably due to it being the default and the topic I used when demonstrating the software.

One might wonder whether the participants could have done better if they had more than 20 minutes to work on their queries.  The graphs below show the highest recall achieved by any participant as a function of time.  You can see that results improved rapidly during the first 10 minutes, but it became hard to make much additional progress beyond that point.  Also, over half of the audience continued to submit queries after the 20-minute contest, while I was giving the remainder of the presentation.  40% of the queries were submitted during the first 10 minutes, 40% during the second 10 minutes, and 20% while I was talking.  Since roughly the same number of queries were submitted in the second 10 minutes as in the first, but much less progress was made, I think it is safe to say that time was not a big factor in the results.

[Graph: highest participant recall over time, biology topic]

[Graph: highest participant recall over time, medical industry topic]

[Graph: highest participant recall over time, law topic]

In summary, even with a large pool of participants, ample time, and the ability to hone search queries based on instant feedback, nobody was able to generate a better production than TAR when the same amount of review effort was expended.  It seems fair to say that keyword search often requires twice as much document review to achieve a production that is as good as what you would get with TAR.

Highlights from Ipro Tech Show 2019

Ipro renamed their conference from Ipro Innovations to the Ipro Tech Show this year.  As always, it was held at the Talking Stick Resort in Arizona and it was very well organized.  It started with a reception on April 29th that was followed by two days of talks.  There were also training days bookending the conference on April 29th and May 2nd.  After the keynote on Tuesday morning, there were five simultaneous tracks for the remainder of the conference, including a lot of hands-on work in computer labs.  I was only able to attend a few of the talks, but I’ve included my notes below.  You can find my full set of photos here.  Videos and slides from the presentations are available here.

Dean Brown, who has been Ipro’s CEO for eight months, opened the conference with some information about himself and where the company is headed.  He mentioned that the largest case in a single Ipro database so far was 11 petabytes from 400 million documents.  Q1 2019 was the best quarter in the company’s history, and they had a 98% retention rate.  They’ve doubled spending on development and other departments.

Next, there was a panel where three industry experts discussed artificial intelligence.  AI can be used to analyze legal bills to determine which charges are reasonable.  Google uses AI to monitor and prohibit behaviors within the company, such as stopping your account from being used to do things when you are supposed to be away.  Only about 5% of the audience said they were using TAR.  It was hypothesized that this is due to FRCP 26(g)’s requirement to certify the production as complete and correct.  Many people use Slack instead of e-mail, and dealing with that is an issue for e-discovery.  CLOC was mentioned as an organization helping corporations get a handle on legal spending.

The keynote was given by Kevin Surace, and mostly focused on AI.  You need good data and have to be careful about spurious correlations in the data (he showed various examples that were similar to what you find here).  An AI can watch a video and supplement it with text explaining what the person in the video is doing.  One must be careful about fast changing patterns and black swan events where there is no data available to model.  Doctors are being replaced by software that is better informed about the most recent medical research.  AI can review an NDA faster and more accurately than an attorney.  There is now a news channel in China using an AI news anchor instead of a human to deliver the news.  With autonomous vehicles, transportation will become free (supported by ads in the vehicle).  AI will have an impact 100 times larger than the Internet.

I gave a talk titled “Technology: The Cutting Edge and Where We’re Headed” that focused on AI.  I started by showing the audience five pairs of images from WhichFaceIsReal.com and challenged them to determine which face was real and which was generated by an AI.  When I asked if anyone got all five right, I only saw one person raise their hand.  When I asked if anyone got all five wrong, I saw three hands go up.  Admittedly, I picked image pairs that I thought were particularly difficult, but the result is still a little scary.

I also gave a talk titled “TAR Versus Keyword Challenge” where I challenged the audience to construct a keyword search that worked better than technology-assisted review.  The format of this exercise was very different from previous iterations, making it easy for participants to test and hone their queries.  We had 1,924 queries submitted by 42 participants.  They achieved the highest recall levels seen so far, but still couldn’t beat TAR.  A detailed analysis is available here.

Misleading Metrics and Irrelevant Research (Accuracy and F1)

If one algorithm achieved 98.2% accuracy while another had 98.6% for the same task, would you be surprised to find that the first algorithm required ten times as much document review to reach 75% recall compared to the second algorithm?  This article explains why some performance metrics don’t give an accurate view of performance for ediscovery purposes, and why that makes a lot of research utilizing such metrics irrelevant for ediscovery.

The key performance metrics for ediscovery are precision and recall.  Recall, R, is the percentage of all relevant documents that have been found.  High recall is critical to defensibility.  Precision, P, is the percentage of documents predicted to be relevant that actually are relevant.  High precision is desirable to avoid wasting time reviewing non-relevant documents (if documents will be reviewed to confirm relevance and check for privilege before production).  In other words, precision is related to cost.  Specifically, 1/P is the average number of documents you’ll have to review per relevant document found.  When using technology-assisted review (predictive coding), documents can be sorted by relevance score and you can choose any point in the sorted list and compute the recall and precision that would be achieved by treating documents above that point as being predicted to be relevant.  One can plot a precision-recall curve by doing precision and recall calculations at various points in the sorted document list.
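
Here is a minimal sketch of that calculation (my own illustration, not code from any product): given the true labels sorted by descending relevance score, it computes recall, precision, and the review effort per relevant document (1/P) at each cutoff.

    def precision_recall_points(labels_by_score):
        """labels_by_score: list of booleans (True = relevant), sorted by descending relevance score.
        Returns (recall, precision, documents reviewed per relevant document found) at each cutoff."""
        total_relevant = sum(labels_by_score)
        points = []
        found = 0
        for reviewed, relevant in enumerate(labels_by_score, start=1):
            found += relevant
            if found:  # skip cutoffs before the first relevant document is reached
                precision = found / reviewed
                recall = found / total_relevant
                points.append((recall, precision, 1.0 / precision))
        return points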

The precision-recall curve to the right compares two different classification algorithms applied to the same task.  To do a sensible comparison, we should compare precision values at the same level of recall.  In other words, we should compare the cost of reaching equally good (same recall) productions.  Furthermore, the recall level where the algorithms are compared should be one that is sensible for ediscovery — achieving high precision at a recall level a court wouldn’t accept isn’t very useful.  If we compare the two algorithms at R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%.  In other words, if you sort by relevance score with the two algorithms and review documents from top down until 75% of the relevant documents are found, you would review 15.2 documents per relevant document found with 1-NN and 1.4 documents per relevant document found with 40-NN.  The 1-NN algorithm would require over ten times as much document review as 40-NN.  1-NN has been used in some popular TAR systems.  I explained why it performs so badly in a previous article.

There are many other performance metrics, but they can be written as a mixture of precision and recall (see Chapter 7 of the current draft of my book).  Anything that is a mixture of precision and recall should raise an eyebrow — how can you mix together two fundamentally different things (defensibility and cost) into a single number and get a useful result?  Such metrics imply a trade-off between defensibility and cost that is not based on reality.  Research papers that aren’t focused on ediscovery often use such performance measures and compare algorithms without worrying about whether they are achieving the same recall, or whether the recall is high enough to be considered sufficient for ediscovery.  Thus, many conclusions about algorithm effectiveness simply aren’t applicable for ediscovery because they aren’t based on relevant metrics.

One popular metric is accuracy, which is the percentage of predictions that are correct.  If a system predicts that none of the documents are relevant and prevalence is 10% (meaning 10% of the documents are relevant), it will have 90% accuracy because its predictions were correct for all of the non-relevant documents.  If prevalence is 1%, a system that predicts none of the documents are relevant achieves 99% accuracy.  Such incredibly high numbers for algorithms that fail to find anything!  When prevalence is low, as it often is in ediscovery, accuracy makes everything look like it performs well, including algorithms like 1-NN that can be a disaster at high recall.  The graph to the right shows the accuracy-recall curve that corresponds to the earlier precision-recall curve (prevalence is 2.633% in this case), showing that it is easy to achieve high accuracy with a poor algorithm by evaluating it at a low recall level that would not be acceptable for ediscovery.  The maximum accuracy achieved by 1-NN in this case was 98.2% and the max for 40-NN was 98.6%.  In case you are curious, the relationship between accuracy, precision, and recall is:
ACC = 1 - \rho (1 - R) - \rho R (1 - P) / P
where \rho is the prevalence.
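
As a quick check of my own, plugging the figures quoted above (prevalence of 2.633%, R=75%, P=6.6% for 1-NN and P=70.4% for 40-NN) into that relationship shows how differently the two algorithms score at the recall level that actually matters:

    def accuracy(prevalence, recall, precision):
        """ACC = 1 - rho*(1 - R) - rho*R*(1 - P)/P, with rho = prevalence."""
        rho = prevalence
        return 1.0 - rho * (1.0 - recall) - rho * recall * (1.0 - precision) / precision

    # At R = 75% with prevalence 2.633%:
    print(accuracy(0.02633, 0.75, 0.066))  # 1-NN:  roughly 0.71
    print(accuracy(0.02633, 0.75, 0.704))  # 40-NN: roughly 0.985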

Another popular metric is the F1 score.  I’ve criticized its use in ediscovery before.  The relationship to precision and recall is:
F_1 = 2 P R / (P + R)
The F1 score lies between the precision and the recall, and is closer to the smaller of the two.  As far as F1 is concerned, 30% recall with 90% precision is just as good as 90% recall with 30% precision (both give F1 = 0.45) even though the former probably wouldn’t be accepted by a court and the latter would.   F1 cannot be large at small recall, unlike accuracy, but it can be moderately high at modest recall, making it possible to achieve a decent F1 score even if performance is disastrously bad at the high recall levels demanded by ediscovery.  The graph to the right shows that 1-NN manages to achieve a maximum F1 of 0.64, which seems pretty good compared to the 0.73 achieved by 40-NN, giving no hint that 1-NN requires ten times as much review to achieve 75% recall in this example.
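
A quick numerical check of that symmetry, using the 30%/90% example above:

    def f1_score(precision, recall):
        """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
        return 2.0 * precision * recall / (precision + recall)

    print(f1_score(0.90, 0.30))  # 0.45 -- 90% precision at 30% recall
    print(f1_score(0.30, 0.90))  # 0.45 -- 30% precision at 90% recall scores the same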

Hopefully this article has convinced you that it is important for research papers to use the right metric, specifically precision (or review effort) at high recall, when making algorithm comparisons that are useful for ediscovery.

TAR vs. Keyword Search Challenge, Round 5

At IG3 West 2018, the audience was challenged to construct a keyword search query that is more effective than technology-assisted review (TAR).  The procedure was the same as round 4, so I won’t repeat the details here.  The audience was small this time and we only got one query submission for each topic.  The submission for the law topic used AND to join the keywords together and matched no articles, so I changed the ANDs to ORs before evaluating it.  The results and queries are below.  TAR beat the keyword searches by a huge margin this time.

Biology Recall
Query            Top 3,000   Top 6,000
Search              20.1%       20.1%
TAR 3.0 SAL         72.5%       91.0%
TAR 3.0 CAL         75.5%       93.0%

Medical Recall
Query            Top 3,000   Top 6,000
Search              28.5%       38.1%
TAR 3.0 SAL         67.3%       83.7%
TAR 3.0 CAL         80.7%       88.5%

Law Recall
Query            Top 3,000   Top 6,000
Search               5.5%        9.4%
TAR 3.0 SAL         63.5%       82.3%
TAR 3.0 CAL         77.8%       87.8%

[Graph: search query vs. TAR recall, biology topic]

[Graph: search query vs. TAR recall, medical industry topic]

[Graph: search query vs. TAR recall, law topic]

biology query: (Evolution OR develop) AND (Darwin OR bird OR cell)
medical query: Human OR body OR medicine OR insurance OR license OR doctor OR patient
law query: securities OR conspiracy OR RICO OR insider

Highlights from IG3 West 2018

The IG3 West conference was held by Ing3nious at the Paséa Hotel & Spa in Huntington Beach, California.  This conference differed from other recent Ing3nious events in several ways.  It was two days of presentations instead of one.  There were three simultaneous panels instead of two.  Between panels there were sometimes three simultaneous vendor technology demos.  There was an exhibit hall with over forty vendor tables.  Due to the different format, I was only able to attend about a third of the presentations.  My notes are below.  You can find my full set of photos here.

Stop Chasing Horses, Start Building Fences: How Real-Time Technologies Change the Game of Compliance and Governance
Chris Surdak, the author of Jerk: Twelve Steps to Rule the World, talked about changing technology and the value of information, claiming that information is the new wealth.  Facebook, Amazon, Apple, Netflix, and Google together are worth more than France [apparently he means the sum of their market capitalizations is greater than the GDP of France, though that is a rather apples-to-oranges comparison since GDP is an annualized number].  We are exposed to persistent ambient surveillance (Alexa, Siri, Progressive Snapshot, etc.).  It is possible to detect whether someone is lying by using video to detect blood flow to their face.  Car companies monetized data about passengers’ weight (measured due to air bags).  Sentiment analysis has a hard time with sarcasm.  You can’t find emails about fraud by searching for “fraud” — discussions about fraudulent activity may be disguised as weirdly specific conversations about lunch.  The problem with graph analysis is that a large volume of talk about something doesn’t mean that it’s important.  The most important thing may be what’s missing.  When RadioShack went bankrupt, its remaining value was in its customer data — remember them asking for your contact info when you bought batteries?  A one-word change to FRCP 37(e) should have changed corporate retention policies, but nobody changed.  The EU’s right to be forgotten is virtually impossible to implement in reality (how to deal with backup tapes?) and almost nobody does it.  Campbell’s has people shipping their DNA to them so they can make diet recommendations to them.  With the GDPR, consent nullifies the protections, so it doesn’t really protect your privacy.

AI and the Corporate Law Department of the Future
Gartner says AI is at the peak of inflated expectations and a trough of disillusionment will follow.  Expect to be able to buy autonomous vehicles by 2023.  The economic downturn of 2008 caused law firms to start using metrics.  Legal will take a long time to adopt AI — managing partners still have assistants print stuff out.  Embracing AI puts a firm ahead of its competitors.  Ethical obligations are also an impediment to adoption of technology, since lawyers are concerned about understanding the result.

Advanced TAR Considerations: A 500 Level Crash Course
Continuous Active Learning (CAL), also called TAR 2.0, can adapt to shifts in the concept of relevance that may occur during the review.  There doesn’t seem to be much difference in the efficiency of SVM vs logistic regression when they are applied to the same task.  There can be a big efficiency difference between different tasks.  TAR 1.0 requires a subject-matter expert for training, but senior attorneys are not always readily available.  With TAR 1.0 you may be concerned that you will be required to disclose the training set (including non-responsive documents), but with TAR 2.0 there is case law that supports that being unnecessary [I’ve seen the argument that the production itself is the training set, but that neglects the non-responsive documents that were reviewed (and used for training) but not produced.  On the other hand, if you are talking about disclosing just the seed set that was used to start the process, that can be a single document and it has very little impact on the result.].  Case law can be found at predictivecoding.com, which is updated at the end of each year.  TAR needs text, not image data.  Sometimes keywords are good enough.  When it comes to government investigations, many agencies (FTC, DOJ) use/accept TAR.  It really depends on the individual investigator, though, and you can’t fight their decision (the investigator is the judge).  Don’t use TAR for government investigations without disclosing that you are doing so.  TAR can have trouble if there are documents having high conceptual similarity where some are relevant and some aren’t.  Should you tell opposing counsel that you’re using TAR?  Usually, but it depends on the situation.  When the situation is symmetrical, both sides tend to be reasonable.  When it is asymmetrical, the side with very little data may try to make things expensive for the other side, so say something like “both sides may use advanced technology to produce documents” and don’t give more detail than that (e.g., how TAR will be trained, who will do the training, etc.) or you may invite problems.  Disclosing the use of TAR up front and getting agreement may avoid problems later.  Be careful about “untrainable documents” (documents containing too little text) — separate them out, and maybe use meta data or file type to help analyze them.  Elusion testing can be used to make sure too many relevant documents weren’t missed.  One panelist said 384 documents could be sampled from the elusion set, though that may sometimes not be enough.  [I have to eat some crow here.  I raised my hand and pointed out that the margin of error for the elusion has to be divided by the prevalence to get the margin of error for the recall, which is correct.  I went on to say that with a sample of 384 giving ±5% for the elusion you would have ±50% for the recall if prevalence was 10%, making the measurement worthless.  The mistake is that while a sample of 384 technically implies a worst case of ±5% for the margin of error for elusion, it’s not realistic for the margin of error to be that bad for elusion because ±5% would occur if elusion was near 50%, but elusion is typically very small (smaller than the prevalence), causing the margin of error for the elusion to be significantly less than ±5%.  The correct margin of error for the recall from an elusion sample of 384 documents would be ±13% if the prevalence is 10%, and ±40% if the prevalence is 1%.  So, if prevalence is around 10% an elusion sample of 384 isn’t completely worthless (though it is much worse than the ±5% we usually aim for), but if prevalence is much lower than that it would be.]

40 Years in 30 Minutes: The Background to Some of the Interesting Issues we Face
Steven Brower talked about the early days of the Internet and the current state of technology.  Early on, a user ID was used to tell who you were, not to keep you out.  Technology was elitist, and user-friendly was not a goal.  Now, so much is locked down for security reasons that things become unusable.  Law firms that prohibit access to social media force lawyers onto “secret” computers when a client needs something taken down from YouTube.  Emails about laws against certain things can be blocked due to keyword hits for the illegal things being described.  We don’t have real AI yet.  The next generation beyond predictive coding will be able to identify the 50 key documents for the case.  During e-discovery, try searching for obscenities to find things like: “I don’t give a f*** what the contract says.”  Autonomous vehicles won’t come as soon as people are predicting.  Snow is a problem for them.  We may get vehicles that drive autonomously from one parking lot to another, so the route is well known.  When there are a bunch of inebriated people in the car, who should it take commands from?  GDPR is silly since email bounces from computer to computer around the world.  The Starwood breach does not mean you need to get a new passport — your passport number was already out there.  To improve your security, don’t try to educate everyone about cybersecurity — you can eliminate half the risk by getting payroll to stop responding to emails asking for W2 data that appear to come from the CEO.  Scammers use the W2 data to file tax returns to get the refunds.  This is so common the IRS won’t even accept reports on it anymore.  You will still get your refund if it happens to you, but it’s a hassle.

Digging Into TAR
I moderated this panel, so I didn’t take notes.  We did the TAR vs. Keyword Search Challenge again.  The results are available here.

After the Incident: Investigating and Responding to a Data Breach
Plan in advance, and remember that you may not have access to the laptop containing the plan when there is a breach. Get a PR firm that handles crises in advance.  You need to be ready for the negative comments on Twitter and Facebook.  Have the right SMEs for the incident on the team.  Assume that everything is discoverable — attorney-client privilege won’t save you if you ask the attorney for business (rather than legal) advice.  Notification laws vary from state to state.  An investigation by law enforcement may require not notifying the public for some period of time.  You should do an annual review of your cyber insurance since things are changing rapidly.  Such policies are industry specific.

Employing Technology/Next-Gen Tools to Reduce eDiscovery Spend
Have a process, but also think about what you are doing and the specifics of the case.  Restrict the date range if possible.  Reuse the results when you have overlapping cases (e.g., privilege review).  Don’t just look at docs/hour when monitoring the review.  Look at accuracy and get feedback about what they are finding.  CAL tends to result in doing too much document review (want to stop at 75% recall but end up hitting 89%).  Using a tool to do redactions will give false positives, so you need manual QC of the result.  When replacing a patient ID with a consistent anonymized identifier, you can’t just transform the ID because that could be inverted, resulting in a HIPAA violation.

eDiscovery for the Rest of us
What are ediscovery considerations for relatively small data sets?  During meet and confer, try to cooperate.  Judges hate ediscovery disputes.  Let the paralegals hash out the details — attorneys don’t really care about the details as long as it works.  Remote collection can avoid travel costs and hourly fees while keeping strangers out of the client’s office.  The biggest thing they look for from vendors is cost.  Need a certain volume of data for TAR to be practical.  Email threading can be used at any size.

Does Compliance Stifle or Spark Innovation?
Startups tend to be full of people fleeing big corporations to get away from compliance requirements.  If you do compliance well, that can be an advantage over competitors.  Look at it as protecting the longevity of the business (protecting reputation, etc.).  At the DoD, compliance stifles innovation, but it creates a barrier against bad guys.  They have thousands of attacks per day and are about 8 years behind normal innovation.  Gray crimes are an area for innovation — examples include manipulation (influencing elections) and tanking a stock IPO by faking a poisoning.  Hospitals and law firms tend to pay, so they are prime targets for ransomware.

Panels That I Couldn’t Attend:
California and EU Privacy Compliance
What it all Comes Down to – Enterprise Cybersecurity Governance
Selecting eDiscovery Platforms and Vendors
Defensible Disposition of Data
Biometrics and the Evolving Legal Landscape
Storytelling in the Age of eDiscovery
Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
The Internet of Things and Everything as a Service – the Convergence of Security, Privacy and Product Liability
Similarities and Differences Between the GDPR and the New California Consumer Privacy Act – Similar Enough?
The Impact of the Internet of Things on eDiscovery
Escalating Cyber Risk From the IT Department to the Boardroom
So you Weren’t Quite Ready for GDPR?
Security vs. Compliance and Why Legal Frameworks Fall Short to Improve Information Security
How to Clean up Files for Governance and GDPR
Deception, Active Defense and Offensive Security…How to Fight Back Without Breaking the Law?
Information Governance – Separating the “Junk” from the “Jewels”
What are Big Law Firms Saying About Their LegalTech Adoption Opportunities and Challenges?
Cyber and Data Security for the GC: How to Stay out of Headlines and Crosshairs

Highlights from Text Analytics Forum 2018

Text Analytics Forum is part of KMWorld.  It was held on November 7-8 at the JW Marriott in D.C.  Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day.  There was a technical track and an applications track.  Most of the slides are available here.  My photos, including photos of some slides that caught my attention or were not available on the website, are available here.  Since most slides are available online, I have only a few brief highlights below.  Next year’s KMWorld will be November 5-7, 2019.

The Think Creatively & Make Better Decisions keynote contained various interesting facts about the things that distract us and make us unproductive.  Distracted driving causes more deaths than drunk driving.  Attention spans have dropped from 12 seconds to 8 seconds (goldfish have a 9-second attention span).  Japan has texting lanes for walking.  71% of business meetings are unproductive, and 33% of employee time is spent in meetings.  281 billion emails were sent in 2018.  Don’t leave ideas and creative thinking to the few.  Mistakes shouldn’t be reprimanded.  Break down silos between departments.

The Deep Text Look at Text Analytics keynote explained that text mining is only part of text analytics.  Text mining treats words as things, whereas text analytics cares about meaning.  Sentiment analysis is now learning to handle things like: “I would have loved your product except it gave me a headache.”  It is hard for humans to pick good training documents for automatic categorization systems (what the e-discovery world calls predictive coding or technology-assisted review).  Computer-generated taxonomies are incredibly bad.  Deep learning is not like what humans do.  Deep learning takes 100,000 examples to detect a pattern, whereas humans will generalize (perhaps wrongly) from 2 examples.

The Cognitive Computing keynote mentioned that sarcasm makes sentiment analysis difficult.  For example: “I’m happy to spend a half hour of my lunch time in line at your bank.”  There are products to measure tone from audio and video.

The Don’t Stop at Stopwords: Function Words in Text Analytics session noted that function words, unlike content words, are added by the writer subconsciously.  Use of words like “that” or “the” instead of “this” can indicate the author is distancing himself/herself from the thing being described, possibly indicating deception.  They’ve used their techniques in about 20 different languages.  They need at least 300 words to make use of function word frequency to build a baseline.

The Should We Consign All Taxonomies to the Dustbin? talk considered the possibility of using machine learning to go directly from problem to solution without having a taxonomy in between.  He said that 100k documents or 1 million words of text are needed to get going.