
Highlights from Text Analytics Forum 2019

Text Analytics Forum is part of the KMWorld conference. It was held on November 6-7 at the JW Marriott in D.C. Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day. There was a technical track and an applications track. Most of the slides are available here. My photos, including photos of some slides that caught my attention or were not available on the website, are available here. Since most slides are available online, I have only a few brief highlights below.

Automatic summarization comes in two forms: extracted and generative.  Generative summarization doesn’t work very well, and some products are dropping the feature.  Enron emails containing lies tend to be shorter.  When a customer threatens to cancel a service, the language they use may indicate they are really looking to bargain.  Deep learning works well with data, but not with concepts.  For good results, make use of all document structure (titles, boldface, etc.); search engines often ignore such details.  Keywords assigned to a document by a human are often unreliable or inconsistent.  Having the document’s author write a summary may be more useful.  Rules work better when there is little content (machine learning prefers more content).  Knowledge graphs, which were a major topic at the conference, are better for discovery than for search.

DBpedia provides structured data from Wikipedia for knowledge graphs.  SPARQL is a standardized query language for graph databases, similar to SQL for relational databases.  When using knowledge graphs, the more connections away the answer is, the more likely it is to be wrong.  Knowledge graphs should always start with a good taxonomy or ontology.

Social media text (e.g., tweets) contains a lot of noise.  Some software handles both social media and normal text, but some only really works with one or the other.  Sentiment analysis can be tripped up when only looking at keywords.  For example, compare “product worked terribly” to “I’m terribly happy with the product.”  Humans are only 60-80% accurate at sentiment analysis.
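
To make the keyword pitfall concrete, here is a minimal sketch of a naive keyword-based scorer.  The word lists and the function name `keyword_sentiment` are made up for this illustration and are not taken from any product discussed at the conference.

```python
import re

# Hypothetical, tiny word lists chosen only for this illustration.
NEGATIVE = {"terribly", "awful", "bad"}
POSITIVE = {"happy", "great", "good"}


def keyword_sentiment(text):
    """Score text as (number of positive keywords) - (number of negative keywords)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# "terribly" is a complaint in the first sentence but only an intensifier in the second.
print(keyword_sentiment("product worked terribly"))
print(keyword_sentiment("I'm terribly happy with the product"))
```

The first sentence scores below zero as expected, but the second, which is clearly positive, only scores zero because “terribly” is counted against it.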

Highlights from IG3 West 2018

The IG3 West conference was held by Ing3nious at the Paséa Hotel & Spa in Huntington Beach, California.  This conference differed from other recent Ing3nious events in several ways.  It was two days of presentations instead of one.  There were three simultaneous panels instead of two.  Between panels there were sometimes three simultaneous vendor technology demos.  There was an exhibit hall with over forty vendor tables.  Due to the different format, I was only able to attend about a third of the presentations.  My notes are below.  You can find my full set of photos here.

Stop Chasing Horses, Start Building Fences: How Real-Time Technologies Change the Game of Compliance and Governance
Chris Surdak, the author of Jerk: Twelve Steps to Rule the World, talked about changing technology and the value of information, claiming that information is the new wealth.  Facebook, Amazon, Apple, Netflix, and Google together are worth more than France [apparently he means the sum of their market capitalizations is greater than the GDP of France, though that is a rather apples-to-oranges comparison since GDP is an annualized number].  We are exposed to persistent ambient surveillance (Alexa, Siri, Progressive Snapshot, etc.).  It is possible to detect whether someone is lying by using video to detect blood flow to their face.  Car companies monetized data about passengers’ weight (measured due to air bags).  Sentiment analysis has a hard time with sarcasm.  You can’t find emails about fraud by searching for “fraud”; discussions about fraudulent activity may be disguised as weirdly specific conversations about lunch.  The problem with graph analysis is that a large volume of talk about something doesn’t mean that it’s important.  The most important thing may be what’s missing.  When RadioShack went bankrupt, its remaining value was in its customer data (remember them asking for your contact info when you bought batteries?).  A one-word change to FRCP 37(e) should have changed corporate retention policies, but nobody changed.  The EU’s right to be forgotten is virtually impossible to implement in reality (how to deal with backup tapes?) and almost nobody does it.  Campbell’s has people shipping their DNA to them so they can make diet recommendations to them.  With the GDPR, consent nullifies the protections, so it doesn’t really protect your privacy.

AI and the Corporate Law Department of the Future
Gartner says AI is at the peak of inflated expectations and a trough of disillusionment will follow.  Expect to be able to buy autonomous vehicles by 2023.  The economic downturn of 2008 caused law firms to start using metrics.  Legal will take a long time to adopt AI — managing partners still have assistants print stuff out.  Embracing AI puts a firm ahead of its competitors.  Ethical obligations are also an impediment to adoption of technology, since lawyers are concerned about understanding the result.

Advanced TAR Considerations: A 500 Level Crash Course
Continuous Active Learning (CAL), also called TAR 2.0, can adapt to shifts in the concept of relevance that may occur during the review.  There doesn’t seem to be much difference in the efficiency of SVM vs logistic regression when they are applied to the same task.  There can be a big efficiency difference between different tasks.  TAR 1.0 requires a subject-matter expert for training, but senior attorneys are not always readily available.  With TAR 1.0 you may be concerned that you will be required to disclose the training set (including non-responsive documents), but with TAR 2.0 there is case law that supports that being unnecessary [I’ve seen the argument that the production itself is the training set, but that neglects the non-responsive documents that were reviewed (and used for training) but not produced.  On the other hand, if you are talking about disclosing just the seed set that was used to start the process, that can be a single document and it has very little impact on the result].  Case law can be found at, which is updated at the end of each year.  TAR needs text, not image data.  Sometimes keywords are good enough.  When it comes to government investigations, many agencies (FTC, DOJ) use/accept TAR.  It really depends on the individual investigator, though, and you can’t fight their decision (the investigator is the judge).  Don’t use TAR for government investigations without disclosing that you are doing so.  TAR can have trouble if there are documents having high conceptual similarity where some are relevant and some aren’t.  Should you tell opposing counsel that you’re using TAR?  Usually, but it depends on the situation.  When the situation is symmetrical, both sides tend to be reasonable.  When it is asymmetrical, the side with very little data may try to make things expensive for the other side, so say something like “both sides may use advanced technology to produce documents” and don’t give more detail than that (e.g., how TAR will be trained, who will do the training, etc.) or you may invite problems.  Disclosing the use of TAR up front and getting agreement may avoid problems later.  Be careful about “untrainable documents” (documents containing too little text): separate them out, and maybe use metadata or file type to help analyze them.  Elusion testing can be used to make sure too many relevant documents weren’t missed.  One panelist said 384 documents could be sampled from the elusion set, though that may sometimes not be enough.  [I have to eat some crow here.  I raised my hand and pointed out that the margin of error for the elusion has to be divided by the prevalence to get the margin of error for the recall, which is correct.  I went on to say that with a sample of 384 giving ±5% for the elusion you would have ±50% for the recall if prevalence was 10%, making the measurement worthless.  The mistake is that while a sample of 384 technically implies a worst case of ±5% for the margin of error for elusion, it’s not realistic for the margin of error to be that bad for elusion because ±5% would occur if elusion was near 50%, but elusion is typically very small (smaller than the prevalence), causing the margin of error for the elusion to be significantly less than ±5%.  The correct margin of error for the recall from an elusion sample of 384 documents would be ±13% if the prevalence is 10%, and ±40% if the prevalence is 1%.  So, if prevalence is around 10% an elusion sample of 384 isn’t completely worthless (though it is much worse than the ±5% we usually aim for), but if prevalence is much lower than that it would be].
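
The relationship in the bracketed note above can be sketched numerically.  This is a rough illustration only: it uses the normal approximation for the margin of error (which is poor when elusion is very small), it assumes the discard set is nearly the whole population, and the 75% recall plugged in is my own assumption.  Because of those choices it will not exactly reproduce the ±13% and ±40% figures quoted above, but it shows the same pattern.  The function names are made up for the example.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p from n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


def recall_margin(n, prevalence, assumed_recall, z=1.96):
    """Rough margin of error for recall implied by an elusion sample of size n.

    Assumes the discard set is nearly the whole population, so that
    elusion is about (1 - recall) * prevalence, and divides the elusion
    margin of error by the prevalence.
    """
    elusion = (1.0 - assumed_recall) * prevalence
    return margin_of_error(elusion, n, z) / prevalence


n = 384
print(f"worst case for elusion (p = 0.5): +/-{margin_of_error(0.5, n):.3f}")
for prevalence in (0.10, 0.01):
    m = recall_margin(n, prevalence, assumed_recall=0.75)  # 75% recall is an assumption
    print(f"prevalence {prevalence:.0%}: recall margin of error about +/-{m:.2f}")
```

The worst-case line recovers the familiar ±5% for a sample of 384, while the realistic elusion values give a much smaller elusion margin of error that is then inflated by dividing by the prevalence.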

40 Years in 30 Minutes: The Background to Some of the Interesting Issues we Face
Steven Brower talked about the early days of the Internet and the current state of technology.  Early on, a user ID was used to tell who you were, not to keep you out.  Technology was elitist, and user-friendly was not a goal.  Now, so much is locked down for security reasons that things become unusable.  Law firms that prohibit access to social media force lawyers onto “secret” computers when a client needs something taken down from YouTube.  Emails about laws against certain things can be blocked due to keyword hits for the illegal things being described.  We don’t have real AI yet.  The next generation beyond predictive coding will be able to identify the 50 key documents for the case.  During e-discovery, try searching for obscenities to find things like: “I don’t give a f*** what the contract says.”  Autonomous vehicles won’t come as soon as people are predicting.  Snow is a problem for them.  We may get vehicles that drive autonomously from one parking lot to another, so the route is well known.  When there are a bunch of inebriated people in the car, who should it take commands from?  GDPR is silly since email bounces from computer to computer around the world.  The Starwood breach does not mean you need to get a new passport, since your passport number was already out there.  To improve your security, don’t try to educate everyone about cybersecurity; you can eliminate half the risk by getting payroll to stop responding to emails asking for W2 data that appear to come from the CEO.  Scammers use the W2 data to file tax returns to get the refunds.  This is so common the IRS won’t even accept reports on it anymore.  You will still get your refund if it happens to you, but it’s a hassle.

Digging Into TAR
I moderated this panel, so I didn’t take notes.  We did the TAR vs. Keyword Search Challenge again.  The results are available here.

After the Incident: Investigating and Responding to a Data Breach
Plan in advance, and remember that you may not have access to the laptop containing the plan when there is a breach. Get a PR firm that handles crises in advance.  You need to be ready for the negative comments on Twitter and Facebook.  Have the right SMEs for the incident on the team.  Assume that everything is discoverable — attorney-client privilege won’t save you if you ask the attorney for business (rather than legal) advice.  Notification laws vary from state to state.  An investigation by law enforcement may require not notifying the public for some period of time.  You should do an annual review of your cyber insurance since things are changing rapidly.  Such policies are industry specific.

Employing Technology/Next-Gen Tools to Reduce eDiscovery Spend
Have a process, but also think about what you are doing and the specifics of the case.  Restrict the date range if possible.  Reuse the results when you have overlapping cases (e.g., privilege review).  Don’t just look at docs/hour when monitoring the review.  Look at accuracy and get feedback about what they are finding.  CAL tends to result in doing too much document review (want to stop at 75% recall but end up hitting 89%).  Using a tool to do redactions will give false positives, so you need manual QC of the result.  When replacing a patient ID with a consistent anonymized identifier, you can’t just transform the ID because that could be inverted, resulting in a HIPAA violation.
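
As an illustration of the last point, here is a minimal sketch of one way to get a consistent identifier that is not computed from the patient ID: a lookup table of random tokens.  The class name, the token length, and the ID format are all made up for the example, and the panel did not say which technique to use, so treat this as one possible approach rather than what was recommended.

```python
import secrets


class Pseudonymizer:
    """Map each real ID to a random token, remembering the assignment."""

    def __init__(self):
        self._table = {}  # real ID -> random token; the table itself must be kept secure

    def token_for(self, patient_id):
        # The token is random rather than computed from the ID, so it cannot
        # be inverted without access to the table.
        if patient_id not in self._table:
            self._table[patient_id] = secrets.token_hex(8)
        return self._table[patient_id]


p = Pseudonymizer()
a1 = p.token_for("MRN-001234")  # made-up ID format
a2 = p.token_for("MRN-001234")
b = p.token_for("MRN-005678")
print(a1 == a2, a1 != b)
```

The same patient always gets the same token, so documents can still be grouped by patient, but the token carries no information about the original ID.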

eDiscovery for the Rest of us
What are ediscovery considerations for relatively small data sets?  During meet and confer, try to cooperate.  Judges hate ediscovery disputes.  Let the paralegals hash out the details — attorneys don’t really care about the details as long as it works.  Remote collection can avoid travel costs and hourly fees while keeping strangers out of the client’s office.  The biggest thing they look for from vendors is cost.  Need a certain volume of data for TAR to be practical.  Email threading can be used at any size.

Does Compliance Stifle or Spark Innovation?
Startups tend to be full of people fleeing big corporations to get away from compliance requirements.  If you do compliance well, that can be an advantage over competitors.  Look at it as protecting the longevity of the business (protecting reputation, etc.).  At the DoD, compliance stifles innovation, but it creates a barrier against bad guys.  They have thousands of attacks per day and are about 8 years behind normal innovation.  Gray crimes are an area for innovation; examples include manipulation (influencing elections) and tanking a stock IPO by faking a poisoning.  Hospitals and law firms tend to pay, so they are prime targets for ransomware.

Panels That I Couldn’t Attend:
California and EU Privacy Compliance
What it all Comes Down to – Enterprise Cybersecurity Governance
Selecting eDiscovery Platforms and Vendors
Defensible Disposition of Data
Biometrics and the Evolving Legal Landscape
Storytelling in the Age of eDiscovery
Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
The Internet of Things and Everything as a Service – the Convergence of Security, Privacy and Product Liability
Similarities and Differences Between the GDPR and the New California Consumer Privacy Act – Similar Enough?
The Impact of the Internet of Things on eDiscovery
Escalating Cyber Risk From the IT Department to the Boardroom
So you Weren’t Quite Ready for GDPR?
Security vs. Compliance and Why Legal Frameworks Fall Short to Improve Information Security
How to Clean up Files for Governance and GDPR
Deception, Active Defense and Offensive Security…How to Fight Back Without Breaking the Law?
Information Governance – Separating the “Junk” from the “Jewels”
What are Big Law Firms Saying About Their LegalTech Adoption Opportunities and Challenges?
Cyber and Data Security for the GC: How to Stay out of Headlines and Crosshairs

Highlights from Text Analytics Forum 2018

Text Analytics Forum is part of KMWorld.  It was held on November 7-8 at the JW Marriott in D.C.  Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day.  There was a technical track and an applications track.  Most of the slides are available here.  My photos, including photos of some slides that caught my attention or were not available on the website, are available here.  Since most slides are available online, I have only a few brief highlights below.  Next year’s KMWorld will be November 5-7, 2019.

The Think Creatively & Make Better Decisions keynote contained various interesting facts about the things that distract us and make us unproductive.  Distracted driving causes more deaths than drunk driving.  Attention spans have dropped from 12 seconds to 8 seconds (goldfish have a 9-second attention span).  Japan has texting lanes for walking.  71% of business meetings are unproductive, and 33% of employee time is spent in meetings.  281 billion emails were sent in 2018.  Don’t leave ideas and creative thinking to the few.  Mistakes shouldn’t be reprimanded.  Break down silos between departments.

The Deep Text Look at Text Analytics keynote explained that text mining is only part of text analytics.  Text mining treats words as things, whereas text analytics cares about meaning.  Sentiment analysis is now learning to handle things like: “I would have loved your product except it gave me a headache.”  It is hard for humans to pick good training documents for automatic categorization systems (what the e-discovery world calls predictive coding or technology-assisted review).  Computer-generated taxonomies are incredibly bad.  Deep learning is not like what humans do.  Deep learning takes 100,000 examples to detect a pattern, whereas humans will generalize (perhaps wrongly) from 2 examples.

The Cognitive Computing keynote mentioned that sarcasm makes sentiment analysis difficult.  For example: “I’m happy to spend a half hour of my lunch time in line at your bank.”  There are products to measure tone from audio and video.

The Don’t Stop at Stopwords: Function Words in Text Analytics session noted that function words, unlike content words, are added by the writer subconsciously.  Use of words like “that” or “the” instead of “this” can indicate the author is distancing himself/herself from the thing being described, possibly indicating deception.  They’ve used their techniques in about 20 different languages.  They need at least 300 words to make use of function word frequency to build a baseline.

The Should We Consign All Taxonomies to the Dustbin? talk considered the possibility of using machine learning to go directly from problem to solution without having a taxonomy in between.  He said that 100k documents or 1 million words of text are needed to get going.

Best Legal Blog Contest 2018

From a field of hundreds of potential nominees, the Clustify Blog received enough nominations to be selected to compete in The Expert Institute’s Best Legal Blog Contest in the Legal Tech category.

Now that the blogs have been nominated and placed into categories, it is up to readers to select the very best.  Each blog will compete for rank within its category, with the three blogs receiving the most votes in each category being crowned overall winners.  A reader can vote for as many blogs as he/she wants in each category, but can vote for a specific blog only once (this is enforced by requiring authentication with Google, LinkedIn, or Twitter).  Voting closes at 12:00 AM on December 17th, at which point the votes will be tallied and the winners announced.  You can find the Clustify Blog voting page here.

Photos from ILTACON 2018

ILTACON 2018 was held at the Gaylord National Resort & Convention Center in National Harbor, Maryland.  I wasn’t able to attend the sessions (so I don’t have any notes to share) because I was manning the Clustify booth in the exhibit hall, but I did take a lot of photos which you can view here.  The theme for the reception this year was video games, in case you are wondering about the oddly dressed people in some of the photos.

Detecting Fraud Using Benford’s Law: Mathematical Details

When people fabricate numbers for fraudulent purposes they often fail to take Benford’s Law into account, making it possible to detect the fraud.  This article is a supplement to my article “Detecting Fraud Using Benford’s Law” (if the link doesn’t take you directly to the right page, it is PDF page number 69 or printed page number 67) from the Summer 2015 issue of Criminal Justice.

Benford’s Law says that naturally occurring numbers that span several orders of magnitude (i.e., differing numbers of digits, or differing powers of 10 when written in scientific notation like 3.15 x 10^2) should start with “1” 30.1% of the time, and they should start with “9” only 4.6% of the time.  The probability of each leading digit is given in this chart:


Someone who attempts to commit fraud by fabricating numbers (e.g., fake invoices or accounting entries) without knowing Benford’s Law will probably generate numbers that don’t have the expected probability distribution.  They might, for example, assume that numbers starting with “1” should have the same probability as numbers starting with any other digit, resulting in their fraudulent numbers looking very suspicious to someone who knows Benford’s Law.

The Criminal Justice article details the history of Benford’s Law and explains when Benford’s Law is expected to be applicable.  What I’ll add here is more mathematical detail on how the probability of a particular leading digit, or sequence of digits, can be computed.

The key assumption behind Benford’s Law is scale invariance, meaning that things shouldn’t change if we switch to a different unit of measure.  If we convert a large set of monetary values from dollars to yen, or pesos, or any other currency (real or concocted), the percentage of values starting with a particular digit should stay (approximately) the same.  Suppose we convert from dollars to a currency that is worth half as much.  An item that costs $1 will cost 2 units of the new currency.  An item that costs $1.99 will cost 3.98 units of the new currency.  Likewise, $1000 becomes 2000 units of the new currency, and $1999 becomes 3998 units of the new currency.  So the probability of a number starting with “1” has to equal the sum of the probabilities of numbers starting with “2” or “3” if the probability of a particular digit is to remain unchanged by switching currencies.  The probabilities from the bar chart above behave as expected (30.1% = 17.6% + 12.5%).
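
The currency example can be tried out with a small experiment.  This is only a sketch: the “dollar” amounts are a made-up grid of values spread evenly in logarithm over six orders of magnitude (an assumption chosen because such data follows Benford’s Law closely, not real financial data), and the helper names are hypothetical.

```python
from collections import Counter


def leading_digit(x):
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def digit_frequencies(values):
    """Fraction of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


N = 60000
dollars = [10 ** (6 * (i + 0.5) / N) for i in range(N)]  # from 1 up to 1,000,000
new_currency = [2 * v for v in dollars]                  # currency worth half as much

before = digit_frequencies(dollars)
after = digit_frequencies(new_currency)
for d in range(1, 10):
    print(d, f"{before[d]:.3f}", f"{after[d]:.3f}")

# Amounts that started with "1" in dollars start with "2" or "3" after conversion.
print(f"{before[1]:.3f} vs {after[2] + after[3]:.3f}")
```

The two columns should agree to within rounding, and the fraction of dollar amounts starting with “1” should match the combined fraction of converted amounts starting with “2” or “3”.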

To prove that scale invariance leads to the probabilities predicted by Benford’s Law, start by converting all possible numbers to scientific notation (e.g. 315 is written as 3.15 x 10^2) and realize that the power of 10 doesn’t matter when our only concern is the probability of a certain leading digit.  So all numbers map to the interval [1,10) as shown in this figure:


Next, assume there is some function, f(x), that gives the probability of each possible set of leading digits (technically a probability density function), so f(4.25) accounts for the probability of finding a value to be 0.0425, 0.425, 4.25, 42.5, 425, 4250, etc.  Our goal is to find f(x).  This graph illustrates the constraint that scale invariance puts on f(x):


The area under the f(x) curve between x=2 and x=2.5, shown in red, must equal the area between x=4 and x=5, shown in orange, because a change in scale that multiplies all values by 2 will map the values from the red region into the orange region.  Such relationships between areas under various parts of the curve must be satisfied for any change of scale, not just a factor of two.

Finally, let’s get into the gory math and prove Benford’s Law (warning: calculus!).  The probability, P(D), of a number starting with digit D is the area under the f(x) curve from D to D+1:

P(D) = \int_D^{D+1} f(x) \,dx

Assuming that scale invariance holds, the probability has to stay the same if we change scale such that all values are multiplied by β:

P(D) = \int_{\beta D}^{\beta (D+1)} f(x) \,dx

The equation above must be true for any β, so the derivative with respect to β must be zero:

\frac{\partial}{\partial \beta} P(D) = 0 \ \ \ \Rightarrow\ \ \ (D + 1) f\left(\beta(D + 1)\right) - D f(\beta D) = 0

The equation above is satisfied if f(x)=c/x, where c is a constant, because both terms then reduce to c/β and cancel.  The total area under the f(x) curve must be 1 because it is the probability that a number will start with any possible set of digits, so that determines the value of c to be 1/ln(10), i.e. 1 over the natural logarithm of 10:

\int_1^{10} f(x) \,dx = 1 \ \ \ \Rightarrow\ \ \ f(x) = \frac{1}{x \ln(10)}
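
As a sanity check on f(x), the areas can be computed numerically.  This is a minimal sketch using a simple midpoint rule; the helper name `area`, the step count, and the scale factor 1.7 are arbitrary choices for illustration.

```python
import math


def f(x):
    """Density on [1, 10): f(x) = 1 / (x ln 10)."""
    return 1.0 / (x * math.log(10))


def area(a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


print(f"total area on [1, 10): {area(1, 10):.6f}")

# Doubling maps [1, 2) onto [2, 4), so the two areas should agree.
print(f"area [1, 2): {area(1, 2):.6f}   area [2, 4): {area(2, 4):.6f}")

# The same should hold for any other scale factor; 1.7 is an arbitrary pick.
beta = 1.7
print(f"area [3, 4): {area(3, 4):.6f}   scaled: {area(3 * beta, 4 * beta):.6f}")
```

The total area should come out as 1, and each pair of areas should match, which is the scale invariance constraint expressed in numbers.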

Finally, plug f(x) into our first equation and integrate to get a result in terms of base-10 logarithms:

P(D) = \frac{\ln(D+1) - \ln(D)}{\ln(10)} = \log_{10}(D + 1) - \log_{10}(D)
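
The formula is easy to evaluate directly.  The short sketch below tabulates P(D) for each leading digit; the function name is just a label chosen for this example.

```python
import math


def benford_first_digit(d):
    """P(D) = log10(D + 1) - log10(D) for a leading digit D in 1..9."""
    return math.log10(d + 1) - math.log10(d)


probs = {d: benford_first_digit(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"{d}: {p:6.1%}")
print(f"sum: {sum(probs.values()):.3f}")
```

The values for “1” and “9” match the 30.1% and 4.6% quoted earlier, and the nine probabilities sum to 1.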

Knowing f(x), we can compute the probability of finding a number with any sequence of initial digits.  To find the probability of starting with 2 we integrated from 2 to 3.  To find the probability of starting with the two digits 24, we integrate f(x) from 2.4 to 2.5.  To find the probability of starting with the three digits 247, we integrate f(x) from 2.47 to 2.48.  The general equation for two leading digits, D_1D_2, is:

P(D_1D_2) = \log_{10}(D_1.D_2 + 0.1) - \log_{10}(D_1.D_2)

Multiplying both arguments by 10 leaves the difference of the logarithms unchanged, so this is equivalent to:

P(D_1D_2) = \log_{10}(D_1D_2 + 1) - \log_{10}(D_1D_2)

For example, the probability of a number starting with “2” followed by “4” is log10(25)-log10(24) = 1.77%.

Similarly, the equation for three leading digits, D_1D_2D_3, is:

P(D_1D_2D_3) = \log_{10}(D_1D_2D_3 + 1) - \log_{10}(D_1D_2D_3)
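
The one-, two-, and three-digit formulas all have the same form, so a single function covers any number of leading digits.  This is a small sketch with a made-up function name; the prefix is passed as an integer such as 24 or 247.

```python
import math


def benford_prefix(prefix):
    """Probability that a number's leading digits are `prefix` (e.g. 2, 24 or 247)."""
    if prefix < 1:
        raise ValueError("prefix must be a positive integer")
    return math.log10(prefix + 1) - math.log10(prefix)


print(f"P(24)  = {benford_prefix(24):.2%}")
print(f"P(247) = {benford_prefix(247):.3%}")

# The ten two-digit prefixes 10..19 together cover every number starting with 1,
# so their probabilities should add up to the single-digit probability for 1.
print(sum(benford_prefix(n) for n in range(10, 20)), benford_prefix(1))
```

When checking a set of invoices or accounting entries, these values are what the observed fraction of numbers with each prefix would be compared against.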

SSD Storage Can Lose Data When Left Without Power

I came across this article today, and I think it is important for everyone to be aware of it.  It says that SSDs (solid-state drives), which are becoming increasingly popular for computer storage due to their fast access times and ability to withstand being dropped, “need consistent access to a power source in order for them to not lose data over time. There are a number of factors that influence the non-powered retention period that an SSD has before potential data loss. These factors include amount of use the drive has already experienced, the temperature of the storage environment, and the materials that comprise the memory chips in the drive.”  Keep that risk in mind if computers are powered down during a legal hold.  The article gives details about how long the drives are supposed to retain data while powered down.