October | 2015 | Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

When people fabricate numbers for fraudulent purposes they often fail to take Benford’s Law into account, making it possible to detect the fraud. This article is a supplement to my article “Detecting Fraud Using Benford’s Law” (if the link doesn’t take you directly to the right page, it is PDF page number 69 or printed page number 67) from the Summer 2015 issue of Criminal Justice.

Benford’s Law says that naturally occurring numbers that span several orders of magnitude (i.e., differing numbers of digits, or differing powers of 10 when written in scientific notation like 3.15 x 10²) should start with “1” 30.1% of the time, and they should start with “9” only 4.6% of the time. The probability of each leading digit is given in this chart:
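For readers who cannot see the chart, its values can be reproduced with the formula derived later in this article, P(D) = log₁₀(D+1) − log₁₀(D). The following is a minimal sketch; the function name `benford_first_digit` is just an illustrative choice.

```python
import math


def benford_first_digit(d: int) -> float:
    """Probability that a number's leading digit is d (1-9) under Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(d + 1) - math.log10(d)


# Print the probability for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_first_digit(d):.1%}")
```

The nine probabilities sum to 1, since every number has to start with some digit from 1 to 9.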

Someone who attempts to commit fraud by fabricating numbers (e.g., fake invoices or accounting entries) without knowing Benford’s Law will probably generate numbers that don’t have the expected probability distribution. They might, for example, assume that numbers starting with “1” should have the same probability as numbers starting with any other digit, resulting in their fraudulent numbers looking very suspicious to someone who knows Benford’s Law.
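One way such a comparison might be carried out is sketched below. It is only an illustration under stated assumptions: the Pearson chi-square statistic is my choice of comparison (the article does not prescribe a particular test), both data sets are randomly generated stand-ins rather than real invoices, and the expected proportions come from the formula derived later in this article.

```python
import math
import random
from collections import Counter

# Benford's expected share for each leading digit 1-9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_VALUE = 15.507


def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    # Scientific notation puts the first significant digit first, e.g. '3.15...e+02'.
    return int(f"{x:.15e}"[0])


def chi_square_vs_benford(values) -> float:
    """Pearson chi-square statistic of observed leading digits against Benford's Law."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())


random.seed(1)
# Hypothetical "natural" amounts: spread log-uniformly over four orders of magnitude.
natural = [10 ** random.uniform(1, 5) for _ in range(2000)]
# Hypothetical fabricated amounts: every leading digit is equally likely.
fabricated = [random.uniform(100, 1000) for _ in range(2000)]

for name, data in (("natural", natural), ("fabricated", fabricated)):
    stat = chi_square_vs_benford(data)
    verdict = "suspicious" if stat > CRITICAL_VALUE else "consistent with Benford"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```

A large statistic only says that the leading digits do not look like Benford's Law; whether that points to fraud depends on whether the law is expected to apply to the data in the first place.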

The Criminal Justice article details the history of Benford’s Law and explains when Benford’s Law is expected to be applicable. What I’ll add here is more mathematical detail on how the probability of a particular leading digit, or sequence of digits, can be computed.

The key assumption behind Benford’s Law is scale invariance, meaning that things shouldn’t change if we switch to a different unit of measure. If we convert a large set of monetary values from dollars to yen, or pesos, or any other currency (real or concocted), the percentage of values starting with a particular digit should stay (approximately) the same. Suppose we convert from dollars to a currency that is worth half as much. An item that costs $1 will cost 2 units of the new currency. An item that costs $1.99 will cost 3.98 units of the new currency. Likewise, $1000 becomes 2000 units of the new currency, and $1999 becomes 3998 units of the new currency. So the probability of a number starting with “1” has to equal the sum of the probabilities of numbers starting with “2” or “3” if the probability of each leading digit is to remain unchanged by switching currencies. The probabilities from the bar chart above behave as expected (30.1% = 17.6% + 12.5%).
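The currency example can be tried out numerically. The sketch below assumes a made-up set of "dollar" amounts spread log-uniformly over six orders of magnitude (an assumption chosen because such data follows Benford's Law), doubles every amount, and compares the leading-digit frequencies before and after.

```python
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    # Scientific notation puts the first significant digit first, e.g. '3.15...e+02'.
    return int(f"{x:.15e}"[0])


def digit_frequencies(values):
    """Fraction of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return {d: counts[d] / n for d in range(1, 10)}


random.seed(0)
# Hypothetical dollar amounts, log-uniform between $1 and $1,000,000.
dollars = [10 ** random.uniform(0, 6) for _ in range(100_000)]
# The same amounts in a currency worth half as much.
new_currency = [2 * v for v in dollars]

before = digit_frequencies(dollars)
after = digit_frequencies(new_currency)

print("digit  dollars  new currency")
for d in range(1, 10):
    print(f"{d:>5}  {before[d]:7.3f}  {after[d]:12.3f}")

# Every amount that started with 1 now starts with 2 or 3.
print(f"P(1) before: {before[1]:.3f}, P(2) + P(3) after: {after[2] + after[3]:.3f}")
```

The two frequency columns should agree to within sampling noise, and the last line shows the "1" bucket being redistributed into the "2" and "3" buckets.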

To prove that scale invariance leads to the probabilities predicted by Benford’s Law, start by converting all possible numbers to scientific notation (e.g. 315 is written as 3.15 x 10²) and realize that the power of 10 doesn’t matter when our only concern is the probability of a certain leading digit. So all numbers map to the interval [1,10) as shown in this figure:

Next, assume there is some function, f(x), that gives the probability of each possible set of leading digits (technically a probability density function), so f(4.25) accounts for the probability of finding a value to be 0.0425, 0.425, 4.25, 42.5, 425, 4250, etc. Our goal is to find f(x). This graph illustrates the constraint that scale invariance puts on f(x):

The area under the f(x) curve between x=1.5 and x=2, shown in red, must equal the area between x=3 and x=4, shown in orange, because a change in scale that multiplies all values by 2 will map the values from the red region into the orange region. Such relationships between areas under various parts of the curve must be satisfied for any change of scale, not just a factor of two.

Finally, let’s get into the gory math and prove Benford’s Law (warning: calculus!). The probability, P(D), of a number starting with digit D is the area under the f(x) curve from D to D+1:

$P(D) = \int_D^{D+1} f(x) \,dx$

Assuming that scale invariance holds, the probability has to stay the same if we change scale such that all values are multiplied by β:

$P(D) = \int_{\beta D}^{\beta (D+1)} f(x) \,dx$

The equation above must be true for any β, so the derivative with respect to β must be zero:

$\frac{\partial}{\partial \beta} P(D) = 0 \ \ \ \Rightarrow\ \ \ (D + 1) f\left(\beta(D + 1)\right) - D f(\beta D) = 0$

The equation above is satisfied if f(x)=c/x, where c is a constant. The total area under the f(x) curve must be 1 because it is the probability that a number will start with any possible set of digits, so that determines the value of c to be 1/ln(10), i.e. 1 over the natural logarithm of 10:

$\int_1^{10} f(x) \,dx = 1 \ \ \ \Rightarrow\ \ \ f(x) = \frac{1}{x \ln(10)}$
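For readers who want the intermediate steps, substituting f(x)=c/x into the derivative condition makes both terms equal to c/β, so they cancel for every β and every D:

$(D + 1) \frac{c}{\beta (D + 1)} - D \frac{c}{\beta D} = \frac{c}{\beta} - \frac{c}{\beta} = 0$

The normalization condition then fixes the constant:

$\int_1^{10} \frac{c}{x} \,dx = c \left[\ln(10) - \ln(1)\right] = c \ln(10) = 1 \ \ \ \Rightarrow\ \ \ c = \frac{1}{\ln(10)}$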

Finally, plug f(x) into our first equation and integrate to get a result in terms of base-10 logarithms:

$P(D) = \frac{\ln(D+1) - \ln(D)}{\ln(10)} = \log_{10}(D + 1) - \log_{10}(D)$
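The closed-form result can be sanity-checked by integrating f(x) numerically. The sketch below uses a simple midpoint rule (my choice; any quadrature method would do) and also checks the scale-invariance constraint on one arbitrarily chosen interval and its image under a doubling of scale.

```python
import math


def f(x: float) -> float:
    """Density of the leading digits on [1, 10): f(x) = 1 / (x ln 10)."""
    return 1.0 / (x * math.log(10))


def integrate(func, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of func from a to b."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


# Area from D to D+1 compared with log10(D+1) - log10(D).
for d in range(1, 10):
    numeric = integrate(f, d, d + 1)
    closed_form = math.log10(d + 1) - math.log10(d)
    print(f"P({d}): numeric = {numeric:.6f}, closed form = {closed_form:.6f}")

# Total probability over all possible leading digits.
print(f"total area on [1, 10): {integrate(f, 1, 10):.6f}")

# Scale invariance: an interval and its image under multiplication by 2
# (here [1.2, 1.5] and [2.4, 3.0], chosen arbitrarily) have the same area.
print(f"area [1.2, 1.5] = {integrate(f, 1.2, 1.5):.6f}")
print(f"area [2.4, 3.0] = {integrate(f, 2.4, 3.0):.6f}")
```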

Knowing f(x), we can compute the probability of finding a number with any sequence of initial digits. To find the probability of starting with 2 we integrated from 2 to 3. To find the probability of starting with the two digits 24, we integrate f(x) from 2.4 to 2.5. To find the probability of starting with the three digits 247, we integrate f(x) from 2.47 to 2.48. The general equation for two leading digits, D₁D₂, is:

$P(D_1D_2) = \log_{10}(D_1.D_2 + 0.1) - \log_{10}(D_1.D_2)$

Which is equivalent to:

$P(D_1D_2) = \log_{10}(D_1D_2 + 1) - \log_{10}(D_1D_2)$

For example, the probability of a number starting with “2” followed by “4” is log₁₀(25)-log₁₀(24) = 1.77%.

Similarly, the equation for three leading digits, D₁D₂D₃, is:

$P(D_1D_2D_3) = \log_{10}(D_1D_2D_3 + 1) - \log_{10}(D_1D_2D_3)$
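The one-, two-, and three-digit formulas share the same shape, so a single function can cover a prefix of any length. This is a minimal sketch; the name `benford_prefix_probability` is illustrative, and the prefix is passed as an ordinary integer such as 24 or 247.

```python
import math


def benford_prefix_probability(prefix: int) -> float:
    """Probability that a number starts with the digits of `prefix` (e.g. 24 or 247)."""
    if prefix < 1:
        raise ValueError("prefix must be a positive integer with a nonzero first digit")
    return math.log10(prefix + 1) - math.log10(prefix)


for prefix in (2, 24, 247):
    print(f"P(starts with {prefix}) = {benford_prefix_probability(prefix):.4%}")

# Consistency check: the ten two-digit prefixes 20..29 together cover
# every number that starts with 2, so their probabilities add up to P(2).
two_digit_total = sum(benford_prefix_probability(n) for n in range(20, 30))
print(f"sum over 20..29 = {two_digit_total:.6f}, P(2) = {benford_prefix_probability(2):.6f}")
```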

The conference moved from Florida to Washington, D.C. this year. It was two full days of talks, often with two simultaneous sessions. Attendance seemed to be up compared to last year. My notes below provide only a few highlights from the subset of the sessions that I was able to attend.

Keynote: Business as (Un)usual: Leveraging the Changing Legal Marketplace
Law firms need to do a better job of paying attention to what their clients want. Clients want collaboration, teamwork, and compensation based on performance not credentials. Move beyond e-discovery and help them avoid litigation.

The Journey of 1000 Terabytes Begins with a Single Email: A Step-by-Step Guide to Applied Information Governance
Success in IG requires that it be phased in and have C-level buy-in, budget, dedicated staff, process, and technology. Who owns the data on gmail, Skype, etc.? You must provide good tools internally or employees will use external tools like Slack, allowing data to leak out. Make sure employees know the policy about taking their phones if necessary due to e-discovery. Need to have employees periodically re-read and acknowledge IG policies. Will BYOD die due to lack of separation between company and personal data?

You’ve Been Hacked, Now What? The Justice Department gives Guidance on Data Breach Mitigation, Response and Ethical Conundrums
I couldn’t attend

E-Discovery for the Other 85 Percent: Achieving Proportionality and Defensibility in Small Cases
This session was on e-discovery for small cases. There were several (sometimes controversial) tips for keeping costs down, including emailing custodian questionnaires instead of interviewing, collecting specific file types instead of imaging hard drives, viewing PST files in Outlook instead of processing (an audience member commented that the emails could be changed in Outlook and white text on a white background might be missed), and skipping privilege review if not needed (use a clawback agreement). Judge Nolan said the biggest cost in e-discovery is judges, lawyers, and clients not knowing what they are doing. She pointed out the Discovery Pilot Program.

Securing Client Data in a Post-Sony World: Shoring up Breach Points Among Clients, Law Firm and Vendor Partners
I couldn’t attend

Federal Judge Discusses E-Discovery Related Issues and Offers Guidance Regarding Persistent and Emerging Conundrums
Part of the session focused on e-discovery in criminal litigation, where the government is usually the producing party. Judge Vanaskie said that criminal defense lawyers do seem to know e-discovery pretty well. He mentioned that Apple asks to have its vendors’ ESI charges sealed. He explained that “taxation” means having the losing party pay the winning party’s e-discovery costs.

Helping ACEDS Members in Transition
I couldn’t attend

Practical Tips on Reducing Corporate Litigation Risks and Costs
In-house counsel needs to understand the business to be seen as a partner rather than as an obstacle. Analyze contracts and be careful about arbitration clauses with no e-discovery limit. One panel member suggested shopping e-discovery vendors for the best price while another pointed out that relationships may be more important than price. Reduce the risk of a Sony-like problem by deleting old data. It may be wise to defend against frivolous lawsuits, even if defense is expensive compared to the amount of the suit, to build a reputation that will avoid getting sued over and over. Deleting active data won’t help if off-site backups remain. Law firms are not good on IG — they tend to over-preserve. Might want to avoid integrated voicemail/email because you may end up having to do e-discovery on .WAV files. Law firms are too slow on moving to technology-assisted review (TAR).

Moore’s Law, Artificial Intelligence and the Coming Impact of Technology on Law and Discovery
Driverless cars will impact truckers, taxi drivers, and people in the auto insurance industry. Will AI replace lawyers? The Singularity is when computers become smarter than people, and is predicted to come as soon as 2048. CPUs are getting faster (Moore’s Law), storage is getting cheaper (Kryder’s Law), bandwidth is increasing (Nielsen’s Law), and the value of networks increases as the number of nodes increases (Metcalfe’s Law). Moravec’s Paradox says that high-order tasks are easy to program but low-order tasks are hard. Can computers be creative?

The EDRM eMSAT – E-Discovery Maturity Self Assessment Test
I couldn’t attend

Behind TAR’s ‘Vale’: How to Strike the Balance between Transparency, Disclosure and Cooperation
This was a somewhat contentious session. One panelist said he would disclose the use of TAR (but thought you didn’t have to), but would expect to reveal no more than that. Another panelist advocated the disclosure of seed sets and algorithms. Another pointed out that disclosure of non-responsive seed documents could be bad if the requesting party is a competitor. The argument that a seed set is work product may apply if the seed set is a judgmental sample (the specific documents chosen for training were picked by the lawyer), but may not apply for a random sample.

A Federal Judge’s Views Regarding E-Discovery Trends and Their Implications
New rules acknowledge that judges can shift costs when necessary. There are many tools for proportionality (sampling, capping time and money spent). Judge Grimm thought it was better for judges to be active so e-discovery problems could be avoided. Sanctions for failure to preserve are only available if there was an intent to deprive the requesting party — this promotes reasonableness instead of having different rules in different jurisdictions.

Challenging the Assumptions, Claims, and Givens of TAR to Make It More Effective and Just
I was on the panel, so I didn’t take notes

Swimming in the Blender: Successfully Navigating, Surviving (and maybe even surfing) E-Discovery Career Challenges
I couldn’t attend

Courts’ Vetting TAR Technologies and Methodologies: What Is the Proper Standard of Review?
Judge Waxse said lawyers want to drag out the case (billable hours), whereas judges and clients want a just and speedy trial — judges should try to involve the client to move things along. Judge Vanaskie said active management is imperative (agreeing with Judge Grimm’s earlier statement). He doesn’t like special masters — he wants to know what is going on. Judge Waxse said that “zealous advocacy” is gone and never applied to e-discovery — the culture needs to be fixed (regarding cooperation). He also said that Daubert applies to all proceedings, including e-discovery, not just the trial (disagreeing with Judge Peck’s writing on this). On the topic of seed sets being work product, Judge Vanaskie questioned whether the seed set itself really reveals the thought process that went into selecting the seed documents. Judge Nolan said that keyword search queries are not work product. Judge Levie said that disclosing seed sets was not one-size-fits-all. On use of special masters, Judge Levie said whether there was private communication between the judge and the special master varies.

Found Money: Raising E-Discovery Related Realization Rates
I couldn’t attend

Judges’ Review of E-Discovery’s New Rules, Rulings and Requirements
The new rules move proportionality back to where it originally was. Judge Waxse said the current rules cause problems because you can’t assess the proportionality factors (amount in controversy, needs of the case, etc.) early in the case. Judge Rodriguez questioned where the boundary is between a judge managing the e-discovery and advocating for a side. He also said the amended rules encourage face-to-face with the judge. Most sanctions for preservation failure involve both bad faith and dishonesty. Lawyers are conservative — will they really tell clients to lighten up on preservation under the new rules? Preserved data can help your case, too. Retention policy should depend on content, not just format, so deletion of all email after some number of days (e.g., 30 or 75) is bad.

Exploiting the New Rules, Rulings and Requirements
I couldn’t attend

Reengineering How E-Discovery is Practiced (and Managed) Using Data, Dash-Boarding and even Dynamic Organizations
I couldn’t attend

“ED” Talks: Industry Experts Tell Us What’s On Their Mind. And We All React
I couldn’t attend


Thoughts on e-discovery, computers, and software development.


Detecting Fraud Using Benford’s Law: Mathematical Details

Highlights from the ACEDS 2015 E-Discovery Conference