# Evaluating Dr. Shiva’s Claims of Election Fraud in Michigan

This article examines Dr. SHIVA Ayyadurai’s claims that the shape of some graphs generated from Michigan voting data suggests that the vote count was being fraudulently manipulated. To be clear, I am not making any claim about whether or not fraud occurred — I’m only addressing whether Dr. Shiva’s arguments about curve shape are convincing.

I’ll start by elaborating on some tweets I wrote (1, 2, 3) in response to Dr. Shiva’s first video (twitter, youtube) and I’ll respond to his second video (twitter, youtube) toward the end. Dr. Shiva makes various claims about what graphs should look like under normal circumstances and asserts that deviations are a signal that fraud has occurred. I use a little math to model reasonable voter behavior in order to determine what should be considered normal, and I find Dr. Shiva’s idea of normal to be either wrong or too limited. The data he considers to be so anomalous could easily be a consequence of normal voter behavior — there is no need to talk about fractional votes or vote stealing to explain it.

Here is a sample ballot for Michigan. Voters have the option to fill in a single box to vote for a particular party for all offices, referred to as a straight-party vote. Alternatively, they can fill in a box for one candidate for each office, known as an individual-candidate vote. In his first video, Dr. Shiva compares the percentage of straight-party votes won by the Republicans to the percentage of individual-candidate votes won by Trump and claims to observe patterns that imply some votes for Trump were transferred to Biden in a systematic way by an algorithm controlling the vote-counting machine.

x = proportion of straight-party votes going to the Republican party for a precinct
i = proportion of individual-candidate presidential votes going to Trump for a precinct
y = i – x

I’ll be using proportions (numbers between 0.0 and 1.0) instead of percentages to avoid carrying around a lot of factors of 100 in equations. I assume you can mentally convert from a proportion (e.g., 0.25) to the corresponding percentage (e.g., 25%) as needed. Equations are preceded by a number in brackets (e.g., [10]) to make it easy to reference them. You can click any of the graphs to see a larger version.
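As a rough sketch of how x, i, and y could be computed for one precinct, the snippet below uses made-up vote counts. The function name and its arguments are hypothetical; it assumes the published candidate totals include straight-party votes, so those are subtracted before computing i.

```python
def precinct_point(sp_rep, sp_all, trump_total, pres_all):
    """Return (x, i, y) as proportions for a single precinct.

    sp_rep      -- straight-party Republican votes
    sp_all      -- straight-party votes for all parties
    trump_total -- Trump votes, including straight-party votes
    pres_all    -- presidential votes for all candidates, including straight-party
    """
    x = sp_rep / sp_all
    ic_trump = trump_total - sp_rep   # individual-candidate votes for Trump
    ic_all = pres_all - sp_all        # individual-candidate votes, all candidates
    i = ic_trump / ic_all
    return x, i, i - x


# Made-up counts, for illustration only (not real precinct data).
x, i, y = precinct_point(sp_rep=300, sp_all=500, trump_total=520, pres_all=900)
print(f"x = {x:.3f}, i = {i:.3f}, y = {y:.3f}")
```

A negative y here simply means Trump did worse among individual-candidate voters than Republicans did among straight-party voters in that precinct.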

Dr. Shiva claims there are clear signs of fraud in three counties: Oakland, Macomb, and Kent. The data for each precinct in Kent County is available here (note that their vote counts for Trump and Biden include both straight-party and individual-candidate votes, so you have to subtract out the straight-party votes when computing i). If we represent each precinct as a dot in a plot of y versus x, we get (my graph doesn’t look as steep as Dr. Shiva’s because his vertical axis is stretched):

Dr. Shiva claims the data should be clustered around a horizontal line (video 1 at 22:08) and provides drawings of what he expects the graph to look like:

He asserts that the downward slope of the data for Kent County implies an algorithm is being used to switch Trump votes to Biden in the vote-counting machine. In precincts where there are more Republicans (large x), the algorithm steals votes more aggressively, causing the downward slope. As a sanity check on this claim, let’s look at things from Biden’s perspective instead of Trump’s. We define a set of variables similar to the ones used above, but put a prime by each variable to indicate that it is with respect to Biden votes instead of Trump votes:

x’ = proportion of straight-party votes going to the Democrat party for a precinct
i’ = proportion of individual-candidate presidential votes going to Biden for a precinct
y’ = i’ – x’

Here is the graph that results for Kent County:

If the Biden graph looks like the Trump graph just flipped around a bit, that’s not an accident. Proportions of the same whole must add up to 1, and third-party votes are small enough to neglect: third-party straight-party votes average 1.5% of the total with a max of 4.3%, and third-party individual-candidate votes average 3.3% with a max of 8.6%, so ignoring them won’t significantly change the overall shape of the graph. That gives:

[1]    x + x’ = 1
[2]    i + i’ = 1

which implies:

[3]    x’ = 1 – x
[4]    y’ = –y

Those equations mean we can find an approximation to the Biden graph by flipping the Trump graph horizontally around a vertical line x = 0.5 and then flipping it vertically around a horizontal line y = 0, like this (the result in the bottom right corner is almost identical to the graph computed above with third-party candidates included):

As a result, if the Trump data is clustered around a straight line, the Biden data must be clustered around a straight line with the same slope but different y-intercept, making it appear shifted vertically.
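The double flip is easy to check numerically. This sketch applies [3] and [4] to points on a hypothetical Trump-side line (the slope and intercept are made up) and confirms that the Biden-side line keeps the same slope while its intercept changes.

```python
def flip_to_biden(points):
    """Map Trump points (x, y) to approximate Biden points (x', y') using [3] and [4]."""
    return [(1 - x, -y) for x, y in points]


# Hypothetical Trump-side line y = m * x + c (values chosen only for illustration).
m, c = -0.5, 0.1
trump = [(k / 10, m * (k / 10) + c) for k in range(11)]
biden = sorted(flip_to_biden(trump))

# Slope and intercept of the Biden-side line from its two end points.
(x0, y0), (x1, y1) = biden[0], biden[-1]
biden_slope = (y1 - y0) / (x1 - x0)
biden_intercept = y0 - biden_slope * x0

print(f"Trump: slope {m}, intercept {c}")
print(f"Biden: slope {biden_slope:.3f}, intercept {biden_intercept:.3f}")
```

Algebraically, y = m * x + c becomes y’ = m * x’ – (m + c), which is what the end-point calculation recovers.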

The Biden graph slopes downward, so by Dr. Shiva’s reasoning an algorithm must be switching votes from Biden to Trump, and it does so more aggressively in precincts where there are a lot of Democrats (large x’). Wait, is Biden stealing from Trump, or is Trump stealing from Biden? We’ll come back to this question shortly.

Dr. Shiva shows Oakland County first in his video. I made a point of showing you Kent County first so you could see it without being biased by what you saw for Oakland County. This is Dr. Shiva’s graph of Trump votes for Oakland:

The data looks like it could be clustered around a horizontal line for x < 20%. Dr. Shiva argues that the algorithm kicks in and starts switching votes from Trump to Biden only for precincts having x > 20%.

We can look at it (approximately) in terms of Biden votes by flipping the Trump graph twice using the procedure outlined in Figure 5:

The data in the Biden graph appears to be clustered around a horizontal line for x’ > 80%, which is expected since it corresponds to x < 20% in the Trump graph. If you buy the argument that the data should follow a horizontal line when it is unmolested by the cheating algorithm, this pair of graphs finally answers the question of who is stealing votes from whom. Since the y-values for x > 20% are less than the y-values in the x < 20% (“normal”) region, Trump is being harmed in the cheating region. Consistent with that, Biden’s y’-value for x’ > 80% (the “normal” region) is about -5% and y’ is larger in the x’ < 80% region where the cheating occurs, so Biden benefits from the cheating. Trump is the one losing votes and Biden is the one gaining them, if you buy the argument about the “normal” state being a horizontal line.

Dr. Shiva draws a kinked line through the data for Kent (video 1 at 36:20) and Macomb (video 1 at 34:39) Counties with a flat part when x is small, similar to his graph for Oakland County, but if you look at the data without the kinked line to bias your eye, you probably wouldn’t think a kink is necessary — a straight line would fit just as well, which leaves open the question of who is taking votes from whom for those two counties.

Based on the idea that the data should be clustered around a horizontal line, Dr. Shiva claims that 69,000 Trump votes were switched to Biden by the algorithm in the three counties (video 1 at 14:00 or this tweet).

All claims of cheating, who is stealing votes from whom, and the specific number of votes stolen, are riding on the assumption that the data has to be clustered around a horizontal line in the graphs if there is no cheating. That critical assumption deserves the utmost scrutiny, and you’ll see below that it is not at all reasonable.

In Figure 8 below it is impossible for the data point for any precinct to lie in one of the orange regions because that would imply Trump received either more than 100% or less than 0% of the individual-candidate votes. For example, if x = 99%, you cannot have y = 10% because that implies Trump received 109% of the individual-candidate votes (i = y + x). Any model that gives impossible y-values for plausible x-values must be at least a little wrong. The only horizontal line that doesn’t encroach on one of the orange regions is y = 0.
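The forbidden regions follow directly from requiring i to stay between 0 and 1: since i = y + x, y must lie between –x and 1 – x. Here is a small sketch of that check (the function names are mine), including a scan showing that only the horizontal line y = 0 stays legal across the full range of x.

```python
def y_bounds(x):
    """Allowed range of y = i - x, given that 0 <= i <= 1."""
    return -x, 1 - x


def is_possible(x, y):
    """True if the point (x, y) implies a proportion i between 0 and 1."""
    lo, hi = y_bounds(x)
    return lo <= y <= hi


# The example from the text: x = 99% and y = 10% would need i = 109%.
print(is_possible(0.99, 0.10))

# Which horizontal lines y = h stay legal for every x from 0% to 100%?
flat_lines = {
    h: all(is_possible(k / 100, h) for k in range(101))
    for h in (-0.05, 0.0, 0.05)
}
print(flat_lines)
```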

Before we get into models of voter behavior that are somewhat realistic, let’s consider the simplest model possible that might mimic Dr. Shiva’s thinking to some degree. If the individual-candidate voters are all Republicans and Democrats in the exact same proportions as in the pool of straight-party voters, and all Republicans vote for Trump while all Democrats vote for Biden, we would expect i = x and therefore y = 0 (i.e., the data would cluster around a horizontal line with y = 0). Suppose 10% of Democrats in the individual-candidate pool decide to defect and vote for Trump while all of the Republicans vote for Trump. That would give i = x + 0.1 * (1 – x). The factor of (1 – x) represents the number of Democrats that are available to defect (there are fewer of them toward the right side of the graph). That gives y = 0.1 – 0.1 * x, which is a downward-sloping line that starts at y = 10% at the left edge of the graph and goes down to y = 0% at the right edge of the graph, thus never encroaching on the orange region in Figure 8. Dr. Shiva also talks about the possibility of Republicans defecting away from Trump (video 1 at 44:43) and shows data clustered around a horizontal line at y = -10%. Again applying the simplest possible thinking, if 10% of Republicans defected away from Trump we would have i = 0.9 * x, so y = -0.1 * x. Data would again cluster around a downward-sloping line. This time it would start at y = 0% at the left edge and go down to y = -10% at the right edge. The only possible horizontal line is y = 0. Everything else wants to slope downward.
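The two defection scenarios above can be written out in a few lines. This is only a sketch of the simplest-possible model from the preceding paragraph; the function names and the default 10% defection rate are illustrative.

```python
def y_democrats_defect(x, d=0.1):
    """All Republicans vote Trump; a fraction d of Democrats defect to Trump."""
    i = x + d * (1 - x)
    return i - x


def y_republicans_defect(x, d=0.1):
    """All Democrats vote Biden; a fraction d of Republicans defect from Trump."""
    i = (1 - d) * x
    return i - x


for x in (0.0, 0.5, 1.0):
    print(f"x = {x:.1f}: Democrats defect y = {y_democrats_defect(x):+.3f}, "
          f"Republicans defect y = {y_republicans_defect(x):+.3f}")
```

Both lines have slope –d, so neither scenario produces a horizontal line away from y = 0.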

The model described in the previous paragraph is really too simple to describe reality in most situations. There are no Independent voters in that model, and it assumes the individual-candidate voting pool has the same percentage of Republicans as the straight-party pool. In reality, individual-candidate voters shouldn’t be expected to be just like straight-party voters — they choose to vote that way for a reason. Below I lay out some simple models for how different types of voters might reasonably be expected to behave. The focus is on determining how things depend on x so we can compute the shape of the y versus x curve. After describing different types of individual-candidate voters, I explain how to combine the different types into a single model to generate the curve around which the data is expected to cluster. If the model accommodates the data that is observed, there is no need to talk about cheating or how it would impact the graphs — you cannot prove cheating (though it may still be occurring) if the graph is consistent with normal voter behavior. In the following, the equations relating x to i or y apply to the curves around which the data clusters, not the position of any individual precinct.

Type 1 (masochists): Imagine the individual-candidate voters are actually Republicans voting for Trump or Democrats voting for Biden, but they choose not to use the straight-party voting option for some reason. Perhaps a Republican intended to vote for the Republican candidate for every office, but didn’t notice the straight-party option, or perhaps he/she is a masochist who enjoys filling in lots of little boxes unnecessarily (I’ll call all Type 1 people masochists even though it really only applies to a subset of them because I can’t think of a better name). Maybe a Republican votes for the Republican candidate for every office except dog catcher because his/her best friend is the Democratic candidate for that office (can Republicans and Democrats still be friends?). With this model, the number of individual-candidate voters that vote for Trump is expected to be proportional to the number of Republicans. We don’t know how many Republicans there are in total, but we can assume the number is proportional to x, giving A * x individual-candidate votes for Trump where A is a constant (independent of x). Similarly, Biden would get A’ * (1 – x) individual-candidate votes. If all individual-candidate voters are of this type, we would have:
[5]    i = A * x / [A * x + A’ * (1-x)]
If the same proportion of Democrats are masochists as Republicans (A = A’), we have i = x, so y = i – x gives y = 0, meaning the data will be clustered around the horizontal line y = 0, which is consistent with the view Dr. Shiva espouses in his first video. This model does not, however, support data being clustered around a horizontal line at a y-value other than zero. If A is different from A’, the data will be clustered around a curve as shown in Figure 9 below.

Type 2 (Independents): Imagine the individual-candidate voters are true Independent voters. Perhaps they aren’t fond of either party, so they vote for the presidential candidate they like the most (or hate the least) and vote for the opposite party for any congressional positions to keep either party from having too much power, necessitating an individual-candidate vote instead of a straight-party vote. Maybe they vote for each candidate individually based on their merits and the candidates they like don’t happen to be in the same party. How should the proportion of Independents voting for Trump depend on x? Roughly speaking, it shouldn’t. The value of x tells what proportion of a voter’s neighbors are casting a straight-party vote for the Republicans compared to the Democrats. The Independent voter makes his/her own decision about who to vote for. The behavior of his/her neighbors should have little impact (perhaps they experience a little peer pressure or influence from political yard signs). Democrats are expected to mostly vote for Biden regardless of who their neighbors are voting for. Republicans are expected to mostly vote for Trump regardless of who their neighbors are voting for. Likewise, Independents are not expected to be significantly influenced by x. If all individual-candidate voters are of this type, we have i = b, where b is a constant (no x dependence), so y = b – x, meaning the data would be clustered around a straight line with slope -1 as shown in Figure 10 below.

Type 3 (defectors): In this case we have some percentage of Democrats defecting from their party to vote for Trump. Likewise, some percentage of Republicans defect to vote for Biden. This is mathematically similar to Type 1, except Trump now gets votes in proportion to (1 – x) instead of x, reflecting the fact that his individual-candidate votes increase when there are more Democrats available to defect. If all individual-candidate voters are of this type, we have:
[6]    i = C * (1 – x) / [C * (1 – x) + C’ * x]
If the same proportion of Democrats defect as Republicans, C = C’, we have i = 1 – x, so y = 1 – 2 * x, causing the data to cluster around a straight line with slope of -2. If C and C’ are different, the data will be clustered around a curve as shown in Figure 11 below.
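For readers who want to experiment, here is a sketch of the three pure voter types as functions of x, following [5], i = b, and [6]. The parameter values in the loop are arbitrary illustrations, not fitted values.

```python
def i_type1(x, A, A_prime):
    """Equation [5]: partisans who skip the straight-party option."""
    return A * x / (A * x + A_prime * (1 - x))


def i_type2(x, b):
    """Independents: Trump's share b does not depend on x."""
    return b


def i_type3(x, C, C_prime):
    """Equation [6]: defectors from each party."""
    return C * (1 - x) / (C * (1 - x) + C_prime * x)


# Arbitrary illustrative parameters with balanced behavior (A = A', C = C').
for x in (0.2, 0.5, 0.8):
    y1 = i_type1(x, 0.2, 0.2) - x   # flat at zero
    y2 = i_type2(x, 0.3) - x        # slope -1
    y3 = i_type3(x, 0.1, 0.1) - x   # slope -2
    print(f"x = {x:.1f}: type 1 y = {y1:+.2f}, type 2 y = {y2:+.2f}, type 3 y = {y3:+.2f}")
```

Setting A different from A’ (or C different from C’) in the calls above bends the Type 1 (or Type 3) result into a curve.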

Realistically, the pool of individual-candidate voters should have some amount of all three types of voters described above. To compute i, and therefore y, we need to add up the votes (not percentages) from various types of voters. We’ll need some additional notation:

NSP = total number of straight-party voters (all parties) for the precinct (this is known)
NIC = total number of individual-candidate voters (this is known)
I = number of Independent (Type 2) voters (not known)
v = number of individual-candidate votes for Trump
v’ = number of individual-candidate votes for Biden

The number of individual-candidate votes for Trump would be:
[7]    v = A * x * NSP + b * I + C * (1 – x) * NSP
and the number for Biden would be:
[8]    v’ = A’ * (1 – x) * NSP + (1 – b) * I + C’ * x * NSP
The total number of individual-candidate voters comes from adding those two expressions and regrouping the terms:
[9]    NIC = v + v’ = A’ * NSP + (A – A’) * x * NSP + I + C * NSP + (C’ – C) * x * NSP

The last equation tells us that if we divide the number of individual-candidate votes by the number of straight-party votes for each precinct, NIC / NSP, and graph it as a function of x, we expect the result to cluster around a straight line (assuming I / NSP is independent of x). If the behavior of Republicans and Democrats was exactly the same (A = A’ and C = C’), the straight line would be horizontal. Here is the graph:

The line was fit using a standard regression. The fact that it slopes strongly upward tells us Republicans and Democrats do not behave the same. A larger percentage of Republicans cast individual-candidate votes than Democrats, so Republican-heavy precincts (large x) have a lot more individual-candidate votes. The number of straight-party votes also increases with x, but not as dramatically, suggesting that Republican precincts either tend to have more voters or tend to have higher turnout rates. By requiring our model to match the straight line in the figure above, we can remove two degrees of freedom (corresponding to the line’s slope and intercept) from our set of six unknown parameters (A, A’, b, I/NSP, C, C’).
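The regrouping that leads to [9] can be double-checked numerically. The sketch below works per straight-party voter (everything divided by NSP, with ind_ratio standing for I / NSP) and compares v + v’ from [7] and [8] against the slope and intercept read off [9]. The parameter values are arbitrary, chosen only to exercise every term.

```python
def votes_per_nsp(x, A, A_prime, b, C, C_prime, ind_ratio):
    """Equations [7] and [8], each divided by NSP."""
    v = A * x + b * ind_ratio + C * (1 - x)
    v_prime = A_prime * (1 - x) + (1 - b) * ind_ratio + C_prime * x
    return v, v_prime


def line_from_eq9(A, A_prime, C, C_prime, ind_ratio):
    """Slope and intercept of NIC / NSP versus x according to [9]."""
    slope = (A - A_prime) + (C_prime - C)
    intercept = A_prime + C + ind_ratio
    return slope, intercept


# Arbitrary illustrative parameters (not the fitted Kent County values).
p = dict(A=0.4, A_prime=0.2, b=0.3, C=0.05, C_prime=0.02, ind_ratio=0.3)
slope, intercept = line_from_eq9(p["A"], p["A_prime"], p["C"], p["C_prime"], p["ind_ratio"])

max_gap = 0.0
for k in range(11):
    x = k / 10
    v, v_prime = votes_per_nsp(x, **p)
    max_gap = max(max_gap, abs((v + v_prime) - (slope * x + intercept)))

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, max gap = {max_gap:.2e}")
```

Note that b drops out of the total, which is why fitting this line constrains only two combinations of the remaining parameters.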

We compute y = i – x = v / NIC – x. Fitting the y versus x graph can remove two more degrees of freedom. To completely nail down the parameters, we need to make an assumption that will fix two more parameters. Since the slope of the y versus x graph for Kent County lies between 0 (Type 1 voters) and -1 (Type 2 voters), we will probably not do too much damage by assuming there are no Type 3 voters, so C = 0 and C’ = 0. We are now in a position to determine all of the remaining parameters by requiring the model to fit the NIC / NSP versus x data from Figure 12 and the y versus x data, giving:

[10]    A = 0.5, A’ = 0.09, b = 0.073, I / NSP = 0.41, C = 0, C’ = 0

The curve generated by the model is not quite a straight line — it shows a little bit of curvature in the graph above. That curvature is in good agreement with the data. If NIC depends on x, as it will when A is different from A’ or when C is different from C’, there will be some curvature to the y versus x graph. In other words, when there are differences between the behavior of Republicans and Democrats this simple model will generate a y versus x graph having curvature. When there is no difference in behavior, it gives a straight line.
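To see the shape the model produces, this sketch evaluates y = v / NIC – x using [7], [8], and the parameter values from [10]. It is only a re-evaluation of the formulas above, not the fitting procedure itself; ind_ratio stands for I / NSP.

```python
# Parameter values from [10]; ind_ratio is I / NSP.
PARAMS = dict(A=0.5, A_prime=0.09, b=0.073, ind_ratio=0.41, C=0.0, C_prime=0.0)


def model_y(x, A, A_prime, b, ind_ratio, C, C_prime):
    """Model curve y = i - x, with i = v / (v + v') from [7] and [8]."""
    v = A * x + b * ind_ratio + C * (1 - x)
    v_prime = A_prime * (1 - x) + (1 - b) * ind_ratio + C_prime * x
    return v / (v + v_prime) - x


y_left = model_y(0.0, **PARAMS)
y_mid = model_y(0.5, **PARAMS)
y_right = model_y(1.0, **PARAMS)

average_slope = y_right - y_left          # rise over a run of 1
bulge = y_mid - (y_left + y_right) / 2    # positive means upside-down bowl

print(f"y(0) = {y_left:+.3f}, y(0.5) = {y_mid:+.3f}, y(1) = {y_right:+.3f}")
print(f"average slope = {average_slope:+.3f}, bulge above chord = {bulge:+.3f}")
```

The average slope falls between 0 and -1, and the midpoint sits above the straight chord joining the end points, so the curvature is mildly negative.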

The relatively simple model seems to fit the data nicely. The remaining question is whether the parameter values are reasonable. If they are, we can conclude that the observed data is consistent with the way we expect voters to behave, so the graph does not suggest any fraud. If the parameter values are crazy, there may be fraud or our simple model of voter behavior may be inadequate. It will be easier to assess the reasonableness of our parameters if they are proportions (or percentages), which A, A’, C, and C’ aren’t. We would like to know the proportion of Republicans (or Democrats) voting in a particular way. We start by writing out the number of Republicans, R, according to the model as just the sum of straight-party Republican voters plus individual-candidate Republicans voting for Trump (the A term) and defector Republicans voting for Biden (the C’ term). A similar approach determines the number of Democrats.

[11]    R = x * NSP + A * x * NSP + C’ * x * NSP
[12]    D = (1 – x) * NSP + A’ * (1 – x) * NSP + C * (1 – x) * NSP

We now define new parameters:

a = the proportion of Republicans voting for Trump by individual-candidate ballot
a’ = the proportion of Democrats voting for Biden by individual-candidate ballot
c = the proportion of Democrats that defect to vote for Trump
c’ = the proportion of Republicans that defect to vote for Biden

[13]    a = A * x * NSP / R = A / (1 + A + C’)
[14]    a’ = A’ / (1 + A’ + C)
[15]    c = C * (1 – x) * NSP / D = C / (1 + A’ + C)
[16]    c’ = C’ / (1 + A + C’)

In this more convenient (for understanding, but not for writing equations) parameterization we have:

[17]    a = 0.33, a’ = 0.083, b = 0.073, I / NSP = 0.41, c = 0, c’ = 0

In words, 33% of Republicans use an individual-candidate ballot to vote for Trump instead of a straight-party vote. Only 8.3% of Democrats use an individual-candidate ballot to vote for Biden instead of a straight-party vote. Only 7.3% of Independents voted for Trump, with the other 92.7% voting for Biden. The number of Independent voters is, on average, 41% of the total number of straight-party voters, which means Independents are considerably less than 41% of all voters (since some Republicans and Democrats don’t vote straight-party). Some values seem a little extreme, such as only 7.3% of Independents voting for Trump, but none are completely pathological. Parameter values would shift around a bit if we allowed non-zero values for c and c’ (defectors). It is worth noting that when I talk about the number of Republicans, Democrats, and Independents, I am not talking about the number of people that registered that way — I base those numbers on their behavior (i.e., the assumption that the number of Republicans is proportional to x). With all of those things in mind, I think it is safe to say that the graphs are consistent with reasonable expectations of voter behavior (no need for fraud to explain the shape), but the parameter values shouldn’t be taken too seriously.
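The conversion from [10] to [17] is just arithmetic on [13] through [16], sketched below. The last function is my own addition: it estimates Independents as a share of all voters in the no-defector case, ignoring third parties, to illustrate why that share is well under 41%.

```python
def to_proportions(A, A_prime, C, C_prime):
    """Equations [13]-[16]: convert A, A', C, C' into proportions a, a', c, c'."""
    a = A / (1 + A + C_prime)
    a_prime = A_prime / (1 + A_prime + C)
    c = C / (1 + A_prime + C)
    c_prime = C_prime / (1 + A + C_prime)
    return a, a_prime, c, c_prime


def independent_share(x, A, A_prime, ind_ratio):
    """Independents / all voters, assuming C = C' = 0 and no third parties."""
    nic_over_nsp = A_prime + (A - A_prime) * x + ind_ratio   # from [9]
    return ind_ratio / (1 + nic_over_nsp)


a, a_prime, c, c_prime = to_proportions(A=0.5, A_prime=0.09, C=0.0, C_prime=0.0)
print(f"a = {a:.2f}, a' = {a_prime:.3f}, c = {c:.2f}, c' = {c_prime:.2f}")

for x in (0.0, 0.5, 1.0):
    share = independent_share(x, A=0.5, A_prime=0.09, ind_ratio=0.41)
    print(f"x = {x:.1f}: Independents are about {share:.0%} of all voters")
```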

The simple models above show a wide range of possible slopes for the data, going from 0 to -2 (when parameter values generate a straight line). A horizontal line (0 slope) requires there to be no Independent voters and no defectors. Furthermore, it requires the percentage of Republicans and Democrats choosing to use individual-candidate voting to be exactly the same (A = A’). The assumption that data should cluster around a horizontal line is really an extreme assumption that requires things to be perfectly balanced. Claiming that deviation from a horizontal line is a sign of fraud is like observing a coin toss come up heads or tails and proclaiming there must be cheating because a fair coin would have landed on its edge. Dr. Shiva’s videos never show an example of real data clustering around a horizontal line. He does show a graph for Wayne County (video 1 at 38:02) and claims it lacks the algorithmic cheating seen in the other three counties, but all of the data for Wayne County is confined to such a small range of x values that you can’t conclude much of anything about the slope.

Dr. Shiva’s second video starts by talking about signal detection and the importance of distinguishing the “normal state” from an “abnormal state” in various contexts. At 50:08 he states: “What we didn’t share in the first video is what is a normal state?” This would be a good time to scroll up and take a second look at Figure 2, which is a screen shot from the first video. He now claims the normal state would be to have the data in the y versus x graph clustered around a parabola. Horizontal lines are gone. Claims about the number of votes stolen based on expecting the data to follow a horizontal line are forgotten. He provides this graph from another election, Jeff Sessions for Senate in 2008, as his first example of the normal state:

The graph has negative curvature, meaning it is shaped like an upside down bowl. Positive curvature would be shaped like a bowl that is right-side up. He provides two more examples that also have negative curvature. He proclaims that there must be cheating in Oakland, Macomb, and Kent counties, not because they slope downward, but because they are too straight. As before, I’m going to flip the graph twice to see what it would look like in terms of Jeff Sessions’ competitor’s votes:

The flipped graph has positive curvature. If negative curvature is normal, positive curvature must also be normal. A straight line is just zero curvature. If some amount of negative curvature is normal and a similar amount of positive curvature is normal, it would be weird, but not impossible, for curvature values in between to be abnormal (note that this is a very different argument from what I said about the horizontal line y = 0, because that case was at the extreme end of the spectrum of possibilities, not in the middle). Anyway, I already showed a reasonable model of voter behavior accommodates both significant curvature (Figure 9) and straight lines (Figures 10 and 11), and I showed that Kent County has a little bit of curvature (Figure 13).
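The sign change in curvature under the double flip can be confirmed with a second difference. The parabola below is a hypothetical stand-in for an upside-down-bowl curve, not a fit to the Sessions data.

```python
def second_difference(f, x, h=0.1):
    """Discrete curvature: negative for an upside-down bowl, positive for a bowl."""
    return f(x + h) - 2 * f(x) + f(x - h)


def flipped(f):
    """Apply x' = 1 - x and y' = -y to a curve y = f(x)."""
    def g(x_prime):
        return -f(1 - x_prime)
    return g


def candidate_curve(x):
    """Hypothetical curve with negative curvature."""
    return -(x - 0.4) ** 2


competitor_curve = flipped(candidate_curve)

curv_candidate = second_difference(candidate_curve, 0.5)
curv_competitor = second_difference(competitor_curve, 0.5)
print(f"candidate: {curv_candidate:+.3f}, competitor: {curv_competitor:+.3f}")
```

The two values are equal in size and opposite in sign, which holds for any curve since the horizontal flip leaves curvature alone and the vertical flip negates it.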

Dr. Shiva explains his claim that the normal state should be a parabola using this graph:

He claims there should be three different behaviors, resulting in a parabola, because there are three regions representing different types of voters. I think the labeling of the voters along the bottom of Figure 16 reveals some confused thinking. Why are there Independents in the middle section? Why does the quantity of Independents depend on the percentage of straight-party voters that vote Republican (i.e., the value of x)? Do Independents move out of the neighborhood if the number of Trump signs and Biden signs in the neighborhood are too far out of balance, or is the number of Independents really a separate variable (a third dimension, with Democrats and Republicans being the other two)? In my simple model above, which could certainly be wrong, curvature comes from differences in behavior between Republicans and Democrats (Figure 9), whereas more Independents makes the curve straighter (Figure 10).

Dr. Shiva introduces some new graphs in the second video at 1:02:21 that he claims are additional evidence of problems in the three counties. Instead of working with percentages, he uses the raw number of votes. He graphs the number of individual-candidate votes for Trump, v, versus the number of straight-party votes for the Republicans, w. Similarly, he graphs the number of individual-candidate votes for Biden, v’, versus the number of straight-party votes for the Democrats, w’. He overlaid them on the same graph, but I’ll separate them for clarity. Here are the results for Kent County:

I fit the lines with a standard regression because it is not quite possible to generate predicted curves using our model. Dr. Shiva’s concern is that the two graphs are so different. Specifically, the data in the Trump graph in Figure 17 is very tightly clustered around the straight line, whereas the Biden graph in Figure 18 shows the data to be much more spread out. We’ll return to that point after talking a bit about how the graphs relate to the model we used on the Kent County data earlier.

Expressions for v and v’ for our model were given earlier in Equations [7] and [8]. Noting that w = x * NSP, and w’ = (1 – x) * NSP, we can write v and v’ in terms of w and w’:

[18]    v = A * w + b * I
[19]    v’ = A’ * w’ + (1 – b) * I

The problem with graphing the model’s prediction is that I is a function of x with positive slope (our model treated I / NSP as a constant with value 0.41, but NSP itself depends on x as noted earlier), so we can’t use Equations [18] and [19] to graph the model curve. We can do some basic checks for consistency with our model, however. The Independent voter term contributes very little to v because b is small (Trump only gets 7.3% of the Independent vote in our model). So the slope of the v versus w curve should be a little more than A, which is 0.5, and the line in Figure 17 has a slope of 0.52. The slope of the v’ versus w’ curve should be A’, which is 0.09, plus (1 – b) times whatever I contributes to the slope. The line in Figure 18 has a slope of 0.26, which is larger than 0.09, as required, but it is unclear whether it is too large. The ratio of the y-intercepts for the lines in Figures 18 and 17 should be (1 – b) / b, which is 12.7, compared to 13.8 for the lines fitted to the data in the graphs.
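The consistency checks in the preceding paragraph amount to a few comparisons, collected here as a sketch. The fitted slopes and intercept ratio are the values quoted in the text for Figures 17 and 18; the dictionary keys are my own labels.

```python
# Model parameters from [10].
A, A_prime, b = 0.5, 0.09, 0.073

# Values read from the regression lines, as quoted in the text.
fitted = {"trump_slope": 0.52, "biden_slope": 0.26, "intercept_ratio": 13.8}

# From [18] and [19]: intercepts scale as b * I and (1 - b) * I.
expected_ratio = (1 - b) / b
ratio_gap = fitted["intercept_ratio"] / expected_ratio - 1

checks = {
    "Trump slope at least A": fitted["trump_slope"] >= A,
    "Biden slope at least A'": fitted["biden_slope"] >= A_prime,
}

print(f"expected intercept ratio = {expected_ratio:.1f}")
print(f"fitted ratio differs by {ratio_gap:.1%}")
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

These are necessary conditions only; passing them does not show the Biden slope is the right size, which the text already flags as unclear.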

While our model doesn’t say anything quantitative about the spread expected for the data, it can give us some qualitative guidance. The source of most of the individual-candidate votes for Trump is Republicans that choose to vote for individual candidates (masochists) rather than straight party. He gets only 7.3% of the Independent vote. By contrast, Biden gets a lot of his individual-candidate votes from Independents. This is reflected in Biden’s graph having a relatively large y-intercept. He gets around 200 votes even for precincts where there are no Democrats around to vote for him (w’ = 0 implies no straight-party Democrat voters and presumably very few individual-candidate voting Democrats) because he has 92.7% of the Independents.

We expect 33% of Republicans to vote for Trump with an individual-candidate ballot on average. We wouldn’t be surprised if some precincts have 25% or 40% instead of 33%, but we wouldn’t expect something wild like 10% or 80%, so the data points are expected to stay pretty close to the line for Trump. On the other hand, Biden gets a lot of votes from Independents and the number of Independents is expected to vary a lot between precincts. The number of Republicans varies a lot from precinct to precinct (based on x ranging from 10% to 80%), so it is reasonable to expect similar variation in the number of Independents, causing a large spread in Biden’s graph. The differences between Figures 17 and 18 are not surprising in light of the very different nature of the individual-candidate voters for Trump and Biden, which we already knew about due to the slope of Figure 12.

In summary, Dr. Shiva is right when he says it is important to distinguish normal behavior from abnormal behavior when trying to identify manipulated data. Where he comes up short is in determining what normal behavior should look like. If the data is consistent with a reasonable model of human behavior, it is normal and cannot be considered evidence of fraud. In his first video he claims a horizontal line is the only normal state, but in reality a horizontal line other than y = 0 would be highly abnormal. His second video gets closer to reality when claiming the normal state should be a parabola, but that is too limited — data with little or no curvature is perfectly reasonable, too.

# Highlights from IG3 West 2019

IG3 West was held at the Pelican Hill Resort in Newport Coast, California.  It consisted of one day of product demos followed by one day of talks.  The talks were divided into two simultaneous sessions throughout the day, so I could only attend half of them.  My notes below provide some highlights from the talks I attended.  You can find my full set of photos here.

Technology Solution Update from Corporate, Law Firm and Service Provider Perspective
How do we get the data out of the free version of Slack?  It is hard to get the data out of Office 365.  Employees are bringing in technologies such as Slack without going through the normal decision making process.  IT and legal don’t talk to each other enough.  When doing a pilot of legal hold software, don’t use it on a custodian that is on actual hold because something might go wrong.  Remember that others know much less than you, so explain things as if you were talking to a third grader.   Old infrastructure is a big problem.  Many systems haven’t really been tested since Y2K.  Business continuity should be a top priority.

Staying on Pointe: The Striking Similarities Between Ballet and eDiscovery
I wasn’t able to attend this one.

Specialized eDiscovery: Rethinking the Notion of Relevancy
Does traditional ediscovery still work?  The traditional ways of communicating and creating data are shrinking.  WeChat and WhatsApp are now popular.  Prepare the client for litigation by helping the client find all sources of data and format the exotic data.  The requesting party may want native format (instead of PDF) to get the metadata, but keep in mind that you may have to pay for software to examine data that is in exotic formats.  Slack metadata is probably useless (there is no tool to analyze it).  Be careful about Ring doorbells and home security systems recording audio (e.g., recording a contractor working in your home), because recording audio is illegal in some areas if you haven’t provided notification to the person being recorded.  Chat, voice, and video are known problems.  Emojis with skins and legacy data are less-known problems.  Before you end up in litigation, make sure IT people are trained on where data is and how to produce it.  If you are going to delete data (e.g., to reduce risk of high ediscovery costs in the future), make sure you are consistent about it (e.g., delete all emails after 3 months unless they are on hold).  Haphazard deletion is going to raise questions.  Even if you are consistent about deletion, you may still encounter a judge who questions why you didn’t just save everything because doing so is easier.  Currently, people don’t often go after text messages, but it depends on the situation.  Some people only text (no emails).  Oddest sources of data seen: a Venmo comment field indicating why a payment was made, and chat from an online game.

SaaS or Vendor – An eDiscovery Conversation
I wasn’t able to attend this one.

Ick, Math!  Ensuring Production Quality
I moderated this panel, so I didn’t take notes.  You can find the slides here.

Still Looking for the Data
I wasn’t able to attend this one.

“Small” Data in the Era of “Big” Data
Data minimization reduces the data that can be misused or leaked by deleting it or moving it to more secure storage when it is no longer needed.  People need quick access to the insights from the data, not the raw data itself.  Most people no longer see storage cost as a driver for data minimization, though some do (can be annoying to add storage when maintaining your own secure infrastructure).  A survey by CTRL found that most people say IT should be responsible for the data minimization program.  Legal/compliance should have a role, too.  When a hacker gets into your system, he/she is there for over 200 days on average — lots of time to learn about your data.  Structured data is usually well managed/mapped (85%), but unstructured is not (15%).  Ephemeral technology solves the deletion problem by never storing the data.  Social engineering is one of the biggest ways that data gets out.

Mobile Device Forensics 2020: An FAQ Session Regarding eDiscovery and Data Privacy Considerations for the Coming Year

The Human Mind in the Age of Intelligent Machines
I wasn’t able to attend this one.

# Highlights from Text Analytics Forum 2019

Text Analytics Forum is part of the KMWorld conference. It was held on November 6-7 at the JW Marriott in D.C. Attendees went to the large KMWorld keynotes in the morning and had two parallel text analytics tracks for the remainder of the day. There was a technical track and an applications track. Most of the slides are available here. My photos, including photos of some slides that caught my attention or were not available on the website, are available here. Since most slides are available online, I have only a few brief highlights below.

Automatic summarization comes in two forms: extracted and generative.  Generative summarization doesn’t work very well, and some products are dropping the feature.  Enron emails containing lies tend to be shorter.  When a customer threatens to cancel a service, the language they use may indicate they are really looking to bargain.  Deep learning works well with data, but not with concepts.  For good results, make use of all document structure (titles, boldface, etc.) — search engines often ignore such details.  Keywords assigned to a document by a human are often unreliable or inconsistent.  Having the document’s author write a summary may be more useful.  Rules work better when there is little content (machine learning prefers more content).  Knowledge graphs, which were a major topic at the conference, are better for discovery than for search.

DBpedia provides structured data from Wikipedia for knowledge graphs.  SPARQL is a standardized language for graph databases similar to SQL for relational databases.  When using knowledge graphs, the more connections away the answer is, the more likely it is to be wrong.  Knowledge graphs should always start with a good taxonomy or ontology.

Social media text (e.g., tweets) contains a lot of noise.  Some software handles both social media and normal text, but some only really works with one or the other.  Sentiment analysis can be tripped up when only looking at keywords.  For example, compare “product worked terribly” to “I’m terribly happy with the product.”  Humans are only 60-80% accurate at sentiment analysis.

# Highlights from Relativity Fest 2019

Relativity Fest celebrated its tenth anniversary at the Hilton in Chicago.  It featured as many as sixteen simultaneous sessions and was attended by about 2,000 people.  You can find my full set of photos here.

The show was well-organized and there were always plenty of friendly staff around to help.  The keynote introduced the company’s new CEO, Mike Gamson.  Various staff members talked about new functionality that is planned for Relativity.  A live demo of the coming Aero UI highlighted its ability to display very large (dozens of MB) documents quickly.

I mostly attended the developer sessions.  During the first full day, the sessions I attended were packed and there were people standing in the back.  It thinned out a bit during the remaining days.  The on-premises version of Relativity will be switching from quarterly releases to annual releases because most people don’t want to upgrade so often.  Relativity One will have updates quarterly or faster.  There seems to be a major push to make APIs more uniform and better documented.  There was also a lot of emphasis on reducing breakage of third party tools with new releases.

# Highlights from IG3 Mid-Atlantic 2019

The first Mid-Atlantic IG3 was held at the Watergate Hotel in Washington, D.C. It was a day and a half long with a keynote followed by two concurrent sets of sessions.  I’ve provided some notes below from the sessions I was able to attend.  You can find my full set of photos here.

Big Foot, Aliens, or a Culture of Governance: Are Any of Them Real?
In 2012 12% of companies had a chief data officer, but now 63.4% do.  Better data management can give insight into the business.  It may also be possible to monetize the data.  Cigna has used Watson, but you do have to put work into teaching it.  Remember the days before GPS, when you had to keep driving directions in your head or use printed maps.  Data is now more available.

Practical Applications of AI and Analytics: Gain Insights to Augment Your Review or End It Early
Opposing counsel may not even agree to threading, so getting approval for AI can be a problem.  If the requesting party is the government, they want everything and they don’t care about the cost to you.  TAR 2.0 allows you to jump into review right away with no delay for training by an expert, and it is becoming much more common.  TAR 1.0 is still used for second requests [presumably to produce documents without review].  With TAR 1.0 you know how much review you’ll have to do if you are going to review the docs that will potentially be produced, whereas you don’t with TAR 2.0 [though you could get a rough estimate with additional sampling].  Employees may utilize code words, and some people such as traders use unique lingo — will this cause problems for TAR?  It is useful to use unsupervised learning (clustering) to identify issues and keywords.  Negotiation over TAR use can sometimes be more work than doing the review without TAR.  It is hard to know the size of the benefit that TAR will provide for a project in advance, which can make it hard to convince people to use it.  Do you have to disclose the use of TAR to the other side?  If you are using it to cull, rather than just to prioritize the review, probably.  Courts will soon require or encourage the use of TAR.  There is a proportionality argument that it is unreasonable to not use it.  Data volumes are skyrocketing.  90% of the data in the world was created in the last 2 years.

Is There Room for Governance in Digital Transformation?
I wasn’t able to attend this one.

Investigative Analytics and Machine Learning; The Right Mindset, Tools, and Approach can Make all the Difference
You can use e-discovery AI tools to get the investigation going.  Some people still use paper, and the meta data from the label on the box containing the documents may be all you have.  While keyword search may not be very effective, the query may be a starting point for communicating what the person is looking for so you can figure out how to find it.  Use clustering to look for outliers.  Pushing people to use tech just makes them hate you.  Teach them in a way that is relatable.  Listen to the people that are trying to learn and see what they need.  Admit that tech doesn’t always work.  Don’t start filtering the data down too early — you need to understand it first.  It is important to be able to predict things such as cost.  Figure out which people to look at first (tiering).  Convince people to try analytics by pointing out how it can save time so they can spend more time with their kids.  Tech vendors need to be honest about what their products can do (users need to be skeptical).

CCPA and New US Privacy Laws Readiness
I wasn’t able to attend this one.

Ick, Math! Ensuring Production Quality
I moderated this panel, so I didn’t take notes.

Effective Data Mapping Policies and Avoiding Pitfalls in GDPR and Data Transfers for Cross-Border Litigations and Investigations
I wasn’t able to attend this one.

Technology Solution Update From Corporate, Law Firm and Service Provider Perspective
I wasn’t able to attend this one.

Selecting eDiscovery Platforms and Vendors
People often pick services offered by their friends rather than doing an unbiased analysis.  Often do an RFI, then RFP, then POC to see what you really get out of the system.  Does the vendor have experience in your industry?  What is billable vs non-billable?  Are you paying for peer QC?  What does data in/out mean for billing?  Do a test run with the vendor before making any decisions for the long term.  Some vendors charge by the user, instead of, or in addition to, charging based on data volume.  What does “unlimited” really mean?  Government agencies tend to demand a particular way of pricing, and projects are usually 3-5 years.  Charging a lot for a large number of users working on a small database really annoys the customer.  Per-user fees are really a Relativity thing, and other platforms should not attempt it.  Firms will bring data in house to avoid user fees unless the data is too big (e.g., 10GB).  How do dupes impact billing?  Are they charging to extract a dupe?  Concurrent user licenses were annoying, so many moved to named user licenses (typically 4 or 5 to one).  Concurrent licenses may have a burst option to address surges in usage, perhaps setting to the new level.  Some people use TAR on all cases while others in the firm/company never use it, so keep that in mind when licensing it.  Forcing people to use an unfamiliar platform to save money can be a mistake since there may be a lot of effort required to learn it.

eDiscovery Support and Pricing Model — Do we have it all Wrong?
Various pricing models: data in/out + hosting + reviewers, based on number of custodians, or bulk rate (flat monthly fee).  Redaction, foreign language, and privilege logs used to be separate charges, but there is now pressure to include them in the base fee.  Some make processing free but compensate by raising the rate for review.  RFP / procurement is a terrible approach for ediscovery because you work with and need to like the vendor/team.  Ask others about their experience with the vendor, though there is now less variability in quality between the vendors.  Encourage the vendor to make suggestions and not just be an order-taker.  Law firms often blame the vendor when a privileged document is produced, and the lack of transparency about what really happened is frustrating.  The client needs good communication with both the law firm and the vendor.  Law firms shouldn’t offer ediscovery services unless they can do it as well as the vendors (law firms have a fiduciary duty).

Still Looking for the Data
I wasn’t able to attend this one.

Recycling Your eDiscovery Data: How Managing Data Across Your Portfolio can Help to Reduce Wasteful Spending
I wasn’t able to attend this one.

Ready, Fire, Aim!  Negotiating Discovery Protocols
The Mandatory Initial Discovery Pilot Program in the Northern District of Illinois and Arizona requires production within 70 days from filing in order to motivate both sides to get going and cooperate.  One complaint about this is that people want a motion to dismiss to be heard before getting into ediscovery.  Can’t get away with saying “give us everything” under the pilot program since there is not enough time for that to be possible.  Nobody wants to be the unreasonable party under such a tight deadline.  The Commercial Division of the NY Supreme Court encourages categorical privilege logs.  You describe the category, say why it is privileged, and specify how many documents were redacted vs being withheld in their entirety.  Make a list of third parties that received the privileged documents (not a full list of all from/to).  It can be a pain to come up with a set of categories when there is a huge number of documents.  When it comes to TAR protocols, one might disclose the tool used or whether only the inclusive email was produced.  Should the seed set size or elusion set size be disclosed?  Why is the producing party disclosing any of this instead of just claiming that their only responsibility is to produce the documents?  Disclosing may reduce the risk of having a fight over sufficiency.  Government regulators will just tell you to give them everything exactly the way they want it.  When responding to a criminal antitrust investigation you can get in trouble if you standardize the timezone in the data.  Don’t do threading without consent.  A second request may require you to provide a list of all keywords in the collection and their frequencies.  Be careful about orders requiring you to produce the full family — this will compel you to produce non-responsive attachments.

Document Review Pricing Reset
A common approach is hourly pricing for everything (except hosting).  This may be attractive to the customer because other approaches require the vendor to take on risk that the labor will be more than expected and they will build that into the price.  If the customer doesn’t need predictable cost, they won’t want to pay (implicitly) for insurance against a cost overrun.  It is a choice between predictability of cost and lowest cost.  Occasionally review is priced on a per-document basis, but it is hard to estimate what the fair price is since data can vary.  Per-document pricing puts some pressure on the review team to better manage the process for efficiency.  Some clients are asking for a fixed price to handle everything for the next three years. A hybrid model has a fixed monthly fee with a lower hourly rate for review, with the lower hourly review making paying for extra QC review less painful.  Using separate vendors and review companies can have a downside if reviewers sit idle while the tech is not ready.  On the other hand, if there are problems with the reviewers it is nice to have the option to swap them out for another review team.

Finding Common Ground: Legal & IT Working Together
I wasn’t able to attend this one.

# Highlights from EDRM Workshop 2019

The annual EDRM Workshop was held at Duke Law School starting on the evening of May 15th and ending at lunch time on the 17th.  It consisted of a mixture of panels, presentations, working group reports, and working sessions focused on various aspects of e-discovery.  I’ve provided some highlights below.  You can find my full set of photos here.

Herb Roitblat presented a paper on fear of missing out (FOMO).  If 80% recall is achieved, is it legitimate for the requesting party to be concerned about what may have been missed in the 20% of the responsive documents that weren’t produced, or are the facts in that 20% duplicative of the facts found in the 80% that was produced?

A panel discussed the issues faced by in-house counsel.  Employees want to use the latest tools, but then you have to worry about how to collect the data (e.g., Skype video recordings).  How to preserve an iPhone?  What if the phone gets lost or stolen?  When doing TAR, can the classifier/model be moved between cases/clients?  New vendors need to be able to explain how they are unique, they need to get established (nobody wants to be on the cutting edge, and it’s hard to get a pilot going), and they should realize that it can take a year to get approval.  There are security/privacy problems with how law firms handle email.  ROI tracking is important.  Analytics is used heavily in investigations, and often in litigation, but they currently only use TAR for prioritization and QC, not to cull the population before review.  Some law firms are averse to putting data in the cloud, but cloud providers may have better security than law firms.

The GDPR team is working on educating U.S. judges about GDPR and developing a code of conduct.  The EDRM reference will be made easier to update.  The AI group is focused on AI in legal (e.g., estimating recidivism, billing, etc.), not implications of AI for the law.  The TAR group’s paper is out.  The Privilege Logs group wants to avoid duplicating Sedona’s effort (sidenote: lawyers need to learn that an email is not priv just because a lawyer was CC’ed on it).  The Stop Words team is trying to educate people about things such as regular expressions, and warned about cases where you want to search for a single letter or a term such as “AN” (for ammonium nitrate).  The Proportionality group talked about the possibility of having a standard set of documents that should be produced for certain types of cases and providing guidelines for making proportionality arguments to the court.

A panel of judges said that cybersecurity is currently a big issue.  Each court has its own approach to security.  Rule 16 conferences need to be taken seriously.  Judges don’t hire e-discovery vendors, so they don’t know costs.  How do you collect a proprietary database?  Lawyers can usually work it out without the judge.  There is good cooperation when the situations of the parties aren’t too asymmetric.  Attorneys need to be more specific in document requests and objections (no boilerplate).  Attorneys should know the case better than the judge, and educate the judge in a way that makes the judge look good.  Know the client’s IT systems and be aware of any data migration efforts.  Stay up on technology (e.g., Slack and text messages).  Have a 502(d) order (some people object because they fear the judge will assume priv review is not needed, but the judges didn’t believe that would happen).  Protect confidential information that is exchanged (what if there is a breach?).   When filing under seal, “attorney’s eyes only” should be used very sparingly, and “confidential” is overused.

# TAR vs. Keyword Search Challenge, Round 6 (Instant Feedback)

This was by far the most significant iteration of the ongoing exercise where I challenge an audience to produce a keyword search that works better than technology-assisted review (also known as predictive coding or supervised machine learning).  There were far more participants than previous rounds, and a structural change in the challenge allowed participants to get immediate feedback on the performance of their queries so they could iteratively improve them.  A total of 1,924 queries were submitted by 42 participants (an average of 45.8 queries per person) and higher recall levels were achieved than in any prior version of the challenge, but the audience still couldn’t beat TAR.

In previous versions of the experiment, the audience submitted search queries on paper or through a web form using their phones, and I evaluated a few of them live on stage to see whether the audience was able to achieve higher recall than TAR.  Because the number of live evaluations was so small, the audience had very little opportunity to use the results to improve their queries.  In the latest iteration, participants each had their own computer in the lab at the 2019 Ipro Tech Show, and the web form evaluated the query and gave the user feedback on the recall achieved immediately.  Furthermore, it displayed the relevance and important keywords for each of the top 100 documents matching the query, so participants could quickly discover useful new search terms to tweak their queries.  This gave participants a significant advantage over a normal e-discovery scenario, since they could try an unlimited number of queries without incurring any cost to make relevance determinations on the retrieved documents in order to decide which keywords would improve the queries.  The number of participants was significantly larger than any of the previous iterations, and they had a full 20 minutes to try as many queries as they wanted.  It was the best chance an audience has ever had of beating TAR.  They failed.

To do a fair comparison between TAR and the keyword search results, recall values were compared for equal amounts of document review effort.  In other words, for a specified amount of human labor, which approach gave the best production?  For the search queries, the top 3,000 documents matching the query were evaluated to determine the number that were relevant so recall could be computed (the full population was reviewed in advance, so the relevance of all documents was known). That was compared to the recall for a TAR 3.0 process where 200 cluster centers were reviewed for training and then the top-scoring 2,800 documents were reviewed.  If the system was allowed to continue learning while the top-scoring documents were reviewed, the result was called “TAR 3.0 CAL.”  If learning was terminated after review of the 200 cluster centers, the result was called “TAR 3.0 SAL.”  The process was repeated with review of 6,000 documents instead of 3,000 so you can see how much recall improves if you double the review effort.  Participants could choose to submit queries for any of three topics: biology, medical industry, or law.
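The equal-effort comparison described above can be sketched in a few lines of Python.  This is only an illustration with made-up toy data, not the code used in the challenge: the function name `recall_at_cutoff` and the two ranked lists are hypothetical, and the cutoffs are tiny stand-ins for 3,000 and 6,000.  For TAR, the real cutoff included the 200 training documents, which this toy version does not model.

```python
def recall_at_cutoff(ranked_relevance, total_relevant, cutoff):
    """Recall after reviewing the first `cutoff` documents of a ranked list.

    ranked_relevance: 1 (relevant) or 0 (not relevant) for each document,
    in the order the documents would be reviewed.
    total_relevant: relevant documents in the whole population, including
    any that never appear in the ranked list.
    """
    found = sum(ranked_relevance[:cutoff])
    return found / total_relevant


# Hypothetical toy data (not from the challenge).
keyword_hits = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # documents matching a query
tar_ranking = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]   # documents sorted by TAR score
total_relevant = 6

# Compare both approaches at the same amount of review effort.
for cutoff in (4, 8):
    kw = recall_at_cutoff(keyword_hits, total_relevant, cutoff)
    tar = recall_at_cutoff(tar_ranking, total_relevant, cutoff)
    print(f"review {cutoff} docs: keyword recall={kw:.2f}, TAR recall={tar:.2f}")
```

The point of fixing the cutoff is that both approaches are charged for the same amount of human review, so the recall values are directly comparable.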

The results below labeled “Avg Participant” are computed by finding the highest recall achieved by each participant and averaging those values together.  These are surely somewhat inflated values since one would probably not go through so many iterations of honing the queries in practice (especially since evaluating the efficacy of a query would normally involve considerable labor instead of being free and instantaneous), but I wanted to give the participants as much advantage as I could and including all of the queries instead of just the best ones would have biased the results to be too low due to people making mistakes or experimenting with bad queries just to explore the documents.  The results labeled “Best Participant” show the highest recall achieved by any participant (computed separately for Top 3,000 and Top 6,000, so they may be different queries).
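To make the two summary statistics concrete, here is a minimal sketch of how they could be computed from a log of (participant, recall) pairs.  The function name `summarize` and the sample numbers are hypothetical and are not the challenge data.

```python
from collections import defaultdict


def summarize(query_results):
    """Return (avg_participant, best_participant) recall.

    query_results: iterable of (participant_id, recall) pairs, one per query.
    """
    best = defaultdict(float)
    for participant, recall in query_results:
        # Keep only the highest recall each participant achieved.
        best[participant] = max(best[participant], recall)
    avg_participant = sum(best.values()) / len(best)
    best_participant = max(best.values())
    return avg_participant, best_participant


# Hypothetical log: participant 1 and 2 each tried two queries.
results = [(1, 40.0), (1, 55.0), (2, 10.0), (2, 30.0), (3, 65.0)]
avg, top = summarize(results)
mean_of_all_queries = sum(r for _, r in results) / len(results)
print(avg, top, mean_of_all_queries)
```

In this toy log the average of per-participant bests is 50.0 while the mean over all queries is 40.0, which illustrates why including every exploratory query would pull the number down.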

| Biology Recall (%) | Top 3,000 | Top 6,000 |
| --- | --- | --- |
| Avg Participant | 54.5 | 69.5 |
| Best Participant | 66.0 | 83.2 |
| TAR 3.0 SAL | 72.5 | 91.0 |
| TAR 3.0 CAL | 75.5 | 93.0 |

| Medical Recall (%) | Top 3,000 | Top 6,000 |
| --- | --- | --- |
| Avg Participant | 38.5 | 51.8 |
| Best Participant | 46.8 | 64.0 |
| TAR 3.0 SAL | 67.3 | 83.7 |
| TAR 3.0 CAL | 80.7 | 88.5 |

| Law Recall (%) | Top 3,000 | Top 6,000 |
| --- | --- | --- |
| Avg Participant | 43.1 | 59.3 |
| Best Participant | 60.5 | 77.8 |
| TAR 3.0 SAL | 63.5 | 82.3 |
| TAR 3.0 CAL | 77.8 | 87.8 |

As you can see from the tables above, the best result for any participant never beat TAR (SAL or CAL) when there was an equal amount of document review performed.  Furthermore, the average participant result for Top 6,000 never beat the TAR results for Top 3,000, though the best participant result sometimes did, so TAR typically gives a better result even with half as much review effort expended.  The graphs below show the best results for each participant compared to TAR in blue.  The numbers in the legend are the ID numbers of the participants (the color for a particular participant is not consistent across topics).
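For readers who want to check those statements against the numbers, the short script below encodes the recall values reported for the three topics and tests each claim.  The dictionary layout and the helper name `beats` are my own choices for illustration.

```python
# Recall (%) as (Top 3,000, Top 6,000), copied from the reported results.
recall = {
    "Biology": {"Avg Participant": (54.5, 69.5), "Best Participant": (66.0, 83.2),
                "TAR 3.0 SAL": (72.5, 91.0), "TAR 3.0 CAL": (75.5, 93.0)},
    "Medical": {"Avg Participant": (38.5, 51.8), "Best Participant": (46.8, 64.0),
                "TAR 3.0 SAL": (67.3, 83.7), "TAR 3.0 CAL": (80.7, 88.5)},
    "Law": {"Avg Participant": (43.1, 59.3), "Best Participant": (60.5, 77.8),
            "TAR 3.0 SAL": (63.5, 82.3), "TAR 3.0 CAL": (77.8, 87.8)},
}
TOP_3000, TOP_6000 = 0, 1
tar_methods = ("TAR 3.0 SAL", "TAR 3.0 CAL")


def beats(topic, who, who_col, tar_col):
    """List the TAR variants whose recall is strictly below `who`'s recall."""
    row = recall[topic]
    return [m for m in tar_methods if row[who][who_col] > row[m][tar_col]]


for topic in recall:
    # Equal review effort: the best participant should beat neither variant.
    equal_effort = [beats(topic, "Best Participant", c, c) for c in (TOP_3000, TOP_6000)]
    # Double the effort for keyword search versus TAR at Top 3,000.
    avg_double = beats(topic, "Avg Participant", TOP_6000, TOP_3000)
    best_double = beats(topic, "Best Participant", TOP_6000, TOP_3000)
    print(topic, equal_effort, avg_double, best_double)
```

With double the review effort, the best participant beats both TAR variants for biology, neither for medical, and only SAL for law (where the result ties CAL).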

The large number of people attempting the biology topic was probably due to it being the default, and I illustrated how to use the software with that topic.

One might wonder whether the participants could have done better if they had more than 20 minutes to work on their queries.  The graphs below show the highest recall achieved by any participant as a function of time.  You can see that results improved rapidly during the first 10 minutes, but it became hard to make much additional progress beyond that point.  Also, over half of the audience continued to submit queries after the 20 minute contest, while I was giving the remainder of the presentation.  40% of the queries were submitted during the first 10 minutes, 40% were submitted during the second 10 minutes, and 20% were submitted while I was talking.  Since there were roughly the same number of queries submitted in the second 10 minutes as the first 10 minutes, but much less progress was made, I think it is safe to say that time was not a big factor in the results.

In summary, even with a large pool of participants, ample time, and the ability to hone search queries based on instant feedback, nobody was able to generate a better production than TAR when the same amount of review effort was expended.  It seems fair to say that keyword search often requires twice as much document review to achieve a production that is as good as what you would get with TAR.

# Highlights from Ipro Tech Show 2019

Ipro renamed their conference from Ipro Innovations to the Ipro Tech Show this year.  As always, it was held at the Talking Stick Resort in Arizona and it was very well organized.  It started with a reception on April 29th that was followed by two days of talks.  There were also training days bookending the conference on April 29th and May 2nd.  After the keynote on Tuesday morning, there were five simultaneous tracks for the remainder of the conference, including a lot of hands-on work in computer labs.  I was only able to attend a few of the talks, but I’ve included my notes below. You can find my full set of photos here.  Videos and slides from the presentations are available here.

Dean Brown, who has been Ipro’s CEO for eight months, opened the conference with some information about himself and where the company is headed.  He mentioned that the largest case in a single Ipro database so far was 11 petabytes from 400 million documents.  Q1 2019 was the best quarter in the company’s history, and they had a 98% retention rate.  They’ve doubled spending on development and other departments.

Next, there was a panel where three industry experts discussed artificial intelligence.   AI can be used to analyze legal bills to determine which charges are reasonable.  Google uses AI to monitor and prohibit behaviors within the company, such as stopping your account from being used to do things when you are supposed to be away.  Only about 5% of the audience said they were using TAR.  It was hypothesized that this is due to FRCP 26(g)’s requirement to certify the production as complete and correct.  Many people use Slack instead of e-mail, and dealing with that is an issue for e-discovery.  CLOC was mentioned as an organization helping corporations get a handle on legal spending.

The keynote was given by Kevin Surace, and mostly focused on AI.  You need good data and have to be careful about spurious correlations in the data (he showed various examples that were similar to what you find here).  An AI can watch a video and supplement it with text explaining what the person in the video is doing.  One must be careful about fast changing patterns and black swan events where there is no data available to model.  Doctors are being replaced by software that is better informed about the most recent medical research.  AI can review an NDA faster and more accurately than an attorney.  There is now a news channel in China using an AI news anchor instead of a human to deliver the news.  With autonomous vehicles, transportation will become free (supported by ads in the vehicle).  AI will have an impact 100 times larger than the Internet.

I gave a talk titled “Technology: The Cutting Edge and Where We’re Headed” that focused on AI.  I started by showing the audience five pairs of images from WhichFaceIsReal.com and challenged them to determine which face was real and which was generated by an AI.  When I asked if anyone got all five right, I only saw one person raise their hand.  When I asked if anyone got all five wrong, I saw three hands go up.  Admittedly, I picked image pairs that I thought were particularly difficult, but the result is still a little scary.

I also gave a talk titled “TAR Versus Keyword Challenge” where I challenged the audience to construct a keyword search that worked better than technology-assisted review.  The format of this exercise was very different from previous iterations, making it easy for participants to test and hone their queries.  We had 1,924 queries submitted by 42 participants.  They achieved the highest recall levels seen so far, but still couldn’t beat TAR.  A detailed analysis is available here.

# Misleading Metrics and Irrelevant Research (Accuracy and F1)

If one algorithm achieved 98.2% accuracy while another had 98.6% for the same task, would you be surprised to find that the first algorithm required ten times as much document review to reach 75% recall compared to the second algorithm?  This article explains why some performance metrics don’t give an accurate view of performance for ediscovery purposes, and why that makes a lot of research utilizing such metrics irrelevant for ediscovery.

The key performance metrics for ediscovery are precision and recall.  Recall, R, is the percentage of all relevant documents that have been found.  High recall is critical to defensibility.  Precision, P, is the percentage of documents predicted to be relevant that actually are relevant.  High precision is desirable to avoid wasting time reviewing non-relevant documents (if documents will be reviewed to confirm relevance and check for privilege before production).  In other words, precision is related to cost.  Specifically, 1/P is the average number of documents you’ll have to review per relevant document found.  When using technology-assisted review (predictive coding), documents can be sorted by relevance score and you can choose any point in the sorted list and compute the recall and precision that would be achieved by treating documents above that point as being predicted to be relevant.  One can plot a precision-recall curve by doing precision and recall calculations at various points in the sorted document list.
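As a rough illustration of how a precision-recall curve is built from a sorted document list, here is a minimal Python sketch.  The function name `precision_recall_points` and the toy relevance labels are hypothetical; a real evaluation would use the known relevance of a large document population.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) at every cutoff in a ranked list.

    ranked_relevance: 1 (relevant) or 0 (not relevant) for each document,
    sorted from highest to lowest relevance score.  Assumes the list covers
    the full population, so its sum is the total number of relevant documents.
    """
    total_relevant = sum(ranked_relevance)
    points = []
    found = 0
    for reviewed, is_relevant in enumerate(ranked_relevance, start=1):
        found += is_relevant
        recall = found / total_relevant  # fraction of relevant docs found
        precision = found / reviewed     # fraction of reviewed docs that are relevant
        points.append((recall, precision))
    return points


# Hypothetical toy ranking (not real data).
ranked = [1, 1, 0, 1, 0, 0, 1, 0]
points = precision_recall_points(ranked)
for recall, precision in points:
    print(f"R={recall:.2f}  P={precision:.2f}  docs per relevant={1 / precision:.2f}")
```

Plotting precision against recall for these points gives the curve; reading off the precision at a fixed recall (say 75%) is what allows a cost comparison between algorithms.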

The precision-recall curve to the right compares two different classification algorithms applied to the same task.  To do a sensible comparison, we should compare precision values at the same level of recall.  In other words, we should compare the cost of reaching equally good (same recall) productions.  Furthermore, the recall level where the algorithms are compared should be one that is sensible for ediscovery — achieving high precision at a recall level a court wouldn’t accept isn’t very useful.  If we compare the two algorithms at R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%.  In other words, if you sort by relevance score with the two algorithms and review documents from top down until 75% of the relevant documents are found, you would review 15.2 documents per relevant document found with 1-NN and 1.4 documents per relevant document found with 40-NN.  The 1-NN algorithm would require over ten times as much document review as 40-NN.  1-NN has been used in some popular TAR systems.  I explained why it performs so badly in a previous article.
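The arithmetic behind that comparison is just 1/P evaluated at the two precision values quoted above.  A small sketch (the function name is my own):

```python
def docs_reviewed_per_relevant(precision):
    """Average number of documents reviewed per relevant document found (1/P)."""
    return 1.0 / precision


# Precision at 75% recall, as quoted in the text.
p_1nn = 0.066
p_40nn = 0.704

cost_1nn = docs_reviewed_per_relevant(p_1nn)
cost_40nn = docs_reviewed_per_relevant(p_40nn)
ratio = cost_1nn / cost_40nn  # how much more review 1-NN needs at equal recall

print(f"1-NN:  {cost_1nn:.1f} documents per relevant document")
print(f"40-NN: {cost_40nn:.1f} documents per relevant document")
print(f"1-NN requires {ratio:.1f} times as much review")
```

Because both algorithms are evaluated at the same recall, the number of relevant documents found is identical, so the ratio of the 1/P values is also the ratio of total review effort.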

There are many other performance metrics, but they can be written as a mixture of precision and recall (see Chapter 7 of the current draft of my book).  Anything that is a mixture of precision and recall should raise an eyebrow — how can you mix together two fundamentally different things (defensibility and cost) into a single number and get a useful result?  Such metrics imply a trade-off between defensibility and cost that is not based on reality.  Research papers that aren’t focused on ediscovery often use such performance measures and compare algorithms without worrying about whether they are achieving the same recall, or whether the recall is high enough to be considered sufficient for ediscovery.  Thus, many conclusions about algorithm effectiveness simply aren’t applicable for ediscovery because they aren’t based on relevant metrics.

One popular metric is accuracy, which is the percentage of predictions that are correct.  If a system predicts that none of the documents are relevant and prevalence is 10% (meaning 10% of the documents are relevant), it will have 90% accuracy because its predictions were correct for all of the non-relevant documents.  If prevalence is 1%, a system that predicts none of the documents are relevant achieves 99% accuracy.  Such incredibly high numbers for algorithms that fail to find anything!  When prevalence is low, as it often is in ediscovery, accuracy makes everything look like it performs well, including algorithms like 1-NN that can be a disaster at high recall.  The graph to the right shows the accuracy-recall curve that corresponds to the earlier precision-recall curve (prevalence is 2.633% in this case), showing that it is easy to achieve high accuracy with a poor algorithm by evaluating it at a low recall level that would not be acceptable for ediscovery.  The maximum accuracy achieved by 1-NN in this case was 98.2% and the max for 40-NN was 98.6%.  In case you are curious, the relationship between accuracy, precision, and recall is:
$ACC = 1 - \rho (1 - R) - \rho R (1 - P) / P$
where $\rho$ is the prevalence.
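
To see how the formula behaves, here is a small Python sketch that evaluates it. The function name is hypothetical, and the special case for R = 0 is an assumption I've added: a system that predicts nothing to be relevant has no false positives, so the last term is taken to be zero even though P is undefined there.

```python
def accuracy(precision, recall, prevalence):
    """ACC = 1 - rho*(1 - R) - rho*R*(1 - P)/P, with all inputs as proportions."""
    false_negatives = prevalence * (1 - recall)
    if recall == 0:
        # Nothing predicted relevant: no false positives (assumed convention).
        return 1 - false_negatives
    false_positives = prevalence * recall * (1 - precision) / precision
    return 1 - false_negatives - false_positives


# Predicting that nothing is relevant at 10% and 1% prevalence.
acc_nothing_10 = accuracy(precision=1.0, recall=0.0, prevalence=0.10)
acc_nothing_1 = accuracy(precision=1.0, recall=0.0, prevalence=0.01)

# The two algorithms at R = 75%, using the precision values and prevalence from the text.
acc_1nn = accuracy(precision=0.066, recall=0.75, prevalence=0.02633)
acc_40nn = accuracy(precision=0.704, recall=0.75, prevalence=0.02633)

print(f"predict nothing, 10% prevalence: {acc_nothing_10:.3f}")
print(f"predict nothing, 1% prevalence:  {acc_nothing_1:.3f}")
print(f"1-NN at R=75%:  {acc_1nn:.3f}")
print(f"40-NN at R=75%: {acc_40nn:.3f}")
```

With these inputs, 1-NN's accuracy at 75% recall comes out well below the accuracy of predicting nothing at all, which is another way of seeing that maximizing accuracy pushes a poor algorithm toward low recall.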

Another popular metric is the F1 score.  I’ve criticized its use in ediscovery before.  The relationship to precision and recall is:
$F_1 = 2 P R / (P + R)$
The F1 score lies between the precision and the recall, and is closer to the smaller of the two.  As far as F1 is concerned, 30% recall with 90% precision is just as good as 90% recall with 30% precision (both give F1 = 0.45) even though the former probably wouldn’t be accepted by a court and the latter would.   F1 cannot be large at small recall, unlike accuracy, but it can be moderately high at modest recall, making it possible to achieve a decent F1 score even if performance is disastrously bad at the high recall levels demanded by ediscovery.  The graph to the right shows that 1-NN manages to achieve a maximum F1 of 0.64, which seems pretty good compared to the 0.73 achieved by 40-NN, giving no hint that 1-NN requires ten times as much review to achieve 75% recall in this example.
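
The symmetry of F1 is easy to verify numerically. This brief Python sketch (the function name is my own) computes F1 for the two scenarios mentioned above and shows that swapping precision and recall leaves the score unchanged, even though the review cost 1/P and the recall are very different.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


low_recall = f1_score(precision=0.90, recall=0.30)   # cheap review, but low recall
high_recall = f1_score(precision=0.30, recall=0.90)  # costlier review, high recall

print(f"P=90%, R=30%: F1 = {low_recall:.2f}")
print(f"P=30%, R=90%: F1 = {high_recall:.2f}")
```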

Hopefully this article has convinced you that it is important for research papers to use the right metric, specifically precision (or review effort) at high recall, when making algorithm comparisons that are useful for ediscovery.