Tag Archives: TAR

Gain Curves

You may already be familiar with the precision-recall curve, which describes the performance of a predictive coding system.  Unfortunately, the precision-recall curve doesn’t (normally) display any information about the cost of training the system, so it isn’t convenient when you want to compare the effectiveness of different training methodologies.  This article looks at the gain curve, which is better suited for that purpose.

The gain curve shows how the recall achieved depends on the number of documents reviewed (slight caveat to that at the end of the article).  Recall is the percentage of all relevant documents that have been found.  High recall is important for defensibility.  Here is an example of a gain curve (click to enlarge):

gain_curve_single

The first 12,000 documents reviewed in this example are randomly selected documents used to train the system.  Prevalence is very low in this case (0.32%), so finding relevant documents using random selection is hard.  The system needs to be exposed to a large enough number of relevant training documents for it to learn what they look like so it can make good predictions for the relevance of the remaining documents.

After the 12,000 training documents are reviewed the system orders the remaining documents to put the ones that are most likely to be relevant (based on patterns detected during training) at the top of the list.  To distinguish the training phase from the review phase I’ve shown the training phase as a solid line and review phase as a dashed line.  Review of the remaining documents starts at the top of the sorted list.  The gain curve is very steep at the beginning of the review phase because most of the documents being reviewed are relevant, so they have a big impact on recall.  As the review progresses the gain curve becomes less steep because you end up reviewing documents that are less likely to be relevant.  Review proceeds until a desired level of recall, such as 75% (the horizontal dotted line), is achieved.  The goal is to find the system and workflow that achieves the recall target at the lowest cost (i.e., the one that crosses the dotted line farthest to the left, with some caveats below).

What is the impact of using the same system with a larger or smaller number of randomly selected training documents?  This figure shows the gain curves for 9,000 and 15,000 training documents in addition to the 12,000 training document curve seen earlier:

gain_curve_multi

If the goal is to reach 75% recall, 12,000 is the most efficient option among the three considered because it crosses the horizontal dotted line with the least document review.  If the target was a lower level of recall, such as 70%, 9,000 training documents would be a better choice.  A larger number of training documents usually leads to better predictions (the gain curve stays steep longer during the review phase), but there is a point where the improvement in the predictions isn’t worth the cost of reviewing additional training documents.

The discussion above assumed that the cost of reviewing a document during the training phase is the same as the cost of reviewing a document during the review phase.  That will not be the case if expensive subject matter experts are used to review the training documents and low-cost contract reviewers are used for the review phase.  In that situation, the optimal result is less straightforward to identify from the gain curve.

In some situations it may be possible produce documents without reviewing them if there is no concern about disclosing privileged documents (because there are none or because they are expected to be easy to identify by looking at things like the sender/recipient email address) or non-relevant documents (because there is no concern about them containing trade secrets or evidence of bad acts not covered by the current litigation).  When it is okay to produce documents without reviewing them, the document review associated with the dashed part of the curve can be eliminated in whole or in part.  For example, documents predicted to be relevant with high confidence may be produced without review (unless they are identified as potential privileged), whereas documents with a lower likelihood of being relevant might be reviewed to avoid disclosing too many non-relevant documents.  Again, the gain curve would not show the optimal choice in a direct way–you would need to balance the potential harm (even if small) of producing non-relevant documents against the cost of additional training.

The predictive coding process described in this article, random training documents followed by review (with no additional learning by the algorithm), is sometimes known as Simple Passive Learning (SPL), which is one example of a TAR 1.0 workflow.  To determine the optimal point to switch from training to review with TAR 1.0, a random set of documents known as a control set is reviewed and used to monitor learning progress by comparing the predictions for the control set documents to their actual relevance tags.  Other workflows and analysis of their efficiency via gain curves will be the subject of my next article.

TAR 3.0 and Training of Predictive Coding Systems (Webinar)

A recording of the webinar described below is available here.

Bill Dimm will workflow_cluster_CALbe giving a webinar on training of predictive coding systems and the new TAR 3.0 workflow (not to be confused with Ralph Losey’s Predictive Coding 3.0).  This webinar should be accessible to those that are new to predictive coding while also providing new insights for those who are more experienced.  Expect lots of graphs and animations.  It will be held on December 10, 2015 at 1:00pm EST, and it will be recorded.  Those who register will be able view the video during the three days following the webinar, so please register even if the live event doesn’t fit into your schedule.

RECORDED WEBINAR

SLIDES FROM WEBINAR

Disclosing Seed Sets and the Illusion of Transparency

There has been a great deal of debate about whether it is wise or possibly even required to disclose seed sets (training documents, possibly including non-relevant documents) when using predictive coding.  This article explains why disclosing seed sets may provide far less transparency than people think.

seed_sproutThe rationale for disclosing seed sets seems to be that the seed set is the input to the predictive coding system that determines which documents will be produced, so it is reasonable to ask for it to be disclosed so the requesting party can be assured that they will get what they wanted, similar to asking for a keyword search query to be disclosed.

Some argue that the seed set may be work product (if attorneys choose which documents to include rather than using random sampling).  Others argue that disclosing non-relevant training documents may reveal a bad act other than the one being litigated.  If the requesting party is a  competitor, the non-relevant training documents may reveal information that helps them compete.  Even if the producing party is not concerned about any of the issues above, it may be reluctant to disclose the seed set due to fear of establishing a precedent it may not want to be stuck with in future cases having different circumstances.

Other people are far more qualified to debate the legal and strategic issues than I am.  Before going down that road, I think it’s worthwhile to consider whether disclosing seed sets really provides the transparency that people think.  Some reasons why it does not:

  1. If you were told that the producing party would be searching for evidence of data destruction by doing a keyword search for “shred AND documents,” you could examine that query and easily spot deficiencies.  A better search might be “(shred OR destroy OR discard OR delete) AND (documents OR files OR records OR emails OR evidence).”  Are you going to review thousands of training documents and realize that one relevant training document contains the words “shred” and “documents” but none of the training documents contain “destroy” or “discard” or “files”?  I doubt it.
  2. You cannot tell whether the seed set is sufficient if you don’t have access to the full document population.  There could be substantial pockets of important documents that are not represented in the seed set–how would you know?  The producing party has access to the full population, so they can do statistical sampling to measure the quality (based on number of relevant documents, not their importance) of the predictions the training set will produce.  The requesting party cannot do that–they have no way of assessing adequacy of the training set other than wild guessing.
  3. You cannot tell whether the seed set is biased just by looking at it.  Again, if you don’t have access to the full population, how could you know if some topic or some particular set of keywords is under or over represented?  If training documents were selected by searching for “shred AND Friday,” the system would see both words on all (or most) of the relevant documents and would think both words are equally good indicators of relevance.  Would you notice that all the relevant training documents happen to contain the word “Friday”?  I doubt it.
  4. Suppose you see an important document in the seed set that was correctly tagged as being relevant.  Can you rest assured that similar documents will be produced?  Maybe not.  Some classification algorithms can predict a document to be non-relevant when it is a near-dupe or even an exact dupe of a relevant training document.  I described how that could happen in this article.  How can you claim that the seed set provides transparency if you don’t even know if a near-dupe of a relevant training document will be produced?
  5. Poor training doesn’t necessarily mean that relevant documents will be missed.  If a relevant document fails to match a keyword search query, it will be missed, so ensuring that the query is good is important.  Most predictive coding systems generate a relevance score for each document, not just a binary yes/no relevance prediction like a search query.  Whether or not the predictive coding system produces a particular relevant document doesn’t depend solely on the training set–the producing party must choose a cutoff point in the ranked document list that determines which documents will be produced.  A poorly trained system can still achieve high recall if the relevance score cutoff is chosen to be low enough.  If the producing party reviews all documents above the relevance score cutoff before producing them, a poorly trained system will require a lot more document review to achieve satisfactory recall.  Unless there is talk of cost shifting, or the producing party is claiming it should be allowed to stop at modest recall because reaching high recall would be too expensive, is it really the requesting party’s concern if the producing party incurs high review costs by training the system poorly?
  6. One might argue that the producing party could stack the seed set with a large number of marginally relevant documents while avoiding really incriminating documents in order to achieve acceptable recall while missing the most important documents.  Again, would you be able to tell that this was done by merely examining the seed set without having access to the full population?  Is the requesting party going to complain that there is no smoking gun in the training set?  The producing party can simply respond that there are no smoking guns in the full population.
  7. The seed set may have virtually no impact on the final result.  To appreciate this point we need to be more specific about what the seed set is, since people use the term in many different ways (see Grossman & Cormack’s discussion).  If the seed set is taken to be a judgmental sample (documents selected by a human, perhaps using keyword search) that is followed by several rounds of additional training using active learning, the active learning algorithm is going to have a much larger impact on the final result than the seed set if active learning contributes a much larger number of relevant documents to the training.  In fact, the seed set could be a single relevant document and the result would have almost no dependence on which relevant document was used as the seed (see the “How Seed Sets Influence Which Documents are Found” section of this article).  On the other hand, if you take a much broader definition of the seed set and consider it to be all documents used for training, things get a little strange if continuous active learning (CAL) is used.  With CAL the documents that are predicted to be relevant are reviewed and the reviewers’ assessments are fed back into the system as additional training to generate new predictions.  This is iterated many times.  So all documents that are reviewed are used as training documents.  The full set of training documents for CAL would be all of the relevant documents that are produced as well as all non-relevant documents that were reviewed along the way.  Disclosing the full set of training documents for CAL could involve disclosing a very large number of non-relevant documents (comparable to the number of relevant documents produced).

Trying to determine whether a production will be good by examining a seed set that will be input into a complex piece of software to analyze a document population that you cannot access seems like a fool’s errand.  It makes more sense to ask the producing party what recall it achieved and to ask questions to ensure that recall was measured sensibly.  Recall isn’t the whole story–it measures the number of relevant documents found, not their importance.  It makes sense to negotiate the application of a few keyword searches to the documents that were culled (predicted to be non-relevant) to ensure that nothing important was missed that could easily have been found.  The point is that you should judge the production by analyzing the system’s output, not the training data that was input.

Using Extrapolated Precision for Performance Measurement

This is a brief overview of my paper “Information Retrieval Performance Measurement Using Extrapolated Precision,” which I’ll be presenting on June 8th at the DESI VI workshop at ICAIL 2015 (slides now available here).  The paper provides a novel method for extrapolating a precision-recall point to a different level of recall, and advocates making performance comparisons by extrapolating results for all systems to the same level of recall if the systems cannot be evaluated at exactly the same recall (e.g., some predictive coding systems produce a binary yes/no prediction instead of a relevance score, so the user cannot select the recall that will be achieved).

High recall (finding most of the relevant documents) is important in e-discovery for defensibility.  High precision is desirable to ensure that there aren’t a lot of non-relevant documents mixed in with the relevant ones (i.e., high precision reduces the cost of review for responsiveness and privilege).  Making judgments about the relative performance of two predictive coding systems knowing only a single precision-recall point for each system is problematic—if one system has higher recall but lower precision for a particular task, is it the better system for that task?

There are various performance measures like the F1 score that combine precision and recall into a single number to allow performance comparisons.  Unfortunately, such measures often assume a trade-off between precision and recall that is not appropriate for e-discovery (I’ve written about problems with the  F1 score before).  To understand the problem, it is useful to look at how F1 varies as a function of the recall where it is measured.  Here are two precision-recall curves, with the one on the left being for an easy categorization task and the one on the right being for a hard task, with the F1 score corresponding to each point on the precision-recall curve superimposed:

f1_compare2If we pick a single point from the precision-recall curve and compute the value of F1 for that point, the resulting F1 is very sensitive to the precision-recall point we choose.  F1 is maximized at 46% recall in the graph on the right, which means that the trade-off between precision and recall that F1 deems to be reasonable implies that it is not worthwhile to produce more than 46% of the relevant documents for that task because precision suffers too much when you push to higher recall.  That is simply not compatible with the needs of e-discovery.  In e-discovery, the trade-off  between precision (cost) and recall required should be dictated by proportionality, not by some performance measure that is oblivious to the value of the case.  Other problems with the F1 score are detailed in the paper.

The strong dependence that F1 has on recall as we move along the precision-recall curve means that it is easy to draw wrong conclusions about which system is performing better when performance is measured at different levels of recall.  This strong dependence on recall occurs because the contours of equal F1 are not shaped like precision-recall curves, so a precision-recall curve will cut across many contours.   In order to have the freedom to measure performance at recall levels that are relevant for e-discovery (e.g., 75% or higher) without drawing wrong conclusions about which system is performing best, the paper proposes a performance measure that has constant-performance contours that are shaped like precision-recall curves, so the performance measure depends much less on the recall level where the measurement is made than F1 does. In other words, the proposed performance measure aims to be sensitive to how well the system is working while being insensitive to the specific point on the precision-recall curve where the measurement is made.  This graph compares the constant-performance contours for F1 to the measure proposed in the paper:

f1_x_contours

Since the constant-performance contours are shaped like typical precision-recall curves, we can view this measure as being equivalent to extrapolating the precision-recall point to some other target recall level, like 75%, by simply finding an idealized precision-recall curve that passes through the point and moving along that curve to the target recall.  This figure illustrates extrapolation of precision measurements for three different systems at different recall levels to 75% recall for comparison:

x_extrapolate

Finally, here is what the performance measure looks like if we evaluate it for each point in the two precision-recall curves from the first figure:

x_compare2

The blue performance curves are much flatter than the red F1 curves from the first figure, so the value is much less sensitive to the recall level where it is measured.  As an added bonus, the measure is an extrapolated estimate of the precision that the system would achieve at 75% recall, so it is inversely proportional to the cost of the document review needed (excluding training and testing) to reach 75% recall.  For more details, read the paper or attend my talk at DESI VI.

Can You Really Compete in TREC Retroactively?

I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football).  I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.

One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years.  Still, how could you do the comparison if you wanted to?  The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails.  Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.

Stratified sampling means that the documents are separated into mutually-exclusive buckets called “strata.”  To the degree that stratification manages to put similar things into the same stratum, it can produce better statistical estimates (smaller uncertainty for a given amount of document review).  The TREC Legal Track for 2009 created a stratum containing documents that all participants agreed were relevant.  It also created four strata containing documents that all but one participant predicted were relevant (there were four participants, so one stratum for each dissenting participant).  There were six strata where two participants agreed on relevance, and four strata where only one of the four participants predicted the documents were relevant.  Finally, there was one stratum containing documents that all participants predicted were non-relevant, which was called the “All-N” stratum.  So, for each stratum a particular participant either predicted that all of the documents were relevant or they predicted that all of the documents were non-relevant.  You can view details about the strata in table 21 on page 39 here.  Here is an example of what a stratification might look like for just two participants (the number of documents shown and percentage that are relevant may differ from the actual data):

stratification

A random subset of documents from each stratum was chosen and reviewed so that the percentage of the documents in the stratum that were relevant could be estimated.  Multiplying that percentage by the number of documents in the stratum gives an estimate for the number of relevant documents in the stratum.  Combining the results for the various strata allows precision and recall estimates to be computed for each participant.  How could this be done for a team that didn’t participate?  Before presenting some ideas, it will be useful to have some notation:

N[i] = number of documents in stratum i
n[i] = num docs in i that were assessed by TREC
n+[i] = num docs in i that TREC assessed as relevant
V[i] = num docs in i that vendor predicted were relevant
v[i] = num docs in i that vendor predicted were relevant and were assessed by TREC
v+[i] = num docs in i that vendor predicted were relevant and assessed as relevant by TREC

sampling

To make some of the discussion below more concrete, I’ll provide formulas for computing the number of true positives (TP), false positives (FP), and false negatives (FN).  The recall and precision can then be computed from:

R = TP / (TP + FN)
P = TP / (TP + FP)

Here are some ideas I came up with:

1) They could have checked to see which strata the documents they predicted to be relevant fell into and applied the percentages TREC computed to their data.  The problem is that since they probably didn’t identify all of the documents in a stratum as being relevant the percentage of documents that were estimated to be relevant for the stratum by TREC wouldn’t really be applicable.  If their system worked really well, they may have only predicted that the truly relevant documents from the stratum were relevant.  If their system worked badly, their system may have predicted that only the truly non-relevant documents from the stratum were relevant.  This approach could give estimates that are systematically too low or too high.  Here are the relevant formulas (summing over strata, i):

TP = Sum{ V[i] * n+[i] / n[i] }
FP = Sum{ V[i] * (1 – n+[i]/n[i]) }
FN = Sum{ (N[i] – V[i]) * n+[i] / n[i] }

2) Instead of using the percentages computed by TREC, they could have computed their own percentages by looking at only the documents in the stratum that they predicted were relevant and were reviewed by TREC to give a relevance determination.  This would eliminate the possible bias from approach (1), but it also means that the percentages would be computed from a smaller sample, so the uncertainty in the percentage that are relevant would be bigger.  The vendor didn’t provide confidence intervals for their results.  Here is how the computation would go:

TP = Sum{ V[i] * v+[i] / v[i] }
FP = Sum{ V[i] * (1 – v+[i]/v[i]) }
FN = Sum{ (N[i] – V[i]) * (n+[i] – v+[i]) / (n[i] – v[i]) }

It’s possible that for some strata there would be no overlap between the documents TREC assessed and the documents the vendor predicted to be relevant since TREC typically assessed only about 4% of each stratum for Topic 207 (except the All-N stratum, where they assessed only 0.46%).  This approach wouldn’t work for those strata since v[i] would be 0.  For strata where v[i] is 0, one might use approach (1) and hope it isn’t too wrong.

3) A more sophisticated tweak on (2) would be to use the ratio n+[i]/n[i] from (1) to generate a Bayesian prior probability distribution for the proportion of documents predicted by the vendor to be relevant that actually are relevant, and then use v+[i] and v[i] to compute a posterior distribution for that proportion and use the mean of that distribution instead of v+[i]/v[i] in the computation. The idea is to have a smooth interpolation between using n+[i]/n[i] and using v+[i]/v[i] as the proportion of documents estimated to be relevant, where the interpolation would be closer to v+[i]/v[i] if v[i] is large (i.e., if there is enough data for v+[i]/v[i] to be reasonably accurate).  The result would be sensitive to choices made in creating the Bayesian prior (i.e., how much variance to give the probability distribution), however.

4) They could have ignored all of the documents that weren’t reviewed in TREC (over 500,000 of them) and just performed their predictions and analysis on the 3,709 documents that had relevance assessments (training documents should come from the set TREC didn’t assess and should be reviewed by the vendor to simulate actual training at TREC being done by the participants).  It would be very important to weight the results to compensate for the fact that those 3,709 documents didn’t all have the same probability of being selected for review.  TREC oversampled the documents that were predicted to be relevant compared to the remainder (i.e., the number of documents sampled from a stratum was not simply proportional to the number of documents in the stratum), which allowed their stratification scheme to do a good job of comparing the participating teams to each other at the expense of having large uncertainty for some quantities like the total number of relevant documents.  The prevalence of relevant documents in the full population was 1.5%, but 9.0% of the documents having relevance assessments were relevant.  Without weighting the results to compensate for the uneven sampling, you would be throwing away over half a million non-relevant documents without giving the system being tested the opportunity to incorrectly predict that some of them are relevant, which would lead to an inflated precision estimate.  The expression “shooting fish in a barrel” comes to mind.  Weighting would be accomplished by dividing by the probability of the document having been chosen (after this article was published I learned that this is called the Horvitz-Thompson estimator, and it is what the TREC evaluation toolkit uses), which is just n[i]/N[i], so the computation would be:

TP = Sum{ (N[i]/n[i]) * v+[i] }
FP = Sum{ (N[i]/n[i]) * (v[i] – v+[i]) }
FN = Sum{ (N[i]/n[i]) * (n+[i] – v+[i]) }

Note that if N[i]/n[i] is equal to V[i]/v[i], which is expected to be approximately true since the subset of a stratum chosen for assessment by TREC is random, the result would be equal to that from (2).  If N[i]/n[i] is not equal to V[i]/v[i] for a stratum, we would have the disturbing result that the estimate for TP+FP for that stratum would not equal the number of documents the vendor predicted to be relevant for that stratum, V[i].

5) The vendor could have ignored the TREC relevance determinations, simply doing their own.  That would be highly biased in the vendor’s favor because there would be a level of consistency between relevance determinations for the training data and testing data that did not exist for TREC participants.  At TREC the participants made their own relevance determinations to train their systems and a separate set of Topic Authorities made the final relevance judgments that determined the performance numbers.  To the degree that participants came to different conclusions about relevance compared to the Topic Authorities, their performance numbers would suffer.  A more subtle problem with this approach is that the vendor’s interpretation of the relevance criteria would inevitably be somewhat different from that of TREC assessors (studies have shown poor agreement between different review teams), which could make the classification task either easier or harder for a computer.  As an extreme example, if the vendor took all documents containing the word “football” to be relevant and all other documents to be non-relevant, it would be very easy for a predictive coding system to identify that pattern and achieve good performance numbers.

Approaches (1)-(4) would all give the same results for the original TREC participants because for each stratum they would either have V[i]=0 (so v[i]=0 and v+[i]=0) or they would have V[i]=N[i] (so v[i]=n[i] and v+[i]=n+[i]).  The approaches differ in how they account for the vendor predicting that only a subset of a stratum is relevant.  None of the approaches described are great.  Is there a better approach that I missed? TREC designed their strata to make the best possible comparisons between the participants.  It’s hard to imagine how an analysis could be as accurate for a system that was not taken into account in the stratification process.  If a vendor is tempted to make such comparisons, they should at least disclose their methodology and provide confidence intervals on their results so prospective clients can determine whether the performance numbers are actually meaningful.

Comments on Rio Tinto v. Vale and Sample Size

Judge Peck recently issued an opinion in Rio Tinto PLC v. Vale SA, et al, Case 1:14-cv-03042-RMB-AJP where he spent some time reflecting on the state of court acceptance of technology-assisted review (a.k.a. predictive coding).  The quote that will surely grab headlines is on page 2: “In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”  He lists the relevant cases and talks a bit about transparency and disclosing seed sets.  It is certainly worth reading.

Both parties in Rio Tinto v. Vale have agreed to disclose all non-privileged documents, including non-responsive documents, from their control sets, seed sets, and training sets. Judge Peck accepts their protocol because they both agree to it, but hints that disclosing seed sets may not really be necessary (p. 6, “…requesting parties can insure that training and review was done appropriately by other means…”).

I find one other aspect of the protocol the litigants proposed to be worthy of comment.  They make a point of defining a “Statistically Valid Sample” on p. 11 to be one that gives +/- 2% margin of error at 95% confidence, and even provide an equation to compute the sample size in footnote 2.  Their equation gives a sample size of at most 2,395 documents, depending on prevalence.  They then use the “Statistically Valid Sample” term in contexts where it isn’t (as they’ve defined it) directly appropriate.  I don’t know if this is just sloppiness (missing details about what they actually plan to do) or a misunderstanding of statistics.

For example, section 4.a.ii on p. 13 contemplates culling before application of predictive coding, and says they will “Review a Statistically Valid Sample from the Excluded Documents.”  Kudos to them for actually measuring how many relevant documents they are culling instead of just assuming that keyword search results should be good enough without any analysis, but 2,395 documents is not the right sample size.  The more documents you are culling, the more precisely you need to know what proportion of them were relevant in order to have a reasonably precise value for the number of relevant documents culled, which is what matters for computing recall.  In other words, a +/- 2% measurement on the culled set does not mean +/- 2% for recall.  I described a similar situation in more detail in my Predictive Coding Confusion article under the heading “Beware small percentages of large numbers.”  My eRecall: No Free Lunch article also discusses similar issues.

Section 4.b on p. 13 says that the control set will be a Statistically Valid Sample that will be used to measure prevalence.  They explain in a separate letter to Judge Peck on p. 9 that the control set will be used to track progress by estimating precision and recall. Do they intend to use 2,395 (or fewer) documents for the control set?  Suppose only one of the 2,395 documents is actually relevant.  That would give a prevalence estimate of 0.0011% to 0.2321% with 95% confidence (via this calculator), which is certainly better than the required +/- 2%, but it is useless for tracking progress because the uncertainty is huge compared to the value itself.  If they had a million documents, the estimate would tell them that somewhere between 11 and 2,321 of them are relevant.  So, if they found 11 relevant documents with their predictive coding software they would estimate that they achieved somewhere between 0.5% and 100% recall.  To look at it a little differently, if they looked at their system’s prediction for the control set they would find that it either correctly predicted that the one relevant document was relevant (100% recall) or they would find that it was predicted incorrectly (0% recall), with dumb luck being a big factor in which result they got.

Maybe they intended that the control set contain 2,395 relevant documents, which would give a recall estimate accurate to +/- 2% with 95% confidence (more precise than really seems worthwhile for a control set) by measuring the percentage of relevant documents in the control set that are predicted correctly.  If prevalence is 10%, the control set would need to contain about 23,950 documents to have 2,395 that are relevant.  If prevalence is 1%, the control set would require about 239,500 documents.  That sure seems like a lot of documents to review just to create a control set.  The point is that it is the number of relevant documents in the control set, not the number of documents, that determines how precisely the control set can measure recall.  Their protocol does say that the requesting party will have ten business days to check the control set if it is more than 4,000 documents, so it does seem that they’ve contemplated the possibility of using more than 2,395 documents in the control set, but the details of what they are really planning to do are missing.  Of course, the control set is there to help the producing party optimize their process, so it is their loss if they get it wrong (assuming there is separate testing that would detect the problem, as described in section 4.f).

Finally, section 4.f on p. 16 talks about taking a Statistically Valid Sample from the documents that are predicted to be non-relevant to estimate the number of relevant documents that were missed by predictive coding, leading to a recall estimate.  This has the same problem as the culling in section 4.a.ii — the size of the sample that is required to achieve a desired level of uncertainty in the recall depends on the size of the set of documents being culled, whether the culling is due to keyword searching before applying predictive coding or whether the culling is due to discarding documents that the predictive coding system predicts are non-relevant.

If the goal is to arrive at a reasonably precise estimate of recall (and, I’m certainly not arguing that +/- 2% should be required), it is important to keep track of how the uncertainty from each sample propagates through to the final recall result (e.g. it may be multiplied by some large number of culled documents) when choosing an appropriate sample size.  I may be nitpicking, but it strikes me as odd to lay out a specific formula for calculating sample size and then not mention that it cannot be applied directly for the sampling that is actually being contemplated.