Validating the Validation Set

Legaltech News
5-minute read | October 1, 2016

Predictive coding is becoming increasingly prevalent in fulfilling discovery obligations in litigation and in response to regulatory inquiries. As the process gains acceptance, parties, regulators and courts debate whether producing parties should be required to disclose documents and coding decisions used to “train” the predictive coding software.

However, the focus on these training materials, known as the “seed set,” has shifted attention away from the more important subset of documents known as the “validation set.” The validation set, which essentially functions as an answer key, ultimately ensures the quality of the predictive coding results and should be the focus of parties, courts and regulators in determining whether a party utilizing predictive coding has satisfied its discovery obligations.

The Importance of Predictive Coding

Predictive coding relies on an algorithm to code documents based on input received from human reviewers. While there are various ways to implement predictive coding, the process generally involves two separate subsets of the document collection. One is the seed set, which can be created randomly, through judgmental sampling, or through searches designed to capture the most relevant documents. The other, the validation set, should be a statistically significant random sample of the document collection.

Reviewers manually determine whether the documents in both subsets are relevant. Based on information gleaned from the seed set documents, the software predicts whether each of the remaining documents in the overall population, including the validation set, is relevant. The accuracy of the software’s predictions is then assessed by comparing its results to the manual determinations for each document in the validation set.

Originally published in Legaltech News; reprinted with permission