Predictive coding: precision and recall

PC (Predictive Coding) is a technique that takes a collection of possibly relevant documents and automatically determines which documents are relevant to the task at hand and which are not. The task is often a legal proceeding in which the relevant documents must be identified. Names that are often used in place of the term PC include the following.

TAR (Technology Assisted Review)
eDiscovery

In PC terminology, the following terms are used.

Precision is the fraction of the recalled (returned) documents that are actually relevant.
Recall is the fraction of all relevant documents in the collection that are recalled (returned) by the search.
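
As a minimal sketch of how these two measures are computed under the standard IR definitions (precision = relevant recalled / all recalled, recall = relevant recalled / all relevant), the following Python treats documents as ids in sets. The function name and the sample ids are hypothetical and are used only for illustration.

```python
def precision_recall(recalled, relevant):
    """Return (precision, recall) for two collections of document ids."""
    recalled = set(recalled)
    relevant = set(relevant)
    # Documents that were both recalled and are actually relevant.
    true_positives = len(recalled & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(recalled) if recalled else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents recalled, 5 documents actually relevant.
recalled = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
precision, recall = precision_recall(recalled, relevant)
print(precision, recall)
```

Here three of the four recalled documents are relevant, and three of the five relevant documents were recalled, so the two measures differ even though they share the same numerator.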

This is an important area in the field of IR (Information Retrieval).

In simple terms, PC attempts to divide the document search space into two parts:

"not recalled" as in "not relevant"
"recalled" as in "relevant"

The "not recalled" documents are deemed "not relevant". A good PC system should put on the order of 98% of documents into this category (assuming there is a large collection of documents at the start of the process). The false negatives are that subset of the "not recalled" documents that are actually relevant but were not identified as such. The goal of PC is to make this subset as small and unimportant as possible.

The "recalled" documents are deemed "relevant". A good PC system should put on the order of 2% of documents into this category. These documents still need to be reviewed by human reviewers to determine the actual relevancy.

A PC system attempts, for example, to require human review of only 2% of the total documents rather than 100% of the available documents.
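
The split into "recalled" and "not recalled" can be sketched as four counts plus the fraction of the collection that goes to human review. This is an illustrative sketch only: the function name, the collection size of 1000, and the id ranges are hypothetical numbers chosen so that 2% of the documents are recalled.

```python
def partition_counts(all_docs, recalled, relevant):
    """Count the four outcomes of a recalled / not-recalled split."""
    all_docs, recalled, relevant = set(all_docs), set(recalled), set(relevant)
    not_recalled = all_docs - recalled
    return {
        # Recalled and actually relevant.
        "true_positives": len(recalled & relevant),
        # Recalled but not actually relevant (wasted review effort).
        "false_positives": len(recalled - relevant),
        # Not recalled but actually relevant (missed documents).
        "false_negatives": len(not_recalled & relevant),
        # Not recalled and not relevant.
        "true_negatives": len(not_recalled - relevant),
        # Share of the collection that humans still have to review.
        "review_fraction": len(recalled) / len(all_docs),
    }


# Hypothetical collection: 1000 documents, 20 recalled, 20 actually relevant.
all_docs = range(1000)
recalled = range(20)
relevant = range(5, 25)
counts = partition_counts(all_docs, recalled, relevant)
print(counts)
```

In this made-up case humans review 2% of the collection, and the five missed documents are the subset that a PC system tries to keep as small as possible.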

In a legal case, there can be a lot of documents and reviewing them can be very time-consuming and thus very expensive.

The two traditional approaches to PC, which are not mutually exclusive, are the following.

top-down rule-based expert system approach sometimes referred to as knowledge engineering
bottom-up machine learning statistical approaches

In practice, aspects of both approaches can be used in a PC system.

A rule-based system has some aspect of "knowing" what the documents mean embedded into the rules. In machine learning, one can compare documents without an embedded aspect of "knowing" what the documents mean. Since machine learning techniques do not "know" what the documents mean, they require a learning set with which to start the comparisons. This is called seeding the system with known relevant documents. An approach is needed to obtain enough relevant material to form the basis of a good training sample. One way is to use a standard, suitable search query together with human review in order to obtain some known relevant documents.
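
The idea of seeding can be sketched as follows: a small set of known relevant documents forms a word-count profile, and the remaining documents are ranked by cosine similarity to that profile. This is a simplified stand-in for a real machine learning method, not a description of any particular PC product. The sample documents, the ids, and the choice of "d1" as the seed are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document collection.
documents = {
    "d1": "merger agreement draft sent to counsel",
    "d2": "lunch menu for friday",
    "d3": "counsel comments on merger agreement terms",
    "d4": "parking garage closed friday",
}

# Seed set: assumed to come from a search query confirmed by human review.
seed_ids = {"d1"}

# Build one combined word-count profile from the seed documents.
seed_profile = Counter()
for doc_id in seed_ids:
    seed_profile.update(vectorize(documents[doc_id]))

# Score every unreviewed document against the seed profile.
scores = {
    doc_id: cosine(seed_profile, vectorize(text))
    for doc_id, text in documents.items()
    if doc_id not in seed_ids
}

# Highest-scoring documents are the best candidates for "recalled".
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The comparison uses only shared words, so the program ranks "d3" first without "knowing" what a merger agreement is.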

There are two types of machine learning (i.e., bottom-up, statistical) systems.

Supervised learning
Unsupervised learning

In practice, a PC system can combine aspects of both supervised and unsupervised learning.

Since a complex system such as PC cannot envision every possibility, an actual system would incorporate aspects of a DSS (Decision Support System) which allows a human to intervene at appropriate points and, given feedback and diagnostics by the program, make "decisions" for the program to continue.
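
One way to picture the DSS aspect is a triage step in which the program decides the clear cases itself and hands the uncertain ones to a human. The sketch below assumes each document already has a relevance score between 0 and 1. The thresholds, the function names, and the `reviewer` callback are hypothetical.

```python
def triage(scores, ask_human, low=0.2, high=0.8):
    """Decide clear cases automatically; defer uncertain ones to a human."""
    decisions = {}
    for doc_id, score in scores.items():
        if score >= high:
            decisions[doc_id] = True       # confidently recalled
        elif score <= low:
            decisions[doc_id] = False      # confidently not recalled
        else:
            # Uncertain: show the score as a diagnostic and let a human decide.
            decisions[doc_id] = ask_human(doc_id, score)
    return decisions


asked = []


def reviewer(doc_id, score):
    """Stand-in for an interactive prompt; records what was asked."""
    asked.append(doc_id)
    return True


# Hypothetical scores for three documents.
scores = {"a": 0.95, "b": 0.50, "c": 0.05}
decisions = triage(scores, reviewer)
print(decisions, asked)
```

Only document "b" reaches the human in this example. Widening the band between `low` and `high` would send more documents for a human decision.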