Predictive coding: precision and recall
1. Predictive coding: precision and recall
PC (Predictive Coding) is a technique whose goal is to take a collection of possibly relevant documents and automatically determine which documents are relevant and which are not relevant to the task at hand. This task is often a legal proceeding whereby relevant documents need to be determined.
Names that are often used in place of the term PC include the following.
TAR (Technology Assisted Review)
eDiscovery
2. Precision and recall
In PC terminology, the following terms are used.
Precision is the proportion of the returned documents that are actually relevant (relevant hits vs. false positives).
Recall is the proportion of all the relevant documents that are returned by the search (relevant hits vs. relevant documents that were missed).
This is an important area in the field of IR (Information Retrieval).
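As a rough illustration of the two measures as they are usually defined in IR, the following minimal sketch computes them from two sets of document ids. The function name and the sample ids are made up for this example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for the returned document ids,
    measured against the ids of the truly relevant documents."""
    returned, relevant = set(returned), set(relevant)
    relevant_hits = len(returned & relevant)
    # Precision: what fraction of the returned documents are relevant?
    precision = relevant_hits / len(returned) if returned else 0.0
    # Recall: what fraction of the relevant documents were returned?
    recall = relevant_hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: 4 documents returned, 5 documents truly relevant.
returned_ids = {1, 2, 3, 4}
relevant_ids = {2, 3, 4, 5, 6}
precision, recall = precision_recall(returned_ids, relevant_ids)
print(precision, recall)  # 0.75 0.6
```

Here three of the four returned documents are relevant (precision 0.75), and three of the five relevant documents were found (recall 0.6).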
3. Precision and recall
In simple terms, PC attempts to divide the document search space into two parts:
"not recalled" as in "not relevant"
"recalled" as in "relevant"
4. Not recalled
The "not recalled" documents are deemed "not relevant". A good PC system should put on the order of 98% of documents into this category (assuming there is a large collection of documents at the start of the process). The false negatives are that subset of the "not recalled" documents that are actually relevant but not identified as such. The goal of PC is to make this subset as small and unimportant as possible.
5. Recalled
The "recalled" documents are deemed "relevant". A good PC system should put on the order of 2% of documents into this category. These documents still need to be reviewed by human reviewers to determine the actual relevancy.
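The split into "recalled" and "not recalled" can be sketched as a count of the four possible outcomes. This is only an illustration: in practice the true relevance of every document is not known, and the names and sample data here are hypothetical. A relevant document that ends up in "not recalled" is counted as a false negative, and a non-relevant document that ends up in "recalled" is counted as a false positive.

```python
from collections import Counter

def review_outcomes(all_docs, recalled, relevant):
    """Count how each document falls out of the recalled / not recalled split."""
    counts = Counter()
    for doc in all_docs:
        if doc in recalled:
            # Recalled documents go on to human review.
            counts["true_positive" if doc in relevant else "false_positive"] += 1
        else:
            # Not recalled documents are never looked at by a reviewer.
            counts["false_negative" if doc in relevant else "true_negative"] += 1
    return counts

# Hypothetical collection of 100 documents, 3 recalled, 3 truly relevant.
outcomes = review_outcomes(range(1, 101), recalled={1, 2, 3}, relevant={2, 3, 50})
print(outcomes["true_positive"], outcomes["false_positive"],
      outcomes["false_negative"], outcomes["true_negative"])  # 2 1 1 96
```

Document 50 is the case a PC system tries hardest to avoid: it is relevant but was never sent to a reviewer.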
6. PC system
A PC system attempts, for example, to require human review of only 2% of the total documents rather than 100% of the available documents.
In a legal case, there can be a lot of documents and reviewing them can be very time-consuming and thus very expensive.
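The saving can be put into numbers with a small sketch. The collection size below is a made-up figure, and the 2% is the example proportion used above, not a guaranteed result.

```python
def review_workload(total_docs, recalled_fraction):
    """Return (documents sent to human review, documents skipped)."""
    to_review = round(total_docs * recalled_fraction)
    return to_review, total_docs - to_review

# Hypothetical collection of one million documents, 2% recalled.
to_review, skipped = review_workload(1_000_000, 0.02)
print(to_review, skipped)  # 20000 980000
```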
7. Approaches
The two traditional approaches to PC, which are not mutually exclusive, are the following.
top-down rule-based expert system approach sometimes referred to as knowledge engineering
bottom-up machine learning statistical approaches
In practice, aspects of both approaches can be used in a PC system.
8. Rule-based systems
A rule-based system has some aspect of "knowing" what the documents mean embedded into the rules. In machine learning, one can compare documents without an embedded aspect of "knowing" what the documents mean.
Since machine learning techniques do not "know" what the documents mean, machine learning techniques require a learning set with which to start the comparisons. This is called seeding the system with known relevant documents.
An approach is needed to obtain enough relevant material to build a good training sample. One way is to use a standard, suitable search query together with human review in order to obtain some known relevant documents.
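One possible way to gather seed candidates is a plain keyword query whose hits are then passed to human reviewers for labeling. The sketch below shows only the query step. The function name, the whitespace tokenization, and the sample documents are assumptions made for illustration.

```python
def keyword_candidates(documents, query_terms):
    """Return the ids of documents containing at least one query term.

    documents: dict mapping a document id to its text.
    The hits are only candidates; a human reviewer still has to
    confirm which of them are actually relevant.
    """
    terms = {term.lower() for term in query_terms}
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]

# Hypothetical documents.
documents = {
    "a": "Merger agreement draft",
    "b": "Lunch menu for Friday",
    "c": "Notes on the merger timeline",
}
print(keyword_candidates(documents, ["merger"]))  # ['a', 'c']
```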
9. Machine learning
There are two types of machine learning (i.e., bottom-up, statistical) systems.
Supervised learning
Unsupervised learning
In practice, aspects of both approaches can be used in a PC system.
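As a minimal sketch of the supervised case, the code below trains on a small labeled seed set and marks a new document as relevant when its word counts are closer (by cosine similarity) to the relevant seeds than to the non-relevant ones. This is one simple technique chosen for illustration, not the method of any particular PC product, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts for a text; no 'knowledge' of meaning is involved."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(seed_docs):
    """seed_docs: list of (text, is_relevant) pairs labeled by human reviewers.
    Returns one summed word-count vector per label."""
    profiles = {True: Counter(), False: Counter()}
    for text, is_relevant in seed_docs:
        profiles[is_relevant].update(bag_of_words(text))
    return profiles

def predict(profiles, text):
    """True if the text is more similar to the relevant seeds."""
    vector = bag_of_words(text)
    return cosine(vector, profiles[True]) > cosine(vector, profiles[False])

# Hypothetical seed set.
seeds = [
    ("merger agreement signed by board", True),
    ("board approves merger terms", True),
    ("lunch menu for friday", False),
    ("office party on friday", False),
]
profiles = train(seeds)
print(predict(profiles, "draft merger agreement"))  # True
print(predict(profiles, "friday lunch plans"))      # False
```

An unsupervised system, by contrast, would group the documents without any labeled seeds and leave it to a reviewer to decide which groups matter.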
10. Decision Support System
Since a complex system such as PC cannot envision every possibility, an actual system would incorporate aspects of a DSS (Decision Support System), which allows a human to intervene at appropriate points and, given feedback and diagnostics by the program, make "decisions" for the program to continue.
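One way such an intervention point might look is sketched below: the program decides on its own when its relevance score is clearly high or clearly low, and asks a human when the score falls in between. The score scale, the thresholds, and the `ask_human` callback are all assumptions made for this example.

```python
def decide(score, ask_human, low=0.4, high=0.6):
    """Return True if a document should be treated as relevant.

    score: the program's relevance estimate, assumed to lie in [0, 1].
    ask_human: callback that receives the score (the diagnostic shown
    to the reviewer) and returns the human's decision.
    """
    if score >= high:
        return True
    if score <= low:
        return False
    # Uncertain band: hand the decision to the human.
    return ask_human(score)

# Hypothetical reviewer who accepts every document put in front of them.
reviewer = lambda score: True
print(decide(0.9, reviewer), decide(0.1, reviewer), decide(0.5, reviewer))
# True False True
```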