Data science: decision trees
1. Data science: decision trees
2. Computer trees
In computer science, a tree is a data structure that is a connected graph with no cycles.
A rooted tree is a tree with one node designated as a root.
A tree has internal nodes and leaves (nodes with no children).
The branches connecting nodes are called edges.
3. Math
Note: Many mathematicians use the following terminology for graphs.
Nodes are called vertices.
Edges are called arcs.
4. Expression tree
Here is a computer science expression tree for the following expression.
( X & Y ) | ( ( ! X ) & ( ! Y ) )
Computer science trees are usually drawn with the root at the top, branching downward, as one would write on a piece of paper (top to bottom on the page).
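Here is a minimal Python sketch, not part of the original notes, of how such an expression tree could be represented and evaluated; the Node class and evaluate function are illustrative names.

    class Node:
        """A node in an expression tree: an operator or a variable leaf."""
        def __init__(self, value, left=None, right=None):
            self.value = value      # "&", "|", "!", or a variable name
            self.left = left
            self.right = right

    def evaluate(node, env):
        """Recursively evaluate the tree using variable bindings in env."""
        if node.value == "&":
            return evaluate(node.left, env) and evaluate(node.right, env)
        if node.value == "|":
            return evaluate(node.left, env) or evaluate(node.right, env)
        if node.value == "!":
            return not evaluate(node.left, env)   # unary: left child only
        return env[node.value]                    # a leaf: look up the variable

    # The root "|" is at the top; the "&" subtrees branch downward.
    tree = Node("|",
                Node("&", Node("X"), Node("Y")),
                Node("&", Node("!", Node("X")), Node("!", Node("Y"))))

    print(evaluate(tree, {"X": True, "Y": True}))    # True
    print(evaluate(tree, {"X": False, "Y": True}))   # False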
5. Decision tree
Decision trees are used in data science, business, etc., and often have expected values for branch outcomes based on probabilities and associated costs.
Decision trees usually have the root at the left and branching to the right.
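As a quick illustration with made-up numbers, the expected value of a branch is the probability-weighted sum of its outcome values.

    # Hypothetical branch: 70% chance of gaining 100, 30% chance of losing 50.
    outcomes = [(0.7, 100.0), (0.3, -50.0)]
    expected_value = sum(p * v for p, v in outcomes)
    print(expected_value)   # 0.7*100 + 0.3*(-50) = 55.0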
6. Data
There are times when data needs to be grouped and/or partitioned according to some rules that are to be determined. This is where data science decision trees are useful.
7. Rules
A decision tree (DT) starts with a decision to be made, then another, and so on, until a result at a leaf is reached.
Code-based rules are inflexible and hard to change (like code in general).
Data-based rules are flexible and easy to change (like data in general).
Decision science trees are based on data-based rules. From here on, the term decision tree will be used for the type of decision trees used in data science.
Human intuition may be useful for some small problems, but, in general, automated data science techniques are needed for larger problems.
8. Data
The data is assumed to be in the form of rows, where each row has the independent variables and one result, the dependent variable.
The goal of the DT is, given a set of independent variables, to predict the value of the dependent variable.
9. Supervised learning
A decision tree uses a supervised learning process. That is, the answer for the independent values in each data row is known: the dependent value.
Then the following is done.
Develop the tree on part of the data - the training set.
Test the tree on the rest of the data - the test set.
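Here is a minimal sketch of such a split, assuming Python with scikit-learn; the data values are invented.

    from sklearn.model_selection import train_test_split

    X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]]   # independent variables
    y = [0, 1, 1, 0, 1, 0]                                  # known dependent values

    # Hold out 1/3 of the rows for testing; develop the tree on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)
    print(len(X_train), len(X_test))   # 4 2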
10. Student example
Suppose data is collected for students in an introductory programming course.
The independent variables are data such as the following.
Time started an assignment or lab.
Time submitted an assignment or lab.
Number of practice questions taken per exam or quiz question.
Amount of time taken per question on an exam or quiz.
... and so on ...
The dependent variable is whether the student passed or failed (or, for more values, the letter grade).
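Such data might be laid out as follows; this is a sketch with invented column names and values, one row per student, and the pass/fail label as the dependent variable.

    # hours_before_due, hours_spent, practice_per_question, minutes_per_question
    rows = [
        (48.0, 3.5, 4.0, 2.1),
        ( 2.0, 1.0, 0.5, 0.8),
        (24.0, 2.0, 2.5, 1.5),
    ]
    labels = ["pass", "fail", "pass"]   # dependent variable (hypothetical)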
11. Data split
The data can be split into a training set and a test set.
After training, the results can be used to evaluate how well a student might do partway into a course. By extension, this includes a student who did not actually finish the course.
Such early detection could help by providing a reason for early intervention, etc.
12. Decision science trees
There are two main types of decision science trees and techniques used with those trees.
For a categorical (integer) dependent variable: classification tree
For a continuous (floating-point) dependent variable: regression tree
Note: Many continuous variables can be converted to categorical variables through a suitable binning process, as sketched below.
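Here is a minimal binning sketch, assuming Python with NumPy; the cut points are chosen arbitrarily.

    import numpy as np

    minutes = np.array([0.4, 1.2, 2.8, 0.9, 3.5])   # continuous variable
    cuts = [1.0, 2.0]                               # arbitrary bin boundaries
    categories = np.digitize(minutes, cuts)         # 0 = fast, 1 = medium, 2 = slow
    print(categories)                               # [0 1 2 0 2]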
13. Decision tree algorithms
Common decision tree algorithms include ID3, C4.5, and CART (Classification and Regression Trees).
14. Operational method
In more detail, this is the process.
Build the tree:
Prepare the data
Split the data (training and test)
Train the classifier (or regressor)
Use the tree:
Make predictions (using the test set)
Determine the accuracy (accuracy score)
Adjust as needed
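Here is a minimal sketch of that process, assuming Python with scikit-learn and using a built-in sample dataset in place of real data.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # prepare the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)    # split the data

    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)                    # train the classifier

    y_pred = clf.predict(X_test)                 # make predictions
    print(accuracy_score(y_test, y_pred))        # determine the accuracy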
15. Gini and gain
Impurity - non-uniform results (leaves) in the data (of a subtree)
Gini index - a measure of impurity: how likely an item is to be incorrectly classified (lower is better)
Entropy - a measure of uncertainty
Information gain - the reduction in entropy from a split
(A confusion matrix can be used to evaluate the resulting classifications.)
Impurity is a measure of the non-uniform results in the data (of a subtree). A common measure is the Gini index, as sketched below.
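Here is a sketch of the two impurity measures in plain Python; the function names are illustrative.

    import math
    from collections import Counter

    def gini(labels):
        """Gini index: chance a random item would be misclassified by a
        random label draw (0 for a pure node; lower is better)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        """Entropy in bits: a measure of uncertainty (0 for a pure node)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    pure, mixed = ["a"] * 4, ["a", "a", "b", "b"]
    print(gini(pure), gini(mixed))        # 0.0 0.5
    print(entropy(pure), entropy(mixed))  # -0.0 1.0

The information gain of a split is the entropy of the parent node minus the size-weighted average entropy of the child nodes.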
16. Overfitting
Overfitting is, in effect, memorizing the provided example data, so that that data is recognized but other data is not recognized as well as it should be.
17. Complex trees
Decision trees can become overly complex.
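One common remedy, sketched here assuming scikit-learn, is to limit growth up front (max_depth, min_samples_leaf) or prune afterwards (cost-complexity pruning via ccp_alpha); the parameter values are arbitrary.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    full = DecisionTreeClassifier(random_state=0).fit(X, y)
    small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                   ccp_alpha=0.01, random_state=0).fit(X, y)

    # The limited tree has far fewer nodes and is less likely to overfit.
    print(full.tree_.node_count, small.tree_.node_count)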
18. Stability
Decision trees can be unstable in that small changes in the data can cause the resulting tree to vary more than would be expected.
19. Random forests
Random forests are a way of using ensemble learning to avoid some of the pitfalls of decision trees (overfitting, instability).
For the same reason that hill climbing algorithms use stochastic processes to "mix things up", random forests use randomness to build many varied trees whose combined vote is better than a single decision tree.
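Here is a minimal sketch, assuming scikit-learn: each tree in the forest is trained on a random bootstrap sample and considers a random subset of features at each split, and the trees vote.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(forest, X, y, cv=5).mean())   # cross-validated accuracy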
20. End of page