Data science: decision trees
1. Data science: decision trees
2. Computer trees
In computer science, a tree is a data structure that is a connected graph with no cycles.
A rooted tree is a tree with one node designated as a root.
A tree has internal nodes and leaves (nodes with no children).
The branches connecting nodes are called edges.
3. Math
Note: Many mathematicians use the following terminology for graphs.
Nodes are called vertices.
Edges are called arcs.
4. Expression tree
Here is a computer science expression tree for the following expression.
( X & Y ) | ( ( ! X ) & ( ! Y ) )
Computer science trees are usually drawn with the root at the top, branching downward, as one would write on a piece of paper (top to bottom on the page).
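Here is a minimal Python sketch, not part of the original notes, of how such an expression tree could be represented and evaluated; the Node class and evaluate function are illustrative names.

    class Node:
        """A node in an expression tree: an operator or a variable leaf."""
        def __init__(self, value, left=None, right=None):
            self.value = value      # "&", "|", "!", or a variable name
            self.left = left
            self.right = right

    def evaluate(node, env):
        """Recursively evaluate the tree using variable bindings in env."""
        if node.value == "&":
            return evaluate(node.left, env) and evaluate(node.right, env)
        if node.value == "|":
            return evaluate(node.left, env) or evaluate(node.right, env)
        if node.value == "!":
            return not evaluate(node.left, env)   # unary: left child only
        return env[node.value]                    # a leaf: look up the variable

    # The root "|" is at the top; the "&" subtrees branch downward.
    tree = Node("|",
                Node("&", Node("X"), Node("Y")),
                Node("&", Node("!", Node("X")), Node("!", Node("Y"))))

    print(evaluate(tree, {"X": True, "Y": True}))    # True
    print(evaluate(tree, {"X": False, "Y": True}))   # False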
5. Decision tree
Decision trees are used in data science, business, etc., and often have expected values for branch outcomes based on probabilities and associated costs.
Decision trees usually have the root at the left and branching to the right.
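As a quick illustration with made-up numbers, the expected value of a branch is the probability-weighted sum of its outcome values.

    # Hypothetical branch: 70% chance of gaining 100, 30% chance of losing 50.
    outcomes = [(0.7, 100.0), (0.3, -50.0)]
    expected_value = sum(p * v for p, v in outcomes)
    print(expected_value)   # 0.7*100 + 0.3*(-50) = 55.0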
6. Data
There are times when data needs to be grouped and/or partitioned according to some rules that are to be determined. This is where data science decision trees are useful.
7. Rules
A decision tree (DT) starts with a decision to be made, then another, and so on, until a result at a leaf is reached.
Code-based rules are inflexible and hard to change (like code in general).
Data-based rules are flexible and easy to change (like data in general).
Decision science trees are based on data-based rules. From here on, the term decision tree will be used for the type of decision trees used in data science.
Human intuition may be useful for some small problems, but, in general, automated data science techniques are needed for larger problems.
8. Data
The data is assumed to be in the form of rows, where each row has the independent variables and one result, the dependent variable.
The goal of the DT is, given a set of independent variables, to predict the value of the dependent variable.
9. Supervised learning
A decision tree uses a supervised learning process. That is, the answer for the independent values in each data row is known: the dependent value.
Then the following is done.
Develop the tree on part of the data - the training set.
Test the tree on the rest of the data - the test set.
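Here is a minimal sketch of such a split, assuming Python with scikit-learn; the data values are invented.

    from sklearn.model_selection import train_test_split

    X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]]   # independent variables
    y = [0, 1, 1, 0, 1, 0]                                  # known dependent values

    # Hold out 1/3 of the rows for testing; develop the tree on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)
    print(len(X_train), len(X_test))   # 4 2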
10. Student example
Suppose data is collected for students in an introductory programming course.
The independent variables are data such as the following.
Time started an assignment or lab.
Time submitted an assignment or lab.
Number of practice questions taken per exam or quiz question.
Amount of time taken per question on an exam or quiz.
... and so on ...
The dependent variable is whether the student passed or failed (or, for more values, the letter grade).
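Such data might be laid out as follows; this is a sketch with invented column names and values, one row per student, and the pass/fail label as the dependent variable.

    # hours_before_due, hours_spent, practice_per_question, minutes_per_question
    rows = [
        (48.0, 3.5, 4.0, 2.1),
        ( 2.0, 1.0, 0.5, 0.8),
        (24.0, 2.0, 2.5, 1.5),
    ]
    labels = ["pass", "fail", "pass"]   # dependent variable (hypothetical)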
11. Data split
The data can be split into a training set and a test set.
After training, the results can be used to evaluate how well a student might do partway into a course. By extension, this includes a student who did not actually finish the course.
Such early detection could help by providing a reason for early intervention, etc.
12. Decision science trees
There are two main types of decision science trees and techniques used with those trees.
For a categorical (integer) dependent variable: classification tree
For a continuous (floating-point) dependent variable: regression tree
Note: Many continuous variables can be converted to categorical variables through a suitable binning process, as sketched below.
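Here is a minimal binning sketch, assuming Python with NumPy; the cut points are chosen arbitrarily.

    import numpy as np

    minutes = np.array([0.4, 1.2, 2.8, 0.9, 3.5])   # continuous variable
    cuts = [1.0, 2.0]                               # arbitrary bin boundaries
    categories = np.digitize(minutes, cuts)         # 0 = fast, 1 = medium, 2 = slow
    print(categories)                               # [0 1 2 0 2]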
13. Decision tree algorithms
Common decision tree algorithms include ID3, C4.5, and CART (Classification and Regression Trees).
14. Operational method
In more detail, this is the process.
Build the tree:
Prepare the data
Split the data (training and test)
Train the classifier (or regressor)
Use the tree:
Make predictions (using the test set)
Determine the accuracy (accuracy score)
Adjust as needed
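Here is a minimal sketch of that process, assuming Python with scikit-learn and using a built-in sample dataset in place of real data.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # prepare the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)    # split the data

    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)                    # train the classifier

    y_pred = clf.predict(X_test)                 # make predictions
    print(accuracy_score(y_test, y_pred))        # determine the accuracy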
15. Gini and gain
Impurity - non-uniform results (leaves) in the data (of a subtree)
Gini index - a measure of impurity: how likely an item is to be incorrectly classified (lower is better)
Entropy - a measure of uncertainty
Information gain - the reduction in entropy from a split
(A confusion matrix can be used to evaluate the resulting classifications.)
Impurity is a measure of the non-uniform results in the data (of a subtree). A common measure is the Gini index, as sketched below.
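Here is a sketch of the two impurity measures in plain Python; the function names are illustrative.

    import math
    from collections import Counter

    def gini(labels):
        """Gini index: chance a random item would be misclassified by a
        random label draw (0 for a pure node; lower is better)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        """Entropy in bits: a measure of uncertainty (0 for a pure node)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    pure, mixed = ["a"] * 4, ["a", "a", "b", "b"]
    print(gini(pure), gini(mixed))        # 0.0 0.5
    print(entropy(pure), entropy(mixed))  # -0.0 1.0

The information gain of a split is the entropy of the parent node minus the size-weighted average entropy of the child nodes.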
16. Overfitting
Overfitting is, in effect, memorizing the provided example data, so that that data is recognized but other data is not recognized as well as it should be.
17. Complex trees
Decision trees can become overly complex.
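One common remedy, sketched here assuming scikit-learn, is to limit growth up front (max_depth, min_samples_leaf) or prune afterwards (cost-complexity pruning via ccp_alpha); the parameter values are arbitrary.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    full = DecisionTreeClassifier(random_state=0).fit(X, y)
    small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                   ccp_alpha=0.01, random_state=0).fit(X, y)

    # The limited tree has far fewer nodes and is less likely to overfit.
    print(full.tree_.node_count, small.tree_.node_count)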
18. Stability
Decision trees can be unstable in that small changes in the data can cause the resulting tree to vary more than would be expected.
19. Random forests
Random forests are a way of using ensemble learning to avoid some of the pitfalls of decision trees (overfitting, instability).
For the same reason that hill climbing algorithms use stochastic processes to "mix things up", random forests use randomness to build many varied trees whose combined vote is better than a single decision tree.
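Here is a minimal sketch, assuming scikit-learn: each tree in the forest is trained on a random bootstrap sample and considers a random subset of features at each split, and the trees vote.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(forest, X, y, cv=5).mean())   # cross-validated accuracy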
20. End of page