Data science: decision trees
by RS  admin@robinsnyder.com


1. Data science: decision trees
This page looks at the data science concept of decision trees.

A DT (Decision Tree) is a predictive model built by a family of algorithms that are part of what is called ML (Machine Learning).

2. Computer trees
In computer science, a tree is a data structure: a connected graph with no cycles, usually with one node designated as the root.

3. Math
Note: Many mathematicians use graph terminology for trees: the nodes are called vertices, the links are called edges, and a tree is a connected acyclic graph.

4. Expression tree
Expression tree for (X & Y) | ((! X) & (! Y))
Here is a computer science expression tree for the following expression.
( X & Y ) | ( ( ! X ) & ( ! Y ) )
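As a concrete illustration (a minimal sketch in Python, not part of the original page), such a tree can be built from small node objects and evaluated from the leaves upward:

class Node:
    def __init__(self, op, left=None, right=None):
        self.op = op          # "&", "|", "!", or a variable name
        self.left = left
        self.right = right
    def eval(self, env):
        if self.op == "&":
            return self.left.eval(env) and self.right.eval(env)
        if self.op == "|":
            return self.left.eval(env) or self.right.eval(env)
        if self.op == "!":
            return not self.left.eval(env)
        return env[self.op]   # leaf: look up the variable's value

# ( X & Y ) | ( ( ! X ) & ( ! Y ) )
tree = Node("|",
            Node("&", Node("X"), Node("Y")),
            Node("&", Node("!", Node("X")), Node("!", Node("Y"))))
for x in (False, True):
    for y in (False, True):
        print(x, y, tree.eval({"X": x, "Y": y}))  # True exactly when X == Y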

Computer science trees are usually drawn with the root at the top and the branches growing downward, as one would write on a page (top to bottom).

5. Decision tree
Decision trees are used in data science, business, etc., and often have expected values for branch outcomes based on probabilities and associated costs.

Decision trees usually have the root at the left and branching to the right.
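As a small worked example (the probabilities and payoffs are hypothetical), the expected value of a branch is the probability-weighted sum of its outcomes:

def expected_value(outcomes):
    # Each outcome is a (probability, payoff) pair.
    return sum(p * v for p, v in outcomes)

option_a = [(0.6, 100), (0.4, -50)]   # 60% gain 100, 40% lose 50
option_b = [(1.0, 30)]                # a certain gain of 30
print("A:", expected_value(option_a))  # 0.6*100 + 0.4*(-50) = 40.0
print("B:", expected_value(option_b))  # 30.0, so option A wins on EV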

6. Data
There are times when data needs to be grouped and/or partitioned according to some rules that are to be determined. This is where data science decision trees are useful.

7. Rules
A DT starts with a decision to be made, then another, and so on, until a result at a leaf is reached. Decision science trees are based on rules derived from data. From here on, the term decision tree will be used for the type of decision tree used in data science.
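For example, a small hypothetical rule tree (the attributes and values are made up) can be read as nested if-statements, each testing one variable until a leaf is reached:

def predict(row):
    if row["outlook"] == "sunny":
        return "no" if row["humidity"] == "high" else "yes"
    if row["outlook"] == "rain":
        return "no" if row["windy"] else "yes"
    return "yes"  # overcast

print(predict({"outlook": "sunny", "humidity": "high", "windy": False}))  # no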

Human intuition may be useful for some small problems, but, in general, automated data science techniques are needed as problems grow larger.

8. Data
The data is assumed to be in the form of rows, where each row contains the independent variables and one result, the dependent variable.

The goal of the DT is, given a set of independent variables, to predict the value of the dependent variable.
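In code form (a sketch, with hypothetical feature names and values), such rows might look like this:

rows = [
    # (homework_avg, quiz_avg, attendance) -> result
    ((85, 90, 0.95), "pass"),
    ((40, 55, 0.60), "fail"),
    ((75, 65, 0.80), "pass"),
    ((50, 45, 0.70), "fail"),
]
X = [features for features, label in rows]  # independent variables
y = [label for features, label in rows]     # dependent variable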

9. Supervised learning
A decision tree uses a supervised learning process. That is, the answer for the independent values in each data row is known: it is the dependent value.

The general process is described in the sections that follow.

10. Student example
Suppose data is collected for students in an introductory programming course.

The independent variables are data about each student (for example, homework scores, quiz scores, and attendance). The dependent variable is whether the student passed or failed (or, for more values, the letter grade).
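A minimal sketch of this example, assuming scikit-learn is available and using entirely made-up student data:

from sklearn.tree import DecisionTreeClassifier

# Features: homework average, quiz average, attendance fraction.
X = [[85, 90, 0.95], [40, 55, 0.60], [75, 65, 0.80],
     [50, 45, 0.70], [90, 80, 0.85], [35, 50, 0.50]]
y = ["pass", "fail", "pass", "fail", "pass", "fail"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict([[70, 60, 0.75]]))  # predicted outcome for a new student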

11. Data split
The data can be split into a training set and a test set.
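Continuing the sketch (scikit-learn's train_test_split, same hypothetical data), the split and a test-set evaluation might look like this:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[85, 90, 0.95], [40, 55, 0.60], [75, 65, 0.80], [50, 45, 0.70],
     [90, 80, 0.85], [35, 50, 0.50], [65, 70, 0.90], [55, 40, 0.65]]
y = ["pass", "fail", "pass", "fail", "pass", "fail", "pass", "fail"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))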

After training, the model can be used to estimate, partway into a course, how well a student is likely to do. This would include (by extension) a student who did not actually finish the course.

Such early detection could help by providing a reason for early intervention, etc.

12. Decision science trees
There are two main types of decision science trees, with corresponding techniques: classification trees, for categorical outcomes, and regression trees, for continuous outcomes. Note: Many continuous variables can be converted to categorical variables through a suitable binning process.
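For instance (a sketch using pandas.cut, with arbitrary bin edges), numeric scores can be binned into letter-grade categories:

import pandas as pd

scores = pd.Series([35, 48, 62, 71, 88, 93])
grades = pd.cut(scores,
                bins=[0, 60, 70, 80, 90, 100],
                labels=["F", "D", "C", "B", "A"])
print(grades.tolist())  # ['F', 'F', 'D', 'C', 'B', 'A']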


13. Decision tree algorithms
There are many decision tree algorithms and, for each, many variations. Some of the more important ones include ID3, C4.5, and CART (Classification and Regression Trees).

14. Operational method
In more detail, the process is as follows: start at the root with all of the training data; choose the split on an independent variable that best separates the outcomes; partition the data by that split; and repeat on each partition until a stopping condition (such as a pure or very small leaf) is reached.
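A minimal sketch of that greedy process on categorical data (plain Python, using Gini impurity as the split criterion):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, labels, attr):
    # Weighted impurity of the partitions induced by attr.
    total, acc = len(rows), 0.0
    for value in set(r[attr] for r in rows):
        part = [l for r, l in zip(rows, labels) if r[attr] == value]
        acc += len(part) / total * gini(part)
    return acc

def build(rows, labels, attrs):
    if gini(labels) == 0.0 or not attrs:   # pure leaf, or nothing left to split
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: split_impurity(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree[value] = build(list(sub_rows), list(sub_labels),
                            [a for a in attrs if a != best])
    return (best, tree)

rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": False}]
print(build(rows, ["yes", "no", "yes"], ["outlook", "windy"]))
# ('windy', {False: 'yes', True: 'no'})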

15. Gini and gain
(Figure: confusion matrix)

Impurity is a measure of how mixed the outcomes are in the data of a subtree. A common measure is Gini impurity; the gain of a split is the reduction in impurity that the split achieves.
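A worked example with hypothetical counts: a parent node with a 5/5 class mix, split into two children with 4/1 and 1/4 mixes.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([5, 5])                   # 0.5 (a 50/50 mix is maximally impure)
children = (5/10) * gini([4, 1]) + (5/10) * gini([1, 4])  # 0.32
print("gain:", parent - children)       # 0.5 - 0.32 = 0.18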

16. Overfitting
Overfitting is, in effect, memorizing the provided example data, so that that data is recognized but other data is not recognized as well as it should be.

17. Complex trees
Decision trees can become overly complex. Limiting the depth or size of the tree (pruning) helps keep it general.
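A sketch of that remedy (scikit-learn assumed, synthetic data): compare an unconstrained tree against a size-limited one.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=0).fit(X_tr, y_tr)

# An unconstrained tree often scores ~1.0 on training data but worse on
# test data; the constrained tree usually generalizes better.
for name, model in (("deep", deep), ("shallow", shallow)):
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))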

18. Stability
Decision trees can be unstable, in that small changes in the data set can cause the resulting tree to vary more than one would expect.
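A quick way to see this instability (scikit-learn assumed, synthetic data) is to drop a few rows and compare the learned rules:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=4, random_state=1)
full = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
most = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[5:], y[5:])

# The printed rules often differ even though only 5 rows were dropped.
print(export_text(full))
print(export_text(most))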

19. Random forests
Random forests are a way of using ensemble learning to avoid several pitfalls of individual decision trees, such as overfitting and instability.

For the same reason that hill-climbing algorithms use randomness to "mix things up", random forests use randomness (bootstrap samples of the rows and random subsets of the features) to build many varied trees whose combined vote is usually better than any single decision tree.
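A sketch of a random forest in use (scikit-learn assumed, synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample with random feature subsets;
# their majority vote is the forest's prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))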

20. End of page
