Machine Learning Essentials: Part 2
Classification and Regression Tree (CART)
The decision tree method known as CART was introduced by Leo Breiman and his co-authors in 1984. It forms the basis for random forests and many other commonly used methods. The aim is to convert complex structures in the dataset into simple decision structures: a heterogeneous dataset is divided into subgroups that are as homogeneous as possible with respect to a specified target variable.
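To make "dividing into homogeneous subgroups" concrete, here is a minimal sketch for a numeric target. It assumes variance as the measure of heterogeneity, which is the usual choice for regression trees (classification trees typically use an impurity measure such as Gini instead). The data, the names `rows`, `weighted_variance`, and `split_score`, and the threshold are all made up for illustration.

```python
from statistics import pvariance

# Hypothetical toy data: (years_of_experience, salary). Values are invented.
rows = [(1, 400), (2, 450), (3, 500), (5, 700), (6, 750), (8, 900)]


def weighted_variance(groups):
    """Average variance of the groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    # Empty groups are skipped because pvariance() needs at least one value.
    return sum(len(g) / total * pvariance(g) for g in groups if g)


def split_score(rows, threshold):
    """Heterogeneity left over after splitting on experience <= threshold."""
    left = [salary for years, salary in rows if years <= threshold]
    right = [salary for years, salary in rows if years > threshold]
    return weighted_variance([left, right])


# Variance of the target before any split, and after one candidate split.
before = pvariance([salary for _, salary in rows])
after = split_score(rows, threshold=4)

print(f"variance before split: {before:.1f}")
print(f"weighted variance after split: {after:.1f}")
```

On this toy data the weighted variance after the split is lower than the variance before it, which is what "more homogeneous subgroups" means here. A tree-growing procedure would try many candidate thresholds and keep the one with the lowest score.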
It operates on rules. For example, if a person has more than 4 years of experience, the salary is above 520; otherwise, the salary is below 520. A branch can be divided further. For instance, among people with more than 4 years of experience, there can be a further split based on language proficiency: if the number of languages known is more than 3, the salary is above 800; otherwise, it is below 600.
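The rules above can be written as nested conditions. This is only a sketch: the function name `predict_salary` is hypothetical, the leaf values 520, 600, and 800 are used as representative predictions for each branch, and the handling of the boundary cases (exactly 4 years, exactly 3 languages) is an assumption, since the example does not specify it.

```python
def predict_salary(years_experience: float, languages_known: int) -> int:
    """Follow the example's decision rules down to a leaf value."""
    # First decision point: years of experience.
    if years_experience > 4:
        # Second decision point, reached only on the high-experience branch.
        if languages_known > 3:
            return 800
        return 600
    # Low-experience branch ends directly in a leaf.
    return 520


print(predict_salary(2, 5))  # low experience, so languages are never checked
print(predict_salary(6, 2))  # high experience, few languages
print(predict_salary(6, 4))  # high experience, many languages
```

Each `if` corresponds to one decision point and each `return` to one endpoint of the tree.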
The points at which the independent variables are split are called internal nodes. In the example above, there are two internal nodes: one for the split on years of experience and one for the split on the number of languages known. There are three terminal nodes (leaves), with salaries of 520, 600, and 800, respectively.
Internal nodes represent the decision points, while terminal nodes (leaf nodes) represent the endpoint values.
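One way to see the difference between the two node types is to represent the example tree as a small data structure and count them. The class names `Leaf` and `Split` and the helper `count_nodes` are hypothetical, chosen for this sketch; the thresholds and leaf values come from the salary example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Leaf:
    value: float  # endpoint value returned for this branch


@dataclass
class Split:
    feature: str      # variable tested at this decision point
    threshold: float  # go left if feature <= threshold, else right
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def count_nodes(node: Node) -> Tuple[int, int]:
    """Return (internal_nodes, terminal_nodes) for the subtree at node."""
    if isinstance(node, Leaf):
        return 0, 1
    left_internal, left_leaves = count_nodes(node.left)
    right_internal, right_leaves = count_nodes(node.right)
    return 1 + left_internal + right_internal, left_leaves + right_leaves


# The salary example: split on experience, then on languages known.
tree = Split(
    "years_experience", 4,
    Leaf(520),
    Split("languages_known", 3, Leaf(600), Leaf(800)),
)

internal, leaves = count_nodes(tree)
print(internal, leaves)  # 2 internal nodes, 3 leaves
```

In a binary tree like this one, the number of leaves is always one more than the number of internal nodes.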