Machine Learning Essentials:Part 2

Huseyin Baytar
9 min readNov 25, 2023

Classification and Regression Tree ( CART )

Decision tree is a method proposed by Leo Breiman in 1982. It forms the basis for random forest and many other commonly used methods. The aim is to convert complex structures in the dataset into simple decision structures. Heterogeneous datasets are divided into homogeneous subgroups based on a specified target variable.

It operates on rules; for example, if a person’s years of experience are greater than 4, the salary is above 520, and if it is less than 4, the salary is below 520. These can have different sub-divisions, for example, if the years of experience are greater than 4 and the salary is above 520, but there can be a further division based on language proficiency. For instance, if the salary is above 520 and the number of languages known is more than 3, then the salary is above 800; if the number of languages known is less than 3, then the salary is below 600.

The points that split the independent variables are called internal node points. For example, in the case mentioned, there are two internal nodes — one for the division based on years of experience and the other on the number of languages known. There are four terminal nodes (leaves) — salaries of 520, 600, and 800, respectively.

Internal nodes represent the decision points, while terminal nodes(leaf node) represent the endpoint values. In the example above, there are two internal nodes and four terminal nodes.

The decision tree structure for a regression problem

if predictor A >= 1.7 then
if Predictor B >= 202.1 then outcome = 1.3
else outcome = 5.6
else outcome = 2.5

When using a decision tree, it provides decision rules. For example, it states that if predictor A is greater than or equal to 1.7, and if Predictor B is greater than or equal to 202.1, then the predicted outcome is 1.3. If not, the predicted outcome is 5.6. Additionally, if predictor A is less than 1.7, then the predicted outcome is 2.5.

If it’s so straightforward, why don’t we do this in Excel, Python, or SQL?

This is because when determining these decision rules algorithmically, we optimize ourselves by moving according to certain reference points to establish what these decision rules are. While we may manually decide to split at 1.7, the question is why did we choose 1.7? We use some decision rule metrics such as chi-square, Gini, Entropy, SSE, etc., to automatically split these subgroups. Our algorithms perform these splits based on these metrics.

What is Cost Function for Regression Problems?

The cost function for regression problems is the RSS (Residual Sum of Squares) function.

The answer to the question of how to build a tree is that the regions, leaves, or boxes where the RSS/SSE value is minimum are the points where the tree should be split. The actual values and the corresponding indicate the regions/leaves/boxes. The tree is split in all possible directions and the locations with the smallest errors i.e. SSE values are considered, completing the first branching. After splitting into two branches, each branch is treated as a separate set of data and the branching process is repeated.

The model doesn’t know how many times to perform the splitting process so we need to provide parameters. For example, if there are only two values left in a branch splitting them may lead to overfitting as the model would memorize the data. If not split, it preserves randomness. Max depth and min samples split parameters are crucial for controlling this behavior.

What is Cost Function for Classification Problems?

Gini and entropy are metrics we can use. We have real classes and predicted classes and these metrics evaluate the situation between them, providing information about our performance. They are purity measures and the lower they are, the better.

Entropy is a widely used value in various disciplines. The aspect we are interested in is ‘the lower the entropy, the better.’ We expect our predicted values to be not only accurate but also diverse and low in quantity.

Consider a branch where we have various predictions and two classes. Let’s say one class has a certain probability of occurrence multiplied by the logarithm of that probability. The lower the diversity of these values the better.

Gini starts from class K1 up to the total number of classes for example, the occurrence probability of class 0 multiplied by the probability of non-occurrence + the occurrence probability of class 1 multiplied by the probability of non-occurrence. Essentially, there is a situation similar to entropy here; the higher the probability of occurrence of one class, the lower the probability of occurrence of the other class

Random Forest

Random Forest is one of the methods created by Leo Breiman in 2001.Its foundation is based on aggregating predictions generated by multiple decision trees. It aims to address the overfitting tendency of the CART method by combining the predictions of multiple trees in Random Forest. Each tree is asked for its opinion, and the ensemble takes the majority of their predictions.It is a combination of Bagging and the random subspace method.Observations for trees are selected using the bootstrap random sampling method and variables are selected using the random subspace method.In each node of the decision tree the differentiating variable with the best split (information gain) is selected among a randomly chosen subset of variables, which is a smaller set compared to all variables.

When we talk about the randomness in Random Forest, two things come to mind:

1- It creates N trees by randomly selecting observations.

2- Before starting the split operations in the created trees, it randomly selects variables from the available variables in a smaller set.

In tree construction, two-thirds of the dataset is used, and the remaining data is used for evaluating the performance of the trees and determining variable importance. Random variable selection is made at each node.

Leo Breiman, in his method introduced in 1996, essentially applied what he did in the CART method to multiple trees. He probably thought that by making random selections in observations, he captured some randomness, so he later used the Random Subspace method, thinking that making random selections in variables would also be successful. By introducing randomness in both observations and variables, he created Random Forest.

Let’s say we have m observations, a random selection process is performed from these observations, selecting n observations (where n < m) using the bootstrap method. In other words, if there are 1000 observations, less than 750 are selected. The trees are independent of each other.

In the Bagging method, trees are not dependent on each other; in the Boosting method, trees are built on residuals, so there is dependence among the trees. Bagging is a model resistant to overfitting


GBM (Gradient Boosting Machine) is a tree-based method that operates based on optimization. It combines the boosting method with the application of gradient descent to tree methods.

AdaBoost (Adaptive Boosting) is a method on which the foundations of GBM are based. AdaBoost relies on the idea of weak classifiers coming together to create a strong classifier.

In GBM, a series of models in the form of a single predictive model is built on errors/residuals. After trees make predictions, boosting is done to correct their errors. A model in the series is built on the residuals of the previous model in the series.GBM uses the gradient descent algorithm, which can optimize any differentiable loss function. The series of models, each in the form of a single predictive model, is constructed in an additive manner.

What is additive modeling?

We can think of tree methods as dividing a box by drawing lines. In additive modeling, we shape this drawn line more precisely, allowing for more successful predictions. We aim to approach the true value by adding and subtracting the remaining errors from our old prediction result.


XGBoost is an optimized, scalable, and platform-integrated version of Gradient Boosting Machine (GBM) designed to enhance speed and prediction performance. It was introduced by Tianqi Chen in 2014 and has proven its success in various Kaggle competitions.


LightGBM is another GBM type that gained recognition in the literature and has proven its success in many competitions. Developed by Microsoft in 2017, LightGBM is designed to enhance the training time and performance of XGBoost.It employs a leaf-wise growth strategy instead of the level-wise growth strategy, leading to faster training times. While XGBoost performs an extensive search, LightGBM conducts a more in-depth search, resulting in improved performance.


CatBoost is the abbreviated form of “Category Boosting.” It was developed by Yandex in 2017. CatBoost is another variant of Gradient Boosting Machines (GBM) that can automatically handle categorical variables. It is known for its speed and success in dealing with categorical features

Imbalanced Datasets

Imbalanced datasets are observed in classification problems when class distributions are not close to each other. The problem arises from the dominance of the majority class over the minority class. The created model tends to be biased toward the majority class, leading to poor classification of the minority class.

Choosing the Right Metrics

Precision: It shows how many of the positively predicted values are actually positive. If precision is low, it indicates a high number of false positives.

Recall: It shows how many of the values that should be predicted as positive are actually predicted as positive. If recall is low, it indicates a high number of false negatives. It should be as high as possible.

F1 Score: Comparing two models in cases where one has low precision and high recall or vice versa can be challenging. F1 score helps make the comparison more comparable by measuring precision and recall at the same time. It represents the harmonic mean of precision and recall.

Support: It represents the number of true values for each class. It can show the imbalance in the number of observations between classes, indicating structural weaknesses in measurements.

ROC Curve: The ROC curve is a graph that shows the performance of a classification model at all classification thresholds.

AUC (Area under the ROC Curve): It summarizes the ROC curve with a single number. It measures the entire two-dimensional area under the ROC curve from (0,0) to (1,1). The best value is 1, and the worst value is 0.5.


Resampling is the process of balancing the dataset by either adding new samples to the minority class or removing samples from the majority class.


Balancing the dataset by duplicating samples from the minority class.

Random Oversampling: Balancing the dataset by adding randomly selected samples from the minority class. This technique can be used when the dataset is small. It may lead to overfitting.

SMOTE Oversampling: Generating synthetic samples from the minority class to prevent overfitting. First, a random sample is selected from the minority class. Then, k nearest neighbors are found for this sample. One of the k nearest neighbors is randomly selected, and a synthetic sample is created by connecting it with the randomly selected sample from the minority class to form a line segment in the feature space.


Balancing the dataset by removing samples from the majority class. It can be useful when dealing with a large dataset. Since our dataset is not large, efficient results may not be achieved, but I will explain the methods and demonstrate how some of them can be applied.

Random Undersampling: Removed samples are randomly selected. This technique can be used when dealing with a large dataset. Information loss may occur due to random selection.

NearMiss Undersampling: Prevents information loss. Based on the KNN algorithm. Computes the distance between samples from the majority class and samples from the minority class. Samples with short distances are retained based on the specified k value.

Tomek links: Increases the gap between two classes by removing majority class samples between the nearest neighbors belonging to different classes.

Cluster Centroids: Removes irrelevant samples from the dataset. The importance of samples is determined through clustering, distinguishing between significant and insignificant samples. Combining undersampling and oversampling techniques can be used to create more balanced datasets.

I Explained everything more detailed with codes and their explanations on my kaggle notebook;

To Be Continued…