Machine Learning Essentials: Part 1

Huseyin Baytar
8 min read · Nov 21, 2023

What is Machine learning?

Machine learning is the scientific field dedicated to developing algorithms and techniques that enable computers to learn in a manner similar to humans. Problems where the dependent variable (target) is numerical are regression problems, whereas problems where the dependent variable is categorical are classification problems.

Example of regression: predicting house prices.

Example of classification: predicting whether a Titanic passenger survived or died.

Supervised Learning:

If our dataset includes labels (a target), it is supervised learning. It is based on learning the relationship between the dependent and independent variables, aiming to predict the target for new observations.

Unsupervised Learning:

If there is no target in the dataset, it is unsupervised learning. Techniques like clustering (segmentation) can be applied.

Reinforcement Learning:

Imagine a robot in an empty room trying to find the exit. This is a reinforcement learning setting: the robot is punished for each wrong move and learns to navigate the room through trial and error, reinforcing successful actions.

Model Evaluation Metrics for Regression Models

Mean Squared Error (MSE):

This metric measures the average squared difference between the actual and predicted values. It is commonly used both for model evaluation and as the objective in optimization methods. Calculation: sum the squared differences between actual and predicted values, then divide by the number of observations.

Root Mean Squared Error (RMSE):

RMSE is derived from MSE by taking its square root. Because MSE squares the errors, taking the square root brings the metric back to the original scale of the target, so it can be read as a typical magnitude of the error.

Mean Absolute Error (MAE):

MAE measures the average absolute difference between actual and predicted values. Because the errors are not squared, it is less sensitive to outliers than MSE and RMSE, which makes it suitable when you want to limit their impact. Calculation: sum the absolute differences between actual and predicted values, then divide by the number of observations.
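
The three regression metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the house prices and predictions are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, expressed in the original unit of the target."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices (in thousands) and model predictions.
actual = [200.0, 250.0, 300.0, 400.0]
predicted = [210.0, 240.0, 320.0, 360.0]

print("MSE :", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("MAE :", mae(actual, predicted))
```

The single large error (40) dominates MSE and RMSE far more than it dominates MAE, which is the outlier sensitivity described above.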

Model Evaluation on Classification Models

In classification models, the evaluation metric for success is often expressed as accuracy. Accuracy measures the proportion of correctly classified instances out of the total classified instances. The formula for accuracy is:

Accuracy = Correct Predictions / Total Predictions

“Correct Predictions” refers to the instances that were correctly predicted by the model.

“Total Predictions” is the sum of all instances that the model has attempted to classify.

This metric provides a straightforward measure of the overall performance of a classification model by indicating the percentage of correctly predicted instances among all predictions.

Model Validation Methods

Model validation is the effort to measure the performance of the obtained models as reliably as possible, so that we know how well they will do on data they have not seen.

Holdout Method

The original dataset is divided into two parts: a training set and a test set. The model is trained on the training set, and then tested on the separate test set to evaluate its performance.

K-Fold Cross-Validation

The original dataset is divided into k folds (subsets). The model is trained k times, each time using k-1 folds for training and the remaining fold for testing, so that every fold serves as the test fold exactly once. The average performance across all folds is used as the overall performance measure. This helps reduce the impact of chance in the data split.

Cross-validation can be used in two ways:

First Method: The original dataset is divided into k folds, and the model is trained and tested k times, each time using a different fold as the test set.

Second Method: The dataset is first split into a training set (e.g., 80%) and a test set (e.g., 20%), and cross-validation is then performed only on the training set. The untouched 20% is used for a final evaluation, ensuring that the model is tested on data it has never seen during training or cross-validation.
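
The second method can be sketched with index lists in plain Python. This is a minimal sketch under assumptions: the helper names `holdout_split` and `k_fold_indices`, the dataset size of 100, and the seed are all hypothetical, and in practice a library such as scikit-learn would usually handle the splitting.

```python
import random

def holdout_split(n_samples, test_ratio=0.2, seed=42):
    """Shuffle the row indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each fold is the validation fold once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 80% of the rows for training, 20% held out for the final test.
train_idx, test_idx = holdout_split(100, test_ratio=0.2)

# Cross-validation runs only on the training indices.
for fold_no, (tr, va) in enumerate(k_fold_indices(train_idx, k=5), start=1):
    print(f"fold {fold_no}: train on {len(tr)} rows, validate on {len(va)} rows")
```

Calling `k_fold_indices` on all row indices instead of `train_idx` gives the first method.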

Bias-Variance Tradeoff

Overfitting

Overfitting occurs when a model learns the data too well, memorizing the specifics of the data rather than generalizing its structure. For example, if a child is given similar exam questions before a test and memorizes them, they might struggle when faced with different questions during the actual test. Overfitting happens when the model memorizes the data instead of learning its underlying patterns.

Underfitting

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. Such a model fits the observations poorly, so its error is high on both the training set and the test set.

How to Address Overfitting

To identify overfitting, one can compare training error and test error. If they start to diverge, it indicates overfitting. Monitoring the change in errors within both the training and test sets helps detect overfitting early. Stopping the training at the point where the training and test errors begin to diverge can mitigate overfitting.
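
One common way to turn this idea into a rule is early stopping with a "patience" setting. The sketch below is an illustration only: the `patience` parameter and the error curves are hypothetical and are not taken from the text above.

```python
def early_stopping_epoch(test_errors, patience=2):
    """Return the epoch with the lowest test error seen before the error
    failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch

# Hypothetical error curves: training error keeps falling,
# test error starts rising after epoch 3 (the curves diverge).
train_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
test_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]

stop = early_stopping_epoch(test_errors, patience=2)
print("stop training at epoch", stop)
```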

Linear Regression

Our objective is to model the relationship between the dependent and independent variable(s) linearly. What we aim to learn from the data is the essence or the intrinsic information within it. For a single independent variable x, the model is:

y = b0 + b1 * x

b0 = beta, bias, intercept

b1 = coefficient, weight

Let’s say we have a multiple linear regression model:

y = b0 + b1 * x1 + b2 * x2 + b3 * x3

For example, in this context x1 represents the square footage of the building, x2 represents the age of the building, and x3 represents the distance of the building to the subway, while b1, b2, and b3 are their weights.

We would expect b1 to be positive, since the price increases with square footage, and b2 and b3 to be negative, since the price decreases as the age and the distance to the subway increase.

The weights are learned from the available data. Let’s say a new house comes in: when we input the features of the house, such as the square footage, age, and distance, the model provides a price estimate based on what it learned from previous houses.
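
As a small illustration of how such a prediction is computed, the sketch below multiplies each feature by its weight and adds the bias. The bias, weights, and house features are made-up numbers, not values learned from real data.

```python
def predict_price(features, weights, bias):
    """Linear prediction: bias + sum of weight * feature."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical learned parameters.
bias = 50.0                    # b0, the intercept
weights = [2.0, -1.5, -10.0]   # b1 (square footage), b2 (age), b3 (distance to subway)

# Hypothetical new house: 100 units of area, 10 years old, 2 km from the subway.
new_house = [100.0, 10.0, 2.0]

price = predict_price(new_house, weights, bias)
print("estimated price:", price)
```

The positive weight on area pushes the estimate up, while the negative weights on age and distance pull it down.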

Finding Weights

By minimizing the sum/average of the squared differences between actual and predicted values, we can find the values of b (bias/intercept) and w (coefficients/weights).

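We define the cost function from the MSE by dividing the sum of squared errors by 2m instead of m, where m is the number of observations: J(b, w) = (1 / (2m)) * Σ (ŷ_i - y_i)^2. The extra factor of 2 does not change where the minimum is; it only cancels the 2 that appears when taking the derivative.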

Parameter Estimation

The goal is to find the bias and weight that will minimize the error, typically measured by the mean squared error (MSE). Our objective is to explore possible combinations, aiming to reach the minimum point in the error function.

1 — Analytic Solution: Normal Equations Method (Ordinary Least Squares — OLS):

One way to find the bias and weights is through an analytic solution, using the normal equations method, which is widely used in statistics. However, in multiple linear regression the least squares solution requires inverting a matrix built from the data, and this becomes computationally challenging as the number of observations and variables increases.
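
For a single independent variable the analytic solution reduces to two short formulas: the weight is the covariance of x and y divided by the variance of x, and the bias follows from the means. The sketch below shows only that special case on made-up data; the matrix form for several variables is not shown here.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = b + w * x for one variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return b, w

# Hypothetical data generated from y = 1 + 2x with no noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

b, w = fit_simple_ols(x, y)
print("bias:", b, "weight:", w)
```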

2 — Optimization Method: Gradient Descent:

An alternative is to use an optimization method like Gradient Descent. This method works by iteratively changing the values of parameters. The iterative adjustment of parameter values helps converge towards the optimal values that minimize the error. Gradient Descent is particularly useful in scenarios where the normal equations method becomes computationally challenging, especially with a large number of observations and variables.

Gradient Descent for Linear Regression

Repeat until convergence: The algorithm iteratively updates the relevant parameter values towards the negative of the gradient obtained from the derivative, aiming to reach the parameter values that minimize the function.

Gradient Descent is employed to find parameter values that minimize the cost function. It iteratively updates the parameter values in the direction of the negative gradient, which is defined as the steepest descent.

Choosing a small learning rate is often preferable. If the learning rate is too large, the algorithm may overshoot the minimum and fail to converge; if it is too small, convergence is simply slower.
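
A minimal sketch of this loop for one independent variable follows. It assumes the cost is the sum of squared errors divided by 2m, uses a fixed number of iterations in place of a convergence check, and runs on made-up data; the learning rate and iteration count are illustrative choices.

```python
def gradient_descent(x, y, learning_rate=0.05, n_iterations=5000):
    """Fit y = b + w * x by repeatedly stepping against the gradient of the cost."""
    b, w = 0.0, 0.0
    m = len(x)
    for _ in range(n_iterations):
        # Prediction errors with the current parameters.
        errors = [(b + w * xi) - yi for xi, yi in zip(x, y)]
        # Partial derivatives of the cost with respect to b and w.
        grad_b = sum(errors) / m
        grad_w = sum(e * xi for e, xi in zip(errors, x)) / m
        # Move in the direction of the negative gradient.
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w

# Hypothetical data generated from y = 1 + 2x with no noise.
x_gd = [1.0, 2.0, 3.0, 4.0, 5.0]
y_gd = [3.0, 5.0, 7.0, 9.0, 11.0]

b_gd, w_gd = gradient_descent(x_gd, y_gd)
print("bias:", round(b_gd, 4), "weight:", round(w_gd, 4))
```

On this data a much larger learning rate (for example 0.5) makes the updates grow instead of shrink, which is the overshooting described above.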

Logistic Regression

The goal is to model the relationship between the dependent and independent variables linearly for a classification problem. It is similar to linear regression, but the linear output is passed through a sigmoid function. This ensures that the final prediction falls within the range of 0 to 1, so it can be interpreted as the probability of the positive class.

The model is fitted by finding the weights that minimize the log loss, which measures the difference between the actual labels and the predicted probabilities.
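
The two building blocks, the sigmoid and the log loss, can be sketched as follows. This only shows how a prediction and its loss are computed, not how the weights are fitted; the function names and example probabilities are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number into the range (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Linear combination of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def log_loss(actual, probabilities, eps=1e-15):
    """Average negative log-likelihood of the actual 0/1 labels."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

labels = [1, 0, 1]
good_probs = [0.9, 0.1, 0.8]   # confident and correct
bad_probs = [0.2, 0.7, 0.4]    # mostly wrong

print("log loss (good):", log_loss(labels, good_probs))
print("log loss (bad) :", log_loss(labels, bad_probs))
```

Predictions that are confident and correct give a low log loss, while confident wrong predictions are penalized heavily.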

Evaluation Metrics for Classification Problems

Accuracy: Ratio of correct predictions: (TP + TN) / (TP + TN + FP + FN)

Precision: Success rate of positive class predictions: TP / (TP + FP)

Recall: Rate of correct predictions for the positive class: TP / (TP + FN)

F1 Score: 2 * (Precision * Recall) / (Precision + Recall)

Precision focuses on how many of the positive predictions are correct, while Recall focuses on how many of the actual positives are captured. Both are important, and the F1 score, their harmonic mean, balances both values.
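
The four formulas can be checked on a small example. The confusion-matrix counts below are hypothetical and describe an imbalanced problem with 10 actual positives out of 100 observations; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: the model finds only half of the 10 positives.
metrics = classification_metrics(tp=5, tn=90, fp=0, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks high here even though half of the positives are missed, which recall and the F1 score make visible.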

Classification Threshold

Let’s say our threshold is 0.50. Predicted probabilities higher than the threshold are assigned to class 1, and those lower than the threshold are assigned to class 0. When the threshold changes, accuracy, recall, precision, F1 score, etc. all change. To evaluate the model across thresholds rather than at a single one, an ROC curve is used.

ROC Curve

The ROC curve is built by generating all possible thresholds and calculating the confusion matrix for each. For every threshold, the True Positive Rate is plotted against the False Positive Rate, and the resulting points form the curve.

Area Under Curve (AUC)

It is a single numerical expression of the ROC curve. It represents the area under the ROC curve. AUC is an aggregate performance measure for all possible classification thresholds.
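
A minimal sketch of both ideas: one ROC point per threshold, and the area under those points using the trapezoid rule. The labels and scores are hypothetical, and the code assumes both classes are present in `actual`.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to loose."""
    thresholds = sorted(set(scores), reverse=True)
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in thresholds:
        tp = sum(1 for y, s in zip(actual, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(actual, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true labels and predicted probabilities.
actual_labels = [1, 1, 0, 1, 0, 0]
predicted_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(actual_labels, predicted_scores)
print("ROC points:", curve)
print("AUC:", auc(curve))
```

An AUC of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random ordering.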

When it comes to classification problems, the first thing to pay attention to is whether the class distribution is imbalanced. If it is imbalanced, accuracy alone can be misleading, so we look at recall, precision, and F1 score. After that, we examine the AUC values.

K-Nearest Neighbours (KNN)

Prediction is made based on the similarity of observations to each other. (Tell me your friends, and I’ll tell you who you are.)

For a given observation, the distance to every other observation is calculated using Euclidean or a similar distance metric, and the k nearest observations are selected. The dependent variable is then predicted from these k neighbours: for regression, the average of their dependent variable values is taken and assigned to the unknown observation.

If we are going to use KNN for classification, instead of the average of the k nearest observations, we choose the most frequent class among them and assign it to the unknown observation.
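
Both variants can be sketched in plain Python. The training points, prices, and class labels below are made up for illustration, and a real project would typically scale the features first and use a library implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest_targets(train_X, train_y, query, k):
    """Return the targets of the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average the targets of the k nearest neighbours."""
    neighbours = k_nearest_targets(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)

def knn_classify(train_X, train_y, query, k=3):
    """Classification: take the most frequent class among the k nearest neighbours."""
    neighbours = k_nearest_targets(train_X, train_y, query, k)
    return Counter(neighbours).most_common(1)[0][0]

# Hypothetical training data with two features per observation.
train_X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [9.0, 8.5]]
prices = [100.0, 110.0, 120.0, 300.0, 320.0]
classes = ["cheap", "cheap", "cheap", "expensive", "expensive"]

query = [1.2, 1.4]
print("predicted price:", knn_regress(train_X, prices, query, k=3))
print("predicted class:", knn_classify(train_X, classes, query, k=3))
```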

I explained everything in more detail, with code and explanations, in my Kaggle notebook.

To Be Continued…
