Machine Learning Essentials: Part 3

Huseyin Baytar
Dec 2, 2023

Unsupervised Learning

If the dataset of interest has no dependent (target) variable, meaning there is no information about the outcome of each observation, the task is unsupervised learning.

source: techvidvan


The objective is to cluster observations based on their similarities. This resembles a classification problem in that similar observations are grouped together, but unlike classification, the classes are not predefined here: similar observations form a class on their own, without predefined categories.

How does K-Means work?

Step 1: The number of clusters is determined.

Step 2: K random centers are selected.

Step 3: Distances to the K centers are calculated for each observation.

Step 4: Each observation is assigned to the nearest center or cluster.

Step 5: After the assignment process, new center calculations are made for the resulting clusters.

Step 6: This process is repeated for a specified number of iterations, and the clustering with the minimum within-cluster sum of squared errors is selected as the final result.
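The steps above can be sketched from scratch. This is a minimal, purely illustrative NumPy implementation (a library choice assumed here, not named in the article): it uses a fixed iteration count and no convergence check.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random observations as initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):  # Step 6: repeat for a fixed number of iterations
        # Step 3: distance of every observation to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each observation to its nearest center
        labels = d.argmin(axis=1)
        # Step 5: recompute each center as the mean of its cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Two well-separated synthetic blobs around x = 0 and x = 5
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
centers, labels = kmeans(X, k=2)
print(centers)  # one center near each blob mean
```

Real implementations also restart from several random initializations and keep the run with the lowest SSE, since a bad initial draw of centers can get stuck in a poor local optimum.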

In summary: K random centers are chosen, distances to these centers are calculated, and each observation is assigned to its nearest center, forming clusters. New centers are then computed within each cluster, observations are reassigned to the nearest centers, and the centers are recalculated from the distances. This iteration is repeated a specified number of times. The goal is to make the clusters homogeneous internally while being heterogeneous with respect to each other, which is measured mathematically with metrics such as SSE (Sum of Squared Errors) or SSR (Sum of Squared Residuals). The within-cluster sum of squared errors is calculated by squaring the differences between the cluster center and the observations around it and summing them. To minimize this error, the positions of the cluster centers are adjusted, and the configuration that achieves the lowest SSE is chosen.
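In practice this whole loop is usually delegated to a library. A minimal sketch with scikit-learn (an assumed library choice): `n_init` reruns the algorithm from several random initializations and keeps the run with the lowest within-cluster SSE, exposed as `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic blobs around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # final cluster centers
print(km.inertia_)          # within-cluster sum of squared errors (SSE)
print(km.labels_[:5])       # cluster assignment of the first observations
```

Plotting `inertia_` for a range of `n_clusters` values is the usual "elbow" heuristic for choosing K in Step 1.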

Hierarchical Clustering

Our goal is to partition observations into subgroups based on their similarities and this is achieved through Hierarchical Clustering. There are two main approaches: Agglomerative (bottom-up) and Divisive (top-down).

In the Agglomerative method, each observation starts as its own cluster at the beginning of the process; as the algorithm progresses, the most similar clusters are merged pairwise, moving upward to create larger clusters. The merging is repeated until all observations belong to a single cluster.

On the other hand, in the Divisive method all observations initially form a single cluster. As the algorithm proceeds, clusters are split into smaller clusters in a top-down manner until each observation becomes its own cluster.

The main difference from K-Means is external control: in K-Means we could not intervene in how observations were grouped, whereas in Hierarchical Clustering we can visually inspect the dendrogram and create new clusters by drawing a line across it at a chosen height.
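A minimal agglomerative sketch with SciPy (an assumed library choice): `linkage` builds the bottom-up merge tree, and `fcluster` performs the "drawing a line" step, cutting the tree so that a chosen number of clusters remains.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic blobs around x = 0 and x = 4
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

Z = linkage(X, method="ward")  # agglomerative (bottom-up) merge tree
# Cut the tree where it yields exactly 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))
```

With `scipy.cluster.hierarchy.dendrogram(Z)` the same tree can be plotted, which is where the visual inspection described above happens.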

Principal Component Analysis

The fundamental idea is to represent the main features of multivariate data with fewer variables, accepting a small amount of information loss in exchange for reduced dimensionality.


For instance, in a linear regression problem you may want to overcome the problem of multicollinearity.

In facial recognition problems, there might be a need to apply filters to images or similar reasons where dimensionality reduction is necessary.

Additionally, it is used for visualizing multidimensional data.

PCA reduces the dataset to components expressed as linear combinations of the original variables; by construction, these components are uncorrelated with one another. The variance captured by each component is given by its eigenvalue, and the components are ranked accordingly: those with the highest variances are the most important.
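A minimal PCA sketch with scikit-learn (an assumed library choice), built around the multicollinearity example mentioned above: two nearly collinear variables collapse onto one dominant component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 200)
x2 = 2 * x1 + rng.normal(0, 0.1, 200)  # nearly collinear with x1

# Standardize first: PCA is sensitive to variable scales
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component carries almost all variance
Z = pca.transform(X)                  # uncorrelated component scores
```

Keeping only the first component here would remove the multicollinearity while losing very little information, which is exactly the trade-off PCA offers.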

I explained everything in more detail, with code and explanations, in my Kaggle notebook;

To Be Continued…