Classification

Sandun Dayananda
4 min read · Apr 9, 2024


A mathematical and machine learning approach


Machine learning algorithms are transforming the way we analyse data and make predictions. One of the fundamental tasks in machine learning is classification, where the goal is to assign predefined labels or categories to input data points based on their features. Classification has widespread applications, from spam detection in emails to medical diagnosis and sentiment analysis. In this article, we will explore the essence of classification in machine learning, understand different algorithms, and provide code samples to illustrate how they are used.

Classification is a supervised learning technique that involves mapping input data to predefined classes or categories. The input data, often referred to as features or attributes, are represented as numerical values, and the corresponding labels indicate the class or category to which each data point belongs. The process of classification involves training a model using labelled data to predict the class labels of new, unseen data points.

Let’s dive into some of the most widely used classification algorithms. Classification algorithms can be divided into the following three main types according to their behaviour:

  • Binary Classification
  • Multi-Class Classification
  • Multi-Label Classification

Binary Classification

Binary classification is used for classification tasks that have only two class labels (two classes). There are several algorithms for binary classification. (Note: logistic regression and SVM natively support only two classes, but they can be extended to multi-class problems as well, e.g., via one-vs-rest schemes. The other algorithms can be used directly when there are more classes.)

  • Logistic Regression
  • Support Vector Machines (SVM)
  • KNN (K-Nearest Neighbours)
  • Decision Trees
  • Naive Bayes

Logistic Regression:

Despite its name, logistic regression is a linear classification algorithm that models the probability of a data point belonging to a specific class. It applies the logistic (sigmoid) function to a linear combination of the input features, mapping the continuous output to a probability between 0 and 1; thresholding that probability (typically at 0.5) yields the binary decision.
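
Here is a minimal sketch of logistic regression with scikit-learn; the synthetic dataset and parameter values are illustrative choices, not tied to any particular application:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic binary dataset (two classes by default)
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns the modelled class probabilities
print("Class probabilities (first test row):", model.predict_proba(X_test[:1]))
print("Test accuracy:", model.score(X_test, y_test))
```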

Support Vector Machines (SVM):

Support Vector Machines (SVM) is a powerful classification algorithm. It works by finding the optimal hyperplane that separates the different classes in a dataset: the decision boundary that separates the data points of the different classes with the largest possible margin. The training points closest to that boundary are the support vectors that give the algorithm its name.
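
A minimal sketch with scikit-learn’s SVC follows; the linear kernel and C value are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# C trades off a wide margin against misclassified training points
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))
```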

Decision Trees:

Decision trees create a hierarchical structure of decisions based on feature values to arrive at class predictions. Each internal node represents a decision based on a feature, and each leaf node represents a class label. Decision trees are intuitive and can handle both numerical and categorical features.
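
As a minimal sketch, a depth-limited decision tree on scikit-learn’s built-in breast cancer dataset (the depth limit is an illustrative choice to curb overfitting):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # a binary (two-class) dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting the depth keeps the hierarchy of decisions small and interpretable
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
```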

KNN (K-Nearest Neighbours):

K-Nearest Neighbours (KNN) is a simple yet powerful algorithm used for both classification and regression tasks in machine learning. It operates on the principle of similarity, where it assigns a label to an instance based on the labels of its k nearest neighbours in the feature space.

In KNN, the training data consists of labelled instances with their corresponding features. During the training phase, the algorithm stores the entire training dataset, making it a lazy learner. When a new unlabelled instance needs to be classified, KNN identifies its k nearest neighbours by measuring the distance (e.g., Euclidean or Manhattan distance) between the instance and all other instances in the training data.
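
A minimal KNN sketch is below; because KNN relies on distances, the features are standardised first, and k = 5 is an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters because KNN measures (Euclidean) distances between points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)  # a lazy learner: fit essentially stores the data

print("Test accuracy:", knn.score(X_test, y_test))
```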

Naive Bayes:

Naive Bayes is a classification algorithm based on Bayes’ theorem, which is a fundamental concept in probability theory. The algorithm assumes that the features are conditionally independent of each other given the class labels, hence the term “naive.”

In simple terms, Naive Bayes calculates the probability of an instance belonging to a particular class by considering the probabilities of its features given that class. It then selects the class with the highest probability as the predicted class for that instance.
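
A minimal sketch with Gaussian Naive Bayes, which assumes each feature follows a normal distribution within each class (the dataset choice is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()  # "naive" conditional-independence assumption between features
nb.fit(X_train, y_train)

# The predicted class is the one with the highest posterior probability
print("Posterior probabilities (first test row):", nb.predict_proba(X_test[:1]))
print("Test accuracy:", nb.score(X_test, y_test))
```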

Multi-Class Classification

Multi-class classification algorithms are used when there are more than two class labels (classes) in the classification task. In other words, multi-class algorithms categorize the given data into several groups.

  • Decision Trees
  • Random Forest
  • KNN (K-Nearest Neighbours)
  • Naive Bayes
  • Gradient Boosting

Random Forests:

Random forests are an ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce overfitting. The algorithm randomly selects subsets of features and data samples to build each tree, reducing the correlation between the trees.
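
A minimal sketch on the three-class Iris dataset (the number of trees is an illustrative default):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three classes, so a multi-class task
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
```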

Gradient Boosting:

Gradient Boosting is a powerful machine learning technique used for both classification and regression tasks. It belongs to the family of ensemble methods that combine multiple weak models to create a strong predictive model.

In Gradient Boosting, the idea is to iteratively train a sequence of weak models, typically decision trees, where each subsequent model is built to correct the mistakes made by the previous models. The key idea behind this technique is to focus on the instances that are challenging to classify correctly.
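
A minimal gradient boosting sketch, again on Iris; the number of estimators and the learning rate are illustrative values:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fitted to the errors (loss gradients) of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gb.fit(X_train, y_train)

print("Test accuracy:", gb.score(X_test, y_test))
```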

Multi-Label Classification

Multi-label algorithms are used when a single sample in the data can fall under several labels at once. That means a given data point can have more than one class (label), e.g., a news article tagged as both “politics” and “economy”. A sketch follows the list below.

  • Multi-label decision trees
  • Multi-label random forests
  • Multi-label gradient boosting
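
As a minimal sketch, scikit-learn’s tree-based classifiers accept a two-dimensional label matrix directly, so a random forest can be trained on synthetic multi-label data (all parameters are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data where each sample can carry any subset of 3 labels
X, Y = make_multilabel_classification(n_samples=500, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)  # Y is a binary indicator matrix, one column per label

# Each predicted row is a 0/1 vector: a sample can have several labels at once
print(clf.predict(X_test[:3]))
```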

Evaluating Classification Models

Evaluating the performance of a classification model is crucial to understanding how well it will work in practice. Common evaluation metrics are accuracy, precision, recall, and F1-score. Additionally, confusion matrices and receiver operating characteristic (ROC) curves are useful for visualising and analysing model performance.
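
Here is a minimal sketch that prints a confusion matrix and the per-class precision, recall, and F1-score for a logistic regression model (the dataset and pipeline are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows = true classes, columns = predictions
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```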

Handling Imbalanced Data

In real-world scenarios, datasets often suffer from class imbalance, where one class has significantly fewer instances than the others. This can lead to biased models. Techniques like oversampling, undersampling, and SMOTE (Synthetic Minority Over-sampling Technique) can be employed to address class imbalance and improve classification results.
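
As a minimal sketch of SMOTE, assuming the third-party imbalanced-learn package is installed (pip install imbalanced-learn); the 90/10 class split is an illustrative assumption:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced data: roughly 90% of samples fall in class 0
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before SMOTE:", Counter(y))

# SMOTE synthesises new minority-class samples by interpolating between neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))
```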


Sandun Dayananda

Big Data Engineer with a passion for Machine Learning and DevOps | MSc Industrial Analytics at Uppsala University