Logistic Regression

Sandun Dayananda
13 min read · Jul 22, 2023


Mathematical and machine learning approach

Linear regression is a fundamental tool for predicting continuous outcomes, but what if our problem involves classifying data into discrete categories? Then we have to use logistic regression. Logistic regression is a powerful technique that extends the ideas of linear regression to classify data into groups. In this article, we will derive logistic regression from linear regression and cover the mathematical background.

Let’s consider the following graph.

linear distribution

Here I have plotted the regression line (red colour) for a dataset. Let’s assume we need to allocate these data points into two groups based on their location relative to the regression line: above it or below it. Then we can recolour the data points to see the two groups more clearly, as in the following image.

grouped data points

Now we can see both groups separately. What if we want to assign a single higher value to all the blue data points and a single lower value to all the orange data points? This is something like tying weights to the orange data points to sink them to the bottom of a tank and attaching air balloons to the blue data points so they float up to the surface.

classifying into groups

Now, all the blue data points share the same higher value and all the orange data points share the same lower value. The process should look like the following picture.

classifying into groups

What if we plot these transformed data points in a graph and draw a regression line through them? It should look like the following picture.

logistic regression line

Now we have an idea of how logistic regression works. Let’s see what kind of mathematical operation is used to classify data into two levels.

Probability

If P is the probability of an event happening, then 1 - P is the probability of the event not happening.

In probability and statistics, odds are a way of expressing the likelihood of an event occurring. In the context of logistic regression, the odds are the ratio of the probability of a certain outcome to the probability of the complementary outcome. We can write this mathematically as follows.

odds = P / (1 - P)
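For example, if P = 0.8, the odds are 0.8 / 0.2 = 4, meaning the event is four times as likely to happen as not.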

Transforming Linear Regression to Logistic Regression

For a linear regression model, the equation is:

y = β0 + β1x1 + β2x2 + … + βnxn

In linear regression, we assume a linear relationship between the dependent variable (y) and the independent variables or features (x1, x2, …, xn). The coefficient β0 is the intercept, and β1, β2, …, βn are the slopes that describe the linear relationship.

In statistics and regression analysis, the response variable (Y) is the variable we want to predict or model. In some cases, the response variable Y may be categorical, such as binary (e.g., yes/no) or multi-class (e.g., low/medium/high). However, most regression models are designed to work with continuous response variables, so we need a link function to transform the categorical response into a continuous quantity suitable for regression modelling.

Let’s consider a binary categorical response variable Y, which can take two values: 0 and 1. Instead of using Y directly as the response, we use the logit of Y, denoted as g(Y), in the regression equation.

The logit function is defined as the logarithm of the odds of the positive outcome:

g(Y) = ln(Y / (1 - Y))

In logistic regression, we assume that the logit of Y is linearly related to the predictors:

g(Y) = β0 + β1X1 + β2X2 + … + βnXn

Here, β0, β1, β2, …, βn are the regression coefficients, and X1, X2, …, Xn are the predictors.

By using the logit of Y as the response variable, we can model the linear relationship between the predictors and the log-odds of the positive outcome. The regression coefficients (β0, β1, β2, …, βn) estimate the effect of each predictor on the log-odds, indicating how the predictors contribute to the probability of the positive outcome.

Through this transformation, we can apply traditional regression techniques to estimate the coefficients and make predictions based on the log-odds of the positive outcome. Now, we can work through the derivation step by step.

Start with the logit equation, where p is the probability of the positive outcome:

ln(p / (1 - p)) = β0 + β1X1 + β2X2 + … + βnXn

Apply the exponential function to both sides to eliminate the logarithm, writing y for the linear combination on the right-hand side, and then solve for p:

p / (1 - p) = e^y

p = e^y - p·e^y

p(1 + e^y) = e^y

p = e^y / (1 + e^y) = 1 / (1 + e^(-y))

The final equation is called the Sigmoid function (also known as the logistic function or inverse logit). We can also consider it as the hypothesis function, h(X) = p.

The following is the generalized form of the sigmoid function (σ(z) = h(X) = p):

σ(z) = 1 / (1 + e^(-z))

Here z = y, the linear combination β0 + β1X1 + … + βnXn.

In logistic regression we classify data using this sigmoid function. Now we have the full picture of how logistic regression works.
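In code, this amounts to feeding a linear combination of the features through the sigmoid. A minimal sketch of the idea (the function names and coefficients here are illustrative, not taken from a fitted model or a particular library):

import numpy as np

def sigmoid(z):
    # Map any real value into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

def predict_probability(X, beta0, beta):
    # z is the linear combination beta0 + beta1*x1 + ... + betan*xn for each row of X
    z = beta0 + X @ beta
    return sigmoid(z)

# Two data points with two features each, and hypothetical coefficients
X = np.array([[1.0, 2.0],
              [-1.5, 0.5]])
p = predict_probability(X, beta0=0.1, beta=np.array([0.8, -0.4]))
print(p)  # probabilities between 0 and 1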

Decision Boundary

The decision boundary is a concept used in machine learning to separate different classes or categories in a classification problem. It is a mathematical representation of the dividing line or surface that separates the data points belonging to different classes.

The sigmoid function, specifically the logistic sigmoid function, is often used as an activation function in binary classification problems. It maps the input values to a range between 0 and 1, which can be interpreted as probabilities.

In the context of classification, the sigmoid function is typically used to transform a linear combination of features and their corresponding weights into a probability value. If the resulting probability is above a certain threshold (often 0.5), the data point is classified into one class, and if it is below the threshold, it is classified into the other class.

The decision boundary, in this case, is the line or surface in the input space where the probability of belonging to one class is equal to the probability of belonging to the other class. In a binary classification problem, this decision boundary is often at a probability threshold of 0.5.

In other words, when the sigmoid function output is above 0.5, the data point is classified as one class, and when the sigmoid function output is below 0.5, the data point is classified as the other class. The decision boundary is the boundary in the input space where the sigmoid function output is exactly 0.5.

It’s important to note that the relationship between the decision boundary and the sigmoid function is not always linear. The decision boundary can be linear or nonlinear depending on the complexity of the problem and the model used. In more complex cases, where a linear decision boundary is insufficient, other activation functions or more complex models may be used to capture nonlinear decision boundaries.
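For a linear model this means the sigmoid output equals 0.5 exactly where the linear combination z equals 0, so the decision boundary is the set of points satisfying β0 + β1x1 + β2x2 = 0. A minimal sketch with hypothetical (not fitted) coefficients:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients for two features
beta0, beta1, beta2 = -1.0, 2.0, 0.5

def classify(x1, x2, threshold=0.5):
    # The decision boundary is the line beta0 + beta1*x1 + beta2*x2 = 0
    z = beta0 + beta1 * x1 + beta2 * x2
    return int(sigmoid(z) >= threshold)

print(classify(1.0, 0.0))  # z = 1.0  -> sigmoid(z) ≈ 0.73 -> class 1
print(classify(0.0, 0.0))  # z = -1.0 -> sigmoid(z) ≈ 0.27 -> class 0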

Cost Function (Loss Function), Maximum Likelihood and Gradient Descent

Cost function

The cost function, also known as the loss function, is used to evaluate the performance of the logistic regression model. In logistic regression we use log loss (binary cross-entropy) as the cost function. It quantifies the discrepancy between the predicted probabilities and the actual outcomes.

The cost function is derived from the negative log-likelihood, i.e. the negative of the logarithm of the likelihood function.

The goal is to minimize the cost function, indicating a better fit of the model to the data (we minimize the cross-entropy loss to find the optimal parameters for the logistic regression model).

Mathematically, the cross-entropy loss function for binary logistic regression is given by:

J(θ) = -(1/m) Σᵢ [ yᵢ ln(ŷᵢ) + (1 - yᵢ) ln(1 - ŷᵢ) ]

Here, J(θ) represents the cross-entropy loss, m is the number of data points, yᵢ is the actual binary outcome (0 or 1) for the ith data point, and ŷᵢ is the predicted probability of the positive outcome for the ith data point.
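As a minimal NumPy sketch of this formula (the function name and example values are illustrative only):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    # Clip the predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # smaller value = better fit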

Maximum Likelihood (MLE: Maximum Likelihood Estimation)

The likelihood function measures the goodness of fit of the model by quantifying the probability of observing the given set of outcomes (binary responses) given the predictor variables.

Mathematically, the likelihood function is defined as the product of the probabilities of observing the actual outcomes (y_i) for each data point (i) in the dataset, given the predicted probabilities (p_i) from the logistic regression model:

L(θ) = Πᵢ pᵢ^(yᵢ) · (1 - pᵢ)^(1 - yᵢ)

Here, pᵢ represents the predicted probability of the positive outcome for data point i, and yᵢ is the actual outcome, either 0 or 1.
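Taking the logarithm turns this product into a sum, and negating it and averaging over the m data points gives exactly the cross-entropy loss defined above:

ln L(θ) = Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]

J(θ) = -(1/m) · ln L(θ)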

Gradient Descent

Gradient descent is used as an optimization algorithm to find the optimal parameter values that maximize the likelihood (minimize the cost).

Minimizing the cost function is equivalent to maximizing the likelihood of the observed outcomes.
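A bare-bones gradient-descent sketch for logistic regression could look like the following (a simplified illustration rather than a production implementation; the learning rate, iteration count and variable names are arbitrary choices). It also records the cost at every iteration so the convergence can be inspected:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)   # feature weights
    bias = 0.0            # intercept term
    cost_history = []
    for _ in range(n_iters):
        p = sigmoid(X @ theta + bias)   # predicted probabilities
        d_theta = (X.T @ (p - y)) / m   # gradient of the log loss w.r.t. theta
        d_bias = np.mean(p - y)         # gradient w.r.t. the bias
        theta -= lr * d_theta
        bias -= lr * d_bias
        eps = 1e-15
        cost = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        cost_history.append(cost)       # track the cost at every iteration
    return theta, bias, cost_history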

Let’s try to visualize the quantities above using a simple dataset that has two features and two classes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Set the random seed for reproducibility
np.random.seed(42)

# Define the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate a dataset with two concentrations
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# Plot the dataset
plt.figure(figsize=(12, 4))
plt.subplot(141)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Dataset with Two Concentrations')
plt.colorbar()

# Fit logistic regression to the dataset
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)

# Plot the decision boundary
plt.subplot(142)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = logistic_regression.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Decision Boundary')
plt.colorbar()

# Plot sigmoid curve
plt.subplot(143)
x_values = np.linspace(-10, 10, 100)
sigmoid_values = 1 / (1 + np.exp(-x_values))
plt.plot(x_values, sigmoid_values)
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.title('Sigmoid Curve')

# Plot the cost function against sigmoid(x)=1 and sigmoid(x)=0
plt.subplot(144)
sigmoid_1 = sigmoid(x_values)
cost_sigmoid_1 = -np.log(sigmoid_1)
cost_sigmoid_0 = -np.log(1 - sigmoid_1)
plt.plot(sigmoid_1, cost_sigmoid_1, label='sigmoid(x)=1')
plt.plot(sigmoid_1, cost_sigmoid_0, label='sigmoid(x)=0')
plt.xlabel('sigmoid(x)')
plt.ylabel('Cost')
plt.title('Cost Function')
plt.legend()

plt.tight_layout()
plt.show()

(You can find the full code in my GitHub.)

applying binary logistic regression

Try to compare these plots with the theory explained above. You can clearly observe that the cost is minimized as the predicted probability approaches the true class value (0 or 1). The sigmoid curve (hypothesis function) represents the probability assigned to each data point.

Model evaluation techniques

Cost vs Iterations (epochs) plot

We can plot the cost against the training iterations. After each iteration, the cost should decrease, so the graph should look like the following.

cost vs iterations

If the plot shows this behaviour, the model can be considered to be converging well.
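One way to produce such a plot is to train the model incrementally and record the log loss after every pass over the data. A sketch using scikit-learn’s SGDClassifier with a logistic loss (note: in older scikit-learn versions the loss is named 'log' rather than 'log_loss'; the epoch count and learning rate are arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

X, y = make_blobs(n_samples=200, centers=2, random_state=42)

clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.01, random_state=42)
costs = []
for epoch in range(50):
    clf.partial_fit(X, y, classes=np.array([0, 1]))   # one pass over the data
    costs.append(log_loss(y, clf.predict_proba(X)))   # cost after this epoch

plt.plot(costs)
plt.xlabel('Iteration (epoch)')
plt.ylabel('Cost (log loss)')
plt.title('Cost vs Iterations')
plt.show()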

Also, we can use the following metrics to evaluate the model. Let’s assume we have already predicted probabilities for an existing set of labelled values. We can use the following code to calculate the metrics and then evaluate them.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

# Generate random predicted probabilities and true labels
np.random.seed(0)
num_samples = 1000
predicted_probs = np.random.rand(num_samples)
true_labels = np.random.randint(2, size=num_samples)

# Threshold the predicted probabilities to obtain binary predictions
predictions = (predicted_probs >= 0.5).astype(int)

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, predictions)
precision = precision_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)
cm = confusion_matrix(true_labels, predictions)
auc_roc = roc_auc_score(true_labels, predicted_probs)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
print("Confusion Matrix:")
print(cm)
print("AUC-ROC score:", auc_roc)

(You can find the full code in my GitHub.)

evaluation metrics

Accuracy

Accuracy measures the overall correctness of the predictions. It is the ratio of the number of correct predictions to the total number of predictions. In this case, the accuracy is approximately 0.511, indicating that around 51.1% of the predictions were correct.

Precision

Precision is the proportion of true positive predictions out of all positive predictions. It measures the accuracy of positive predictions. In this case, the precision is approximately 0.536, indicating that around 53.6% of the positive predictions were correct.

Recall

Recall, also known as sensitivity or true positive rate, is the proportion of true positive predictions out of all actual positive instances. It measures the ability of the model to identify positive instances. In this case, the recall is approximately 0.494, indicating that around 49.4% of the actual positive instances were correctly identified.

F1 score

The F1 score is the harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall. In this case, the F1 score is approximately 0.514, indicating a trade-off between precision and recall.

Confusion Matrix

The confusion matrix is a table that shows the true positives, true negatives, false positives, and false negatives.

The matrix shows that there are 252 true negatives (predicted negative and actually negative), 224 false positives (predicted positive but actually negative), 265 false negatives (predicted negative but actually positive), and 259 true positives (predicted positive and actually positive).
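To connect these counts to the scores reported above, the same metrics can be recomputed by hand from the four entries of the confusion matrix:

# Counts taken from the confusion matrix above
TN, FP, FN, TP = 252, 224, 265, 259

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                   # fraction of positive predictions that are correct
recall = TP / (TP + FN)                      # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)       # approximately 0.511, 0.536, 0.494, 0.514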

AUC-ROC score

The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) score measures the model’s ability to discriminate between positive and negative instances across different classification thresholds. It ranges from 0 to 1, where 0.5 represents a random classifier and 1 represents a perfect classifier. In this case, the AUC-ROC score is approximately 0.505, indicating a weak classifier that performs only slightly better than random guessing.
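The ROC curve behind this score can also be plotted directly. A short continuation of the evaluation snippet above, reusing its true_labels and predicted_probs arrays:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(true_labels, predicted_probs)
plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend()
plt.show()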

These evaluation metrics provide insights into the performance of the binary classification model. Accuracy, precision, recall, and F1 score are commonly used to evaluate the model’s performance, while the confusion matrix provides a detailed breakdown of the predictions. The AUC-ROC score measures the overall discriminatory power of the model.

Now we have learned all the basic concepts of binary logistic regression. What if the data is not binary (i.e., has several classes)? In that case we use multiclass logistic regression.

Multiclass logistic regression

Multiclass logistic regression with softmax activation is an extension of binary logistic regression that allows us to classify data into more than two classes. It is a popular algorithm used in machine learning for multi-class classification problems. To understand multiclass logistic regression, let’s first review the basics of logistic regression.

Logistic regression is a binary classification algorithm that predicts the probability of an input belonging to a particular class. It uses the logistic function (also known as the sigmoid function) to model the relationship between the input features and the binary output. The logistic function maps any real-valued number to a value between 0 and 1, making it suitable for probability estimation.

The logistic regression model can be represented as:

logit(p) = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

Here, logit(p) is the log-odds of the probability p of the positive class, β₀ is the intercept term, β₁, β₂, ..., βₚ are the coefficients for the input features x₁, x₂, ..., xₚ.

To extend logistic regression to multiclass classification, we use the softmax activation function instead of the sigmoid function. The softmax function generalizes the sigmoid function to multiple classes and outputs a probability distribution over all classes. It ensures that the predicted probabilities sum up to 1.

The softmax function can be defined as follows:

softmax(zᵢ) = exp(zᵢ) / Σⱼ(exp(zⱼ))

Here, zᵢ represents the logit for class i, and zⱼ represents the logit for class j. The softmax function exponentiates the logits and normalizes them by dividing each exponentiated value by the sum of all exponentiated values.

In multiclass logistic regression, we aim to learn the coefficients β₀, β₁, β₂, ..., βₚ for each class, similar to binary logistic regression. The model predicts the probability of an input belonging to each class and assigns it to the class with the highest probability.
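A small NumPy sketch of the softmax step and the final class assignment (the logits are hypothetical values, not outputs of a trained model):

import numpy as np

def softmax(z):
    # Subtracting the maximum improves numerical stability without changing the result
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits for three classes
probs = softmax(logits)
print(probs)                          # probability distribution that sums to 1
print(np.argmax(probs))               # predicted class: the one with the highest probability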

To illustrate multiclass logistic regression, let’s consider a simple example with two input features (x₁ and x₂) and three classes.

import numpy as np
import matplotlib.pyplot as plt

# Generate some random data
np.random.seed(42)
num_samples = 100
x1 = np.random.normal(loc=-2, scale=1, size=num_samples)
x2 = np.random.normal(loc=1, scale=1, size=num_samples)
X = np.vstack((x1, x2)).T
y = np.random.randint(low=0, high=3, size=num_samples)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Multiclass Logistic Regression - Data')
plt.show()

(You can find the full code in my GitHub.)

The code above generates a random dataset with two input features (x₁ and x₂) and assigns labels 0, 1, and 2 to three classes. The scatter plot visualizes the data points with different colours representing each class.

multiclass logistic regression data

To perform multiclass logistic regression, we can utilize libraries like scikit-learn in Python. Here’s an example using scikit-learn’s LogisticRegression class with softmax activation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model with softmax activation
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

# Fit the model on the training data
model.fit(X, y)

# Plot the decision boundaries
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
Z = model.predict(np.c_[xx1.ravel(), xx2.ravel()])
Z = Z.reshape(xx1.shape)

plt.contourf(xx1, xx2, Z, cmap=plt.cm.Set1, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Multiclass Logistic Regression - Decision Boundaries')
plt.show()

(You can find the full code in my GitHub.)

The code above creates a LogisticRegression model with the 'multinomial' option for multi-class classification and uses the 'lbfgs' solver to optimize the model parameters. The model is then trained on the input features X and their corresponding labels y. Finally, the decision boundaries are plotted along with the data points, as shown below.

multiclass logistic regression visualization

This example demonstrates how multiclass logistic regression with softmax activation extends the binary logistic regression concept to handle multiple classes. The softmax function enables us to obtain a probability distribution over all classes, and the model predicts the class with the highest probability for each input.

Keep in mind that this is just a simple illustration, and in practice, you may encounter more complex datasets with higher-dimensional features. However, the underlying principles of multiclass logistic regression and softmax activation remain the same.

