Before we discuss polynomial regression, you should have a solid understanding of regression in general and of linear regression in particular, because polynomial regression builds directly on both. You can review these topics in the article Regression and Linear Regression.
Polynomial regression is a supervised machine learning method for regression: it predicts a continuous target value, not a class label.
Polynomial regression is a form of regression analysis that models the relationship between an independent variable and a dependent variable with an nth-degree polynomial equation. It works almost like linear regression, but it can model a much broader range of relationships between the variables. In polynomial regression, the plotted relationship is curvilinear.
The degree of the polynomial is the highest exponent of the independent variable in the equation. With several independent variables, it is the largest sum of exponents within any single term. A higher-degree polynomial can fit the training data points more closely. However, it is crucial to be cautious, as an excessively high degree can lead to overfitting, where the model becomes too specific to the training data and performs poorly when applied to new data.
The techniques for performing polynomial regression are similar to those employed in linear regression. The coefficients of the polynomial equation can be estimated using the least squares method, which minimizes the sum of squared errors between the predicted values and the actual values. By finding the optimal coefficients, the polynomial regression model can effectively capture and represent the relationship between the variables.
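As a minimal sketch of the least squares step described above, the snippet below fits a 2nd-degree polynomial with NumPy alone. The data is hypothetical and noise-free (generated from y = 1 + 2x + 3x²), and the names `design` and `coeffs` are chosen only for this illustration.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 20)
y = 1 + 2 * x + 3 * x**2

# Design matrix whose columns are x^0, x^1, x^2
design = np.vander(x, N=3, increasing=True)

# Least squares: find the coefficients that minimize the sum of
# squared errors between design @ coeffs and y
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)

print(coeffs)  # close to [1, 2, 3], up to floating-point error
```

Because the data contains no noise, the recovered coefficients match the generating equation. With noisy data, they would only approximate it.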
Linear regression and polynomial regression are both techniques used in regression analysis to model the relationship between independent and dependent variables. While linear regression assumes a linear relationship between the variables, polynomial regression allows for more flexible modeling by incorporating polynomial functions.
In linear regression, the relationship between the independent variable (X) and the dependent variable (Y) is represented by a straight line. The equation for linear regression can be written as:
Y = β₀ + β₁X + ɛ
Here, Y represents the dependent variable, X represents the independent variable, β₀ and β₁ are the coefficients (the intercept and the slope), and ɛ is the error term. The goal of linear regression is to estimate the coefficients that minimize the sum of squared residuals between the predicted values and the actual values. The relationship is assumed to be linear, and the model aims to find the best-fit line.
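For a single independent variable, the least squares coefficients have a well-known closed form: β₁ is the covariance of X and Y divided by the variance of X, and β₀ follows from the means. The sketch below applies it to a small made-up dataset and cross-checks the result against `np.polyfit`. The data values are hypothetical.

```python
import numpy as np

# Small hypothetical dataset that is roughly linear
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for a straight line
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Cross-check: np.polyfit returns the highest-degree coefficient first
slope_check, intercept_check = np.polyfit(x, y, 1)

print(beta_0, beta_1)
```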
On the other hand, polynomial regression allows for non-linear relationships between the variables by introducing polynomial terms. The equation for polynomial regression can be written as:
Y = β₀ + β₁X + β₂X² + … + βₙXⁿ + ɛ
Here, Y represents the dependent variable, X represents the independent variable, β₀, β₁, β₂, …, βₙ are the coefficients of the polynomial terms, n is the chosen degree of the polynomial, and ɛ is the error term. By including polynomial terms up to a chosen degree, the model can capture non-linear patterns such as curves with multiple peaks and valleys.
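To make the equation concrete, here is a small sketch that evaluates a 2nd-degree polynomial for a few values of X, leaving out the error term. The coefficient values (β₀ = 2, β₁ = 3, β₂ = -0.5) are assumptions picked for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical coefficients beta_0, beta_1, beta_2 (lowest degree first)
betas = [2.0, 3.0, -0.5]
x = np.array([0.0, 1.0, 2.0])

# Evaluate beta_0 + beta_1 * x + beta_2 * x^2 term by term
y_manual = sum(beta * x**power for power, beta in enumerate(betas))

# The same evaluation using NumPy's polynomial helper
y_hat = P.polyval(x, betas)

print(y_hat)  # [2.  4.5 6. ]
```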
The main difference between linear regression and polynomial regression lies in the assumption of linearity. Linear regression assumes a linear relationship between the variables, while polynomial regression relaxes this assumption and allows for more complex relationships. Note that a polynomial model is still linear in its coefficients, which is why the same least squares technique can estimate them.
When deciding between linear and polynomial regression, the choice depends on the underlying data and the relationship between the variables. Linear regression is suitable when the relationship is expected to be linear, with a constant slope. Polynomial regression is more appropriate when the relationship exhibits non-linear patterns that cannot be adequately captured by a straight line.
Let’s imagine you have a dataset that looks like the following figure when you plot it.
We can clearly see that there is no linear relationship between the independent and dependent variables, so no straight line can fit the above data well. Instead, we have to use a curve, possibly with several ups and downs, in order to fit the data points well.
You can find the code for the above plots in my GitHub.
In the above picture, we can see what happens if we fit a linear regression line and a polynomial regression line. The polynomial regression line follows the data points much more closely. If you calculate the RMSE for the two plots, you can see that the RMSE of the straight line is greater than the RMSE of the curve. That means the linear model underfits this dataset: it is too simple to capture the pattern in the data.
To generate this polynomial line, I used a 2nd-degree regression model, and it appears to fit the data points well. When we increase the degree of the polynomial function, the regression line passes closer to each individual data point. But a closer fit to the training data does not make a better model, because the regression model then overfits the data points: it starts to follow the noise instead of the underlying pattern. As you can see in the following figure, I have used a 15th-degree polynomial regression model to fit the data points, and it tries to capture every data point more closely than the 2nd-degree regression model does.
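The RMSE comparison can be reproduced with a short sketch. The plotted dataset is not included here, so the snippet assumes synthetic data drawn from a quadratic curve with added noise. The seed and the helper name `fit_rmse` are likewise chosen only for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Assumed synthetic data: a quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 + 3 * x - 0.5 * x**2 + rng.normal(0, 2, 30)

def fit_rmse(x, y, degree):
    """Fit a polynomial of the given degree and return its training RMSE."""
    coeffs = P.polyfit(x, y, degree)   # least squares fit, lowest degree first
    y_pred = P.polyval(x, coeffs)
    return np.sqrt(np.mean((y_pred - y) ** 2))

rmse_linear = fit_rmse(x, y, 1)
rmse_quadratic = fit_rmse(x, y, 2)

print(f"linear RMSE:    {rmse_linear:.3f}")
print(f"quadratic RMSE: {rmse_quadratic:.3f}")
```

On data like this, the straight line should show the larger RMSE, which is the underfitting described above.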
You can find the code for the above plot in my GitHub.
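One way to see overfitting in numbers is to compare the error on the training data with the error on fresh data from the same process. The sketch below does this for degrees 2 and 15. It assumes synthetic quadratic data with noise and uses NumPy's `Polynomial.fit`, which rescales X internally to keep high-degree fits numerically stable. This is an illustration, not the code behind the figures.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def sample(n):
    """Draw n points from an assumed quadratic trend with Gaussian noise."""
    x = rng.uniform(0, 10, n)
    y = 2 + 3 * x - 0.5 * x**2 + rng.normal(0, 2, n)
    return x, y

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

x_train, y_train = sample(30)
x_test, y_test = sample(30)   # unseen data from the same process

results = {}
for degree in (2, 15):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train RMSE {results[degree][0]:.3f}, "
          f"test RMSE {results[degree][1]:.3f}")
```

The 15th-degree model can never have a higher training error than the 2nd-degree one, because it contains it as a special case. Its error on unseen data is typically worse, which is the sign of overfitting.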
Variance, bias, and RMSE (Root Mean Square Error) are important concepts in polynomial regression that help assess the performance and quality of the model.
Variance quantifies the variability of the predictions. As the model becomes more complex, the regression line tries to pass through each data point, so the fit becomes sensitive to the noise in the training data. The error caused by this sensitivity is called variance.
Mathematically, variance can be computed as the average of the squared differences between each predicted value and the mean predicted value. It is calculated using the formula:
Variance = (1/n) * Σ(y_pred - mean(y_pred))²
where n is the number of data points, y_pred is the predicted value, and mean(y_pred) is the mean of the predicted values.
Bias measures the deviation between the predictions and the true values. A model that is too simple does not follow the data points closely, so its predictions deviate systematically from the true values. This is called high bias, and the estimated regression line is then less accurate.
Here, bias is measured as the average of the squared differences between each predicted value and the true value, which is the same quantity as the mean squared error. It is calculated using the formula:
Bias = (1/n) * Σ(y_pred - y_true)²
where n is the number of data points, y_pred is the predicted value, and y_true is the true value.
RMSE provides an overall measure of prediction accuracy and can be used as the cost function.
RMSE can be computed as the square root of the average of the squared differences between each predicted value and the true value. It is calculated using the formula:
RMSE = √((1/n) * Σ(y_pred - y_true)²)
where n is the number of data points, y_pred is the predicted value, and y_true is the true value.
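The three formulas, as this article defines them, translate directly into NumPy. The function names and the four-point example are hypothetical and only meant to show the arithmetic.

```python
import numpy as np

def prediction_variance(y_pred):
    """Average squared distance of each prediction from the mean prediction."""
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_pred.mean()) ** 2)

def squared_bias(y_pred, y_true):
    """Average squared difference between predictions and true values."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_pred, y_true):
    """Square root of the average squared difference."""
    return np.sqrt(squared_bias(y_pred, y_true))

# Tiny made-up example: only the last prediction is off, by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(prediction_variance(y_pred))   # 3.5
print(squared_bias(y_pred, y_true))  # 1.0
print(rmse(y_pred, y_true))          # 1.0
```

With these definitions, RMSE is always the square root of the bias value, so the two curves carry the same information on different scales.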
Balancing variance and bias is crucial in polynomial regression to ensure the model captures the true underlying patterns in the data while avoiding overfitting or underfitting.
Now, let’s plot variance, bias, and RMSE against the model complexity (the degree of the polynomial regression).
For testing purposes, we consider polynomials of degree 0 to 99 and plot bias, variance, and RMSE against model complexity.
You can get the full code from my GitHub.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate random data for the independent variable
X = np.random.rand(30) * 10
# Generate the dependent variable (target) using the equation
y_true = 2 + 3*X - 0.5*X**2 + np.random.randn(30) * 2
degrees = np.arange(0, 100)
# Lists that collect one value per polynomial degree
errors = []
biases = []
variances = []
for degree in degrees:
    # Transform the independent variable using polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    # Fit the polynomial regression model
    model = LinearRegression()
    model.fit(X_poly, y_true)
    # Predict the target values
    y_pred = model.predict(X_poly)
    # Calculate bias, variance, and error
    bias = np.mean((y_pred - y_true)**2)
    variance = np.var(y_pred)
    error = np.sqrt(mean_squared_error(y_true, y_pred))  # Calculate RMSE
    # Store the metrics for this degree
    biases.append(bias)
    variances.append(variance)
    errors.append(error)
# Plot the variance, RMSE, and Bias on a single axis
plt.plot(degrees, variances, label='Variance', color='orange')
plt.plot(degrees, errors, label='RMSE', color='green')
plt.plot(degrees, biases, label='Bias', color='blue')
# Set the axis labels and title
plt.xlabel('Model Complexity (Degree)')
plt.ylabel('Error / Bias / Variance')
plt.title('Bias, Variance, and RMSE')
# Display the legend
plt.legend()
# Display the plot
plt.show()
Here you can see that variance and bias move in opposite directions: as the model complexity increases, bias decreases while variance increases. This is known as the bias-variance trade-off.
In the above examples, we considered polynomial regression with a single variable. If we use more than one independent variable, the fitted model is no longer a curve but a surface, so the plot becomes more complex and needs more than 2 dimensions.
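With more than one variable, the polynomial also contains cross terms such as X₁X₂. The sketch below lists every term up to a chosen degree using only NumPy and `itertools`. The helper `polynomial_terms` is a hypothetical name written for this illustration. It plays the same role as scikit-learn's `PolynomialFeatures`, which the script above uses for the single-variable case.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_terms(X, degree):
    """Return all monomials of the columns of X up to the given total degree."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    columns = []
    for d in range(degree + 1):
        # Each combination picks d columns (with repetition) to multiply together
        for combo in combinations_with_replacement(range(n_features), d):
            # The empty combination (d = 0) yields the constant term 1
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)

# One sample with two variables: x1 = 2, x2 = 3
terms = polynomial_terms([[2.0, 3.0]], degree=2)

print(terms)  # [[1. 2. 3. 4. 6. 9.]] -> 1, x1, x2, x1^2, x1*x2, x2^2
```

Two variables at degree 2 already produce six terms, and the count grows quickly with the degree. This is one reason why high-degree models with several variables overfit easily.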
Let’s see what it looks like when we fit a linear regression model, a 2nd-degree polynomial regression model, and a 15th-degree polynomial regression model for the same dataset.
Polynomial regression (degree = 2)
Polynomial regression (degree = 15)
Observe how the fitted surface captures the data points at each degree of the polynomial regression. Also observe the bias, variance, and RMSE.