Regression
Mathematical and machine learning approach
First, we need to clarify what regression actually means.
Imagine there are some data points (x, y coordinates) on a graph. You have to draw a line (straight or curved) that fits the data points as closely as possible. In other words, you try to draw the best possible line through the data. This technique is called regression.
In the images above, I have tried to draw a red line through the data points as best I can. This is regression.
There are three main types of regression:
- Linear regression
- Polynomial regression
- Logistic regression
These are supervised machine learning techniques.
Now that you know the basic idea of regression, we can also think about it mathematically.
The mathematical approach to regression:
You can probably guess what I’m going to do. Yes, I’m going to draw graphs in the way you all know.
That is, “y = f(x)”, defining y as a function of x.
Here x is the independent variable and y is the dependent variable.
When the graph is linear, the function is
y = mx + c
When the graph is nonlinear (curved), the common function is
y = b₀ + b₁x + b₂x² + b₃x³ + … + bₙxⁿ
where n is the degree of the polynomial.
The shape of this function changes with its degree (the degree of a function is the highest exponent of any variable in it).
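To make this concrete, here is a minimal sketch using NumPy’s `polyfit` to fit both a straight line and a cubic curve; the data points are invented purely for illustration:

```python
import numpy as np

# Toy data points (x, y coordinates); values are invented for illustration.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.0, 2.1, 4.2, 8.1, 15.9, 26.2])

# Degree 1: fits y = mx + c (a straight line).
m, c = np.polyfit(x, y, deg=1)

# Degree 3: fits y = b0 + b1*x + b2*x^2 + b3*x^3 (a curve).
# np.polyfit returns coefficients from the highest power down.
b3, b2, b1, b0 = np.polyfit(x, y, deg=3)

print(f"linear fit:     y = {m:.2f}x + {c:.2f}")
print(f"polynomial fit: y = {b0:.2f} + {b1:.2f}x + {b2:.2f}x^2 + {b3:.2f}x^3")
```

Changing `deg` changes the degree of the polynomial, and with it the shape of the fitted curve.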
Let’s talk about the practical use of regression.
Suppose you have been observing the price of an antique painting, and you have the price values for several previous years. Now you want to estimate the price of the same painting several years into the future. What you can do is find the correlation between year and price; that is, you build a function that captures the relationship between them. You can think of the years as x and the prices as y (the correlation between x and y can be linear or nonlinear; in the nonlinear case, the function would be a polynomial).
As you can see in the figure, you can estimate (predict) the price of the antique painting using regression. For this, you have to build an accurate correlation between x and y.
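Here is a minimal sketch of the painting example, assuming a roughly linear relationship between year and price; the years and prices are invented for illustration:

```python
import numpy as np

# Observed prices of the painting (values invented for illustration).
years  = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
prices = np.array([10.0, 11.5, 13.2, 14.8, 16.1, 18.0])  # in thousands

# Fit a straight line y = mx + c through the observations.
m, c = np.polyfit(years, prices, deg=1)

# Use the fitted line to estimate the price in a future year.
future_year = 2025
estimate = m * future_year + c
print(f"Estimated price in {future_year}: {estimate:.1f} thousand")
```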
When we draw this regression line, we should be careful to draw the most accurate line possible: the line with the smallest vertical distances to the data points.
Regarding the previous example, we should draw the regression line so that h₁, h₂, h₃, … take their smallest values. To do this, we use two concepts in regression:
- Cost function
- Gradient descent
The cost function evaluates how suitable the coefficients of the variables in the hypothesis function are (the hypothesis function, also called the mapping function, formulates the correlation between the x and y variables), while gradient descent suggests new coefficients for the hypothesis function.
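To make the idea concrete, here is a minimal sketch of gradient descent minimizing a mean-squared-error cost function for a straight-line hypothesis y = mx + c; the data and the learning rate are assumptions chosen for illustration:

```python
import numpy as np

# Toy data (invented for illustration): fit y = m*x + c by gradient descent.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

def cost(m, c):
    """Mean squared error: the average squared vertical distance h_i."""
    return np.mean((y - (m * x + c)) ** 2)

m, c = 0.0, 0.0   # initial guess for the coefficients
lr = 0.01         # learning rate (step size)

for _ in range(5000):
    residual = (m * x + c) - y          # prediction error at each point
    grad_m = 2 * np.mean(residual * x)  # d(cost)/dm
    grad_c = 2 * np.mean(residual)      # d(cost)/dc
    m -= lr * grad_m                    # step against the gradient
    c -= lr * grad_c

print(f"fitted line: y = {m:.2f}x + {c:.2f}, cost = {cost(m, c):.4f}")
```

In practice, a library fits the line for you in one call, but the loop above shows what happens under the hood: gradient descent repeatedly proposes new coefficients, and the cost function measures how good each proposal is.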
You can gain a deeper understanding of the cost function and gradient descent in the linear regression article.
Difference between Interpolation and Regression
In interpolation, the goal is to construct a function or curve that passes exactly through every given point in a dataset. Interpolation can be likened to an algorithm that lacks a “brain”, or the ability to generalize beyond the given data: its only goal is an exact match to the provided points, so the interpolated curve goes through each data point.
Regression, on the other hand, looks for a curve or function that approximates the relationship between the data points without necessarily passing through each individual point. It aims to minimize the overall distance, or error, between the predicted curve and the data points, striving for the closest possible fit. While it may not fit the data points perfectly, regression tries to find a more general function that captures the underlying trends and patterns in the data. This allows it to make predictions or estimates for new or unseen data points based on what it has learned from the provided data.
Interpolation demands a perfect match to all points, whereas regression seeks the best possible approximation while allowing for some deviation from the exact data points.
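The difference is easy to see in code. In this sketch (with invented data), a degree-(n−1) polynomial interpolates all n points exactly, while a straight-line regression only approximates them:

```python
import numpy as np

# Toy dataset (invented for illustration).
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1.0, 2.7, 2.2, 4.8, 5.1])

# Interpolation: a degree-(n-1) polynomial passes exactly through all n points.
interp_coeffs = np.polyfit(x, y, deg=len(x) - 1)
interp_fit = np.polyval(interp_coeffs, x)

# Regression: a degree-1 line only approximates the points.
line_coeffs = np.polyfit(x, y, deg=1)
line_fit = np.polyval(line_coeffs, x)

print("max interpolation error:", np.max(np.abs(interp_fit - y)))  # ~0 (exact fit)
print("max regression error:   ", np.max(np.abs(line_fit - y)))    # > 0 (approximate)
```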
However, there are significant disadvantages associated with regression analysis. One notable drawback is that it can suggest spurious relationships between variables that are actually unrelated. This occurs when the regression model mistakenly indicates a correlation due to chance or other confounding factors, even though no true relationship exists.
Another drawback is Simpson’s paradox, where a trend that appears in several groups of data can weaken or even reverse when the groups are combined.
We will discuss the three main regression types mentioned above (linear, polynomial, and logistic) in depth in another article.