Most people who earn money keep track of their earnings and spending so that they know how much they have spent over the month, and we all agree that keeping track is a good habit. Some people, however, only estimate their monthly spending on the first day, and if the money turns out not to be enough they have to dip into their savings. Similarly, in data science, data scientists check their algorithms with different performance parameters so that they get an early indication of a problem or failure before it occurs. A performance parameter tells you how well the model is doing, and it is also an easy way to present the model to other people. So in this blog, we will discuss the different performance parameters used to evaluate Linear Regression.
What is Linear Regression?
Linear Regression is one of the simplest regression algorithms. It models a linear relationship between continuous variables: the independent variable (plotted on the X-axis) and the dependent variable (plotted on the Y-axis).
There are 3 main metrics for model evaluation in regression:
1. R Square or Adjusted R Square
2. Mean Square Error (MSE)/Root Mean Square Error (RMSE)
3. Mean Absolute Error (MAE)
R Square/Adjusted R Square:
It measures how much of the variability in the dependent variable can be explained by the model. In simple linear regression it equals the square of the Correlation Coefficient (R), which is why it is called R Square.
R Square is calculated as 1 minus the ratio of the sum of squared prediction errors to the total sum of squares, where the total sum of squares replaces each prediction with the mean of the actual values. On the training data of a linear regression with an intercept, the R Square value lies between 0 and 1 (on new data it can even turn negative if the model predicts worse than the mean), and a bigger value indicates a better fit between prediction and actual value. R Square is a good measure of how well the model fits the dependent variable. However, it does not take overfitting into account. If your regression model has many independent variables, it may be too complicated: it can fit the training data very well but perform badly on testing data.
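In symbols, the calculation is commonly written as below. The notation is an assumption added here for illustration and is not from the original post: y_i is an actual value, ŷ_i the matching prediction, ȳ the mean of the actual values, and n the number of observations.

```latex
R^2 \;=\; 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```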
That is why Adjusted R Square was introduced: it penalizes additional independent variables added to the model and adjusts the metric to guard against overfitting.
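As a rough sketch of how the penalty works, the usual textbook formula is Adjusted R Square = 1 - (1 - R Square) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the example numbers below are made up for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R Square from a plain R Square (textbook formula).

    n_samples  -- number of observations (n)
    n_features -- number of independent variables (p)
    """
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("need more observations than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator


# Hypothetical numbers: same R Square, same sample size, more variables.
print(f"{adjusted_r2(0.90, 12, 1):.4f}")  # 0.8900
print(f"{adjusted_r2(0.90, 12, 5):.4f}")  # 0.8167
```

With the R Square held fixed, the adjusted value drops as more variables are added, so an extra variable has to improve the fit enough to pay for itself.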
Problems with the R-squared:
R-squared comes with an inherent problem: additional input variables will make the R-squared stay the same or increase, never decrease (this is due to how the R-squared is calculated mathematically). Therefore, even if the additional input variables show no relationship with the output variable, the R-squared will not go down, and in practice it usually creeps up.
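A small experiment can make this visible. The sketch below assumes NumPy is available and uses synthetic data invented for the demonstration: it fits ordinary least squares once with a real predictor and once with an extra column of pure noise, then compares the training R-squared of the two fits.

```python
import numpy as np


def training_r2(X, y):
    """Fit ordinary least squares (with intercept) and return R-squared on the same data."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)                # sum of squared prediction errors
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1 - ss_res / ss_tot


# Synthetic data (an assumption for this demo, not data from the post).
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)
noise_feature = rng.normal(size=n)                 # unrelated to y by construction

r2_base = training_r2(x.reshape(-1, 1), y)
r2_extra = training_r2(np.column_stack([x, noise_feature]), y)

print("R-squared with x only:        ", r2_base)
print("R-squared with x and noise:   ", r2_extra)
```

The second value should never come out lower than the first, even though the added column carries no information about y.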
Mean Square Error (MSE)/Root Mean Square Error (RMSE):
Mean Square Error is an absolute measure of the goodness of fit. MSE is calculated by summing the squares of the prediction errors (real output minus predicted output) and then dividing by the number of data points. It gives you an absolute number for how much your predicted results deviate from the actual values.
You cannot read much from a single MSE value on its own, but it gives you a real number to compare against the results of other models and helps you select the best regression model.
Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than MSE for two reasons. First, MSE values can sometimes be too big to compare easily. Second, MSE is calculated from squared errors, so taking the square root brings it back to the same units as the prediction error and makes it easier to interpret.
Mean Absolute Error (MAE):
It is similar to Mean Square Error (MSE). However, instead of summing the squares of the errors as MSE does, MAE sums the absolute values of the errors.
Compared to MSE or RMSE, MAE is a more direct representation of the average size of the error terms. MSE penalizes big prediction errors more heavily by squaring them, while MAE weights every error in proportion to its size.
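For reference, the three error metrics can be written side by side. These are the standard definitions, and the symbols are an assumption added for illustration: y_i is an actual value, ŷ_i the matching prediction, and n the number of data points.

```latex
\mathrm{MSE}  = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\qquad
\mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
```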
Now let us take an example to understand all these metrics:
Let’s take two different modeling problems:
a. Forecasting demand for a retailer’s goods.
b. Building a statistical model of the temperature of a controller device in a power plant.
In the first case, the cost of an error scales linearly. If the model forecasts that 10 fewer units will be sold than actually are, then the retailer loses the profit on those 10 units. If the model predicts higher demand, then the retailer might find that there is some surplus stock, but if the retailer is in a domain where the goods do not expire (e.g. electronics), then this is not a big deal.
In the second scenario, we have a controller device that we know is in danger of breaking down when the temperature gets too high. In that case, the cost of an error is highly non-linear. Small deviations from the predicted temperature are not important, but if the model makes one large prediction error, then the whole system could face catastrophic failure.
Therefore, the RMSE is better suited for the second scenario, whereas the MAE is better suited for the first.
Now let’s calculate the Mean Absolute Error:
First, we need to find the absolute difference between each predicted value and the actual value.
Then we find the mean of these errors:
Mean = (Sum of absolute errors)/(total observations)
= 22 / 12
≈ 1.8333
Now let’s calculate the Mean Square Error:
Here we need to find the squared error for each observation.
Now we calculate the mean of the squared errors:
Mean = (sum of squared errors)/(total observations)
= 56 / 12
≈ 4.6666667
From the above example, we can observe the following.
1. As forecasted values can be less than or more than actual values, a simple sum of differences can be zero. This can lead to the false interpretation that the forecast is accurate.
2. As we take a square, all errors are positive, and the mean is positive, indicating that there is some difference between the estimates and the actual values. A lower mean indicates that the forecast is closer to the actual values.
3. All errors in the above example are in the range of 0 to 2 except one, which is 5. As we square it, the difference between this and the other squares increases, and this single high value leads to a higher mean. So MSE is influenced by large deviations or outliers.
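These three observations can be reproduced in a few lines of plain Python. The error lists below are hypothetical values chosen for illustration (they are not the table from this example): one list whose signed errors cancel out, and one where a single error of 5 replaces a zero.

```python
def mean_signed_error(errors):
    """Plain average of signed errors; positives and negatives can cancel."""
    return sum(errors) / len(errors)


def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    return sum(e * e for e in errors) / len(errors)


# Hypothetical signed errors (actual minus predicted), not the post's data.
small_errors = [1, -1, 2, -2, 0, 0]
with_outlier = [1, -1, 2, -2, 0, 5]   # same, but one error is 5

# Observation 1: signed errors cancel, although no forecast is perfect.
print(mean_signed_error(small_errors))      # 0.0

# Observation 2: squaring makes every term non-negative.
print(mean_squared_error(small_errors))     # 10 / 6

# Observation 3: the single large error moves MSE far more than MAE.
print(mean_absolute_error(small_errors), mean_absolute_error(with_outlier))
print(mean_squared_error(small_errors), mean_squared_error(with_outlier))
```

In this made-up case the one outlier raises the MAE from 1.0 to 11/6, while the MSE jumps from 10/6 to 35/6.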
Now we calculate the Root Mean Square Error for the given data:
Here the process is the same as for the mean square error; the only extra step is to take the square root of the mean.
The mean as we calculated above is 4.6666667. So the square root of the mean is ≈ 2.1602.
Now we go on and calculate the R² for the data:
First, we calculate the squared difference between predicted and actual values.
The sum of the squared errors is 56.
Now we calculate the squared difference between actual and mean of actual values.
Mean of actual values = (sum of actual values)/(total observations)
The sum of these squared differences is 602.6666667.
Now we know the formula for R squared = 1 - (Sum of squared errors)/(Sum of squared differences)
R square = 1 - (56)/(602.6666667) ≈ 0.9071
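The arithmetic of the whole worked example can be checked with a short script. Since only the totals are stated in the text, this sketch starts from them rather than from the individual observations: 12 observations, a sum of absolute errors of 22, a sum of squared errors of 56, and a sum of squared differences from the mean of 602.6666667. The variable names are made up for illustration.

```python
import math

# Totals quoted in the worked example above.
n_observations = 12
sum_abs_error = 22
sum_squared_error = 56
sum_squared_diff_from_mean = 602.6666667

mae = sum_abs_error / n_observations
mse = sum_squared_error / n_observations
rmse = math.sqrt(mse)
r_squared = 1 - sum_squared_error / sum_squared_diff_from_mean

print(f"MAE  = {mae:.4f}")        # 1.8333
print(f"MSE  = {mse:.4f}")        # 4.6667
print(f"RMSE = {rmse:.4f}")       # 2.1602
print(f"R^2  = {r_squared:.4f}")  # 0.9071
```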
Python Implementation using sklearn and formula: