Regression is probably the first method that students are taught when learning data analysis, as it is one of the most widely used techniques in industry. Companies use regression to model the responses they want to understand, to make projections, and to support business decisions, for example forecasting which product prices will be best accepted in the marketplace.
Regression has its place in inferential statistics because it uses a data set to make predictions. It is a form of supervised learning: an outcome variable guides the learning process. In unsupervised learning, by contrast, the features have no outcome measure, and the goal is to describe how the data are related and grouped.
Linear regression, explained in plain English, is when a feature called the input or independent variable, which can be discrete, continuous, or categorical, is used to predict a continuous output called the result or dependent variable. Simple Linear Regression (SLR) is described by this equation:
y = b0 + b1·x
Where x is the independent variable and y is the dependent variable. The parameter b0 is the intercept, the value of y when x is zero. The parameter b1 is the slope, or coefficient of x, which describes the mathematical relationship between the independent and dependent variables. This is important because if the dependent variable does not change when the independent variable changes, the coefficient b1 is zero and there is no linear relationship.
Once b0 and b1 are estimated, plug them into the model and use it as an estimator to predict target values: the model traces the line that best fits the data points.
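As a rough illustration of estimating b0 and b1 and then using the model to predict, here is a minimal sketch in plain Python. The function names and the data are made up for this example, and the closed-form least squares formulas it uses (slope as the co-variation of x and y divided by the variation of x) are the standard ones rather than something stated in the text above.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the line y = b0 + b1*x that minimizes squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, x_new):
    """Use the fitted parameters as an estimator for a new input."""
    return b0 + b1 * x_new


# Hypothetical data lying exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_linear_regression(x, y)
prediction = predict(b0, b1, 6)
print(b0, b1, prediction)
```

Because the sample points sit exactly on a line, the fitted intercept and slope recover 1 and 2, and the prediction for x = 6 is 13.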
Multiple Linear Regression uses more than one input to predict the outcome. It aims to build a model through an equation that relates two or more independent variables to the result. For example, to predict a house price, you should analyze which input variables matter most and how these independent variables relate to one another in ways that affect the price. The equation that describes multiple linear regression is:
y = b0 + b1·x1 + b2·x2 + … + bn·xn
Where y is the target or dependent variable and x1, x2, x3, …, xn are the independent variables. b0 is the intercept, the value of y when every x is zero; b1 is the coefficient of x1, b2 is the coefficient of x2, and so on. Holding the other variables constant, each coefficient is the change in the mean of the dependent variable for a one-unit change in its independent variable.
The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The sign of a coefficient matters: a positive coefficient means the dependent variable increases as the independent variable increases, while a negative coefficient means the dependent variable decreases as the independent variable increases.
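To make the multiple regression equation concrete, the sketch below fits b0, b1, and b2 for a made-up house price data set using only plain Python. It assumes the standard normal equations (X'X)b = X'y and solves them with a small Gauss-Jordan routine; the helper names, the data, and the generating rule price = 10 + 2·area + 5·rooms are all hypothetical. In practice a numerical library would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_linear_regression(rows, y):
    """Return [b0, b1, ..., bn] from the normal equations (X'X) b = X'y."""
    # Prepend a column of ones so that b0 acts as the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical inputs: (area in square meters, number of rooms)
rows = [(50, 2), (70, 3), (80, 3), (100, 4), (120, 4), (150, 5)]
# Hypothetical prices generated from price = 10 + 2*area + 5*rooms
prices = [10 + 2 * area + 5 * rooms for area, rooms in rows]

coefs = fit_multiple_linear_regression(rows, prices)
print(coefs)  # intercept first, then one coefficient per input
```

Both recovered slopes are positive, which matches the reading of the sign above: in this made-up data the price rises as either the area or the number of rooms increases.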
To measure the strength of the linear relationship between two variables, use the correlation coefficient, or Pearson correlation, whose value lies between -1 and 1. A correlation close to 1 indicates a strong positive relationship. As the correlation approaches zero the relationship weakens, and zero denotes no linear relationship; this does not rule out a different type of relationship, such as a curve. Finally, a negative correlation indicates that the dependent variable decreases as the independent variable increases, with values close to -1 indicating a strong negative relationship.
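The three cases described above can be checked with a small sketch of the Pearson correlation in plain Python. The function name and the data sets are invented for illustration; the last data set follows y = x², a curved relationship that the correlation coefficient does not detect.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_correlation(x, [2, 4, 6, 8, 10])   # y rises with x
r_negative = pearson_correlation(x, [10, 8, 6, 4, 2])   # y falls as x rises
# y = x**2 on a symmetric range: clearly related, but not linearly
r_curve = pearson_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_curve)
```

The first two values sit at the ends of the range (1 and -1), while the curved data gives a correlation of zero even though y is completely determined by x.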
The most important concepts supporting linear regression are:
- Least squares seeks the best-fitting line for a set of data. The intercept and the coefficient parameters can be calculated using the method of least squares.
To understand how it works:
1. Draw a scatterplot of the data points together with the regression line.
2. Calculate the residuals (graph 1): measure the vertical distance in Y from the line to each data point, square the distances, and add them up. The result is the sum of squared residuals, called SS(fit); the least squares line is the one that makes this sum as small as possible. The predictions should be unbiased, meaning the fitted values are not systematically too high or too low.
3. Check the residuals: the average should be zero, the median close to zero, and the maximum and minimum should be roughly equal in absolute value. If the average residual differs from zero, the model is biased and its predictions will be systematically too high or too low. The regression line will probably not fit the data points well, since its slope differs from the trend in the data. Always include the constant term b0 to avoid this bias: it makes the average residual zero and ensures the regression line is not forced through the origin (0,0). In other words, the constant term b0 helps control bias in the regression model.
4. Calculate the variance around the fit, Var(fit), by dividing SS(fit) by the sample size n.
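The steps above can be sketched in a few lines of plain Python. The data are made up (roughly y = 10 + 2x with a little noise) and the variable names are only illustrative. The second half fits a line forced through the origin, an assumption added here purely to show what happens to the average residual when the constant term b0 is left out.

```python
# Hypothetical data, roughly y = 10 + 2x plus a little noise
x = [1, 2, 3, 4, 5]
y = [12.1, 13.9, 16.2, 17.8, 20.1]
n = len(x)

# Least squares line that includes the constant term b0
mean_x = sum(x) / n
mean_y = sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Step 2: residuals and their sum of squares, SS(fit)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_fit = sum(r ** 2 for r in residuals)

# Step 3: the average residual should be zero (up to rounding error)
mean_residual = sum(residuals) / n

# Step 4: variance around the fit
var_fit = ss_fit / n

# For comparison: least squares line forced through the origin (no b0)
b1_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
residuals_origin = [yi - b1_origin * xi for xi, yi in zip(x, y)]
mean_residual_origin = sum(residuals_origin) / n
ss_fit_origin = sum(r ** 2 for r in residuals_origin)

print(mean_residual, mean_residual_origin)
print(ss_fit, ss_fit_origin)
```

With b0 included, the average residual is zero apart from floating-point rounding. Without it, the average residual is clearly positive and the sum of squares is much larger, which is the bias that step 3 warns about.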
To find out whether the fitted line is a good one, calculate R squared (R²), the coefficient of determination. R² estimates the proportion of the variation in the dependent variable that is explained by the independent variables in the model.
1. Calculate the mean of the dependent data in Y.
2. Find SS(mean), the sum of squares around the mean, by proceeding exactly as in the least squares steps above: measure the distance from each Y data point to the mean, square the distances, and add them up.
3. Calculate the variance around the mean, Var(mean), by dividing SS(mean) by the sample size n. The variance is the average of the squared distances.
4. Calculate R², the coefficient of determination, with this formula:
R² = (Var(mean) - Var(fit)) / Var(mean) = (SS(mean) - SS(fit)) / SS(mean)
The coefficient of determination, R², describes how much of the variation in y is explained by its dependence on x. R² ranges from 0 to 1. A larger R² indicates a better fit: the model accounts for more of the variation in the output across different inputs.
R² = 1 indicates a perfect fit, where the predicted values and the data points coincide because all the residuals are zero. This is not necessarily good, since a line that reproduces the data set exactly may be a sign of overfitting. On the other hand, R² = 0 means the opposite: the independent variables explain none of the variation in the response variable.
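As a worked example of the R² calculation, the sketch below computes SS(mean), SS(fit), and R² for a small made-up data set in plain Python. The data and variable names are hypothetical. R² is computed both from the sums of squares and from the variances, to show that dividing by n does not change the ratio.

```python
# Hypothetical data, roughly y = 10 + 2x plus a little noise
x = [1, 2, 3, 4, 5]
y = [12.1, 13.9, 16.2, 17.8, 20.1]
n = len(x)

# Fit the least squares line y = b0 + b1*x
mean_x = sum(x) / n
mean_y = sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Sum of squares around the mean of y, and around the fitted line
ss_mean = sum((yi - mean_y) ** 2 for yi in y)
ss_fit = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# R squared from the sums of squares
r_squared = (ss_mean - ss_fit) / ss_mean

# The same ratio from the variances; the division by n cancels out
var_mean = ss_mean / n
var_fit = ss_fit / n
r_squared_from_variances = (var_mean - var_fit) / var_mean

print(r_squared, r_squared_from_variances)
```

For this nearly linear data, R² comes out just below 1, meaning the line accounts for almost all of the variation in y.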
5. In multiple linear regression, an independent variable that does not help reduce SS(fit) is useless: least squares assigns it a coefficient of, or close to, 0, and adding more such variables to the equation will most probably not improve the fitted line. Because plain R² never decreases when a variable is added, use the adjusted R² in this case; it indicates how well the line fits the data points while also adjusting for the number of independent variables. When adding a new independent variable to a multiple linear regression model, check the adjusted R²: if it decreases, the variable is not useful and is better discarded; if it increases, the variable is useful, so include it in the model.
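The text does not give a formula for the adjusted R², so the sketch below assumes the commonly used one, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p is the number of independent variables. The function name and the R² values in the example are invented to illustrate the decision rule, not taken from real data.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for n observations and p independent variables.

    Assumes the common formula 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


n = 20  # hypothetical sample size

# Hypothetical case: a third variable raises R squared only from 0.900 to 0.901
adj_two_vars = adjusted_r_squared(0.900, n, p=2)
adj_three_vars = adjusted_r_squared(0.901, n, p=3)

# The plain R squared went up, but the adjusted value goes down,
# which suggests discarding the third variable.
keep_third_variable = adj_three_vars > adj_two_vars
print(adj_two_vars, adj_three_vars, keep_third_variable)
```

Here the tiny gain in R² does not compensate for the extra variable, so the adjusted R² drops and the rule above says to leave the variable out.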
The p-value is used when working with a random sample rather than a whole population: it helps establish whether a relationship observed in the sample also holds in the population. The p-value tells whether the connection between the input and output is statistically significant by testing the null hypothesis. In linear regression, the null hypothesis states that the coefficient is zero, meaning there is no statistically significant relationship between predictor and result. For example, in Simple Linear Regression, under the null hypothesis the fitted line is horizontal, equal to the constant term (the mean of the dependent variable), and there is nothing interesting to show. The alternative hypothesis states that there is a statistically significant relationship between the independent and dependent variables.
A p-value greater than 0.05 is considered not statistically significant, so the null hypothesis is not rejected. A p-value less than 0.05 is statistically significant: the null hypothesis is rejected in favor of the alternative hypothesis. In a Multiple Linear Regression equation, to decide which independent variables to use in the final model, first check the correlation, then the adjusted R², and then the p-value. Keep the independent variables with a strong correlation, those that increase the adjusted R², and those with a statistically significant p-value smaller than 0.05.
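Regression software usually reports the p-value of a coefficient from a t-test. The Python standard library has no t-distribution, so the sketch below swaps in a different technique, a permutation test, to estimate a p-value for the slope under the same null hypothesis of no relationship. It shuffles y many times and counts how often a shuffled data set produces a slope at least as steep as the observed one. The function names, the number of permutations, the seed, and both data sets are assumptions made for this illustration.

```python
import random


def slope(x, y):
    """Least squares slope b1 of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the null hypothesis 'slope is zero'."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(slope(x, y))
    shuffled = list(y)
    count = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between x and y.
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            count += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_related = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1]   # rises steadily with x
y_unrelated = [5.0, 7.1, 4.8, 6.9, 5.2, 7.0, 4.9, 7.2]    # no clear trend

p_related = permutation_p_value(x, y_related)
p_unrelated = permutation_p_value(x, y_unrelated)

print(p_related < 0.05, p_unrelated < 0.05)
```

For the steadily rising data the estimated p-value falls below 0.05, so the null hypothesis is rejected; for the data with no trend it stays above 0.05, so the null hypothesis is not rejected.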