R-squared, often written R², is the proportion of the variance in the response variable that can be explained by the predictor variables in a linear regression model.
The value for R-squared can range from 0 to 1 where:
- 0 indicates that the response variable cannot be explained by the predictor variables at all.
- 1 indicates that the response variable can be perfectly explained without error by the predictor variables.
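This definition can be written as R² = 1 - (residual sum of squares) / (total sum of squares). The following is a minimal sketch of that computation using made-up example data (not the DataFrame used later in this tutorial):

```python
import numpy as np

# R-squared = 1 - SS_res / SS_tot, computed by hand on toy data
y = np.array([3.0, 5.0, 7.0, 9.0])        # observed response values
y_pred = np.array([3.5, 4.5, 7.5, 8.5])   # fitted values from some model

ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares

r2 = 1 - ss_res / ss_tot
print(r2)  # 0.95
```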
The following example shows how to calculate R² for a regression model in Python.
Example: Calculate R-Squared in Python
Suppose we have the following pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'hours': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4, 3, 6],
                   'prep_exams': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4, 3, 2],
                   'score': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90, 75, 96]})

#view DataFrame
print(df)

    hours  prep_exams  score
0       1           1     76
1       2           3     78
2       2           3     85
3       4           5     88
4       2           2     72
5       1           2     69
6       5           1     94
7       4           1     94
8       2           0     88
9       4           3     92
10      4           4     90
11      3           3     75
12      6           2     96
We can use the LinearRegression() class from sklearn to fit a regression model and its score() method to calculate the R-squared value for the model:
from sklearn.linear_model import LinearRegression
#instantiate linear regression model
model = LinearRegression()
#define predictor and response variables
X, y = df[["hours", "prep_exams"]], df.score
#fit regression model
model.fit(X, y)
#calculate R-squared of regression model
r_squared = model.score(X, y)
#view R-squared value
print(r_squared)
0.7175541714105901
The R-squared of the model turns out to be 0.7176.
This means that 71.76% of the variation in the exam scores can be explained by the number of hours studied and the number of prep exams taken.
If we’d like, we could then compare this R-squared value to another regression model with a different set of predictor variables.
In general, models with higher R-squared values are preferred because their predictor variables explain more of the variation in the response variable.
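One caveat when comparing models with different numbers of predictors is that ordinary R-squared never decreases as predictors are added. Adjusted R-squared corrects for this by penalizing extra predictors. A minimal sketch, using the sample size and predictor count from the example above with its R-squared value:

```python
# Adjusted R-squared: 1 - (1 - R²) * (n - 1) / (n - p - 1)
# where n = number of observations and p = number of predictors
def adjusted_r_squared(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n=13 rows, p=2 predictors, R² rounded to 0.7176 as in the example
print(adjusted_r_squared(0.7176, 13, 2))  # ≈ 0.6611
```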
Related: What is a Good R-squared Value?
Additional Resources
The following tutorials explain how to perform other common operations in Python:
How to Perform Simple Linear Regression in Python
How to Perform Multiple Linear Regression in Python
How to Calculate AIC of Regression Models in Python
I want to perform a multiple linear regression of several variables against price, but I am getting an error.
X, Y = df[["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]], df.price
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, Y)
ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
How do I get the multiple linear regression and R-squared value in this case?
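As the error message suggests, LinearRegression cannot handle NaN values, so the missing values must be dropped or imputed before fitting. A minimal sketch of both options, using a small hypothetical DataFrame in place of the housing data (the column names and values here are illustrative only):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the housing DataFrame;
# one row contains a missing value to reproduce the NaN problem
df = pd.DataFrame({'sqft_living': [1180, 2570, 770, 1960],
                   'bedrooms': [3, 3, None, 4],  # NaN here
                   'price': [221900, 538000, 180000, 604000]})

# Option 1: drop rows with missing values before fitting
clean = df.dropna()
X, Y = clean[['sqft_living', 'bedrooms']], clean.price
lm = LinearRegression().fit(X, Y)
print(lm.score(X, Y))  # R-squared on the remaining rows

# Option 2: impute missing values inside a pipeline
pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
pipe.fit(df[['sqft_living', 'bedrooms']], df.price)
print(pipe.score(df[['sqft_living', 'bedrooms']], df.price))
```

Dropping rows is the simplest fix when few values are missing; imputation keeps all rows and is preferable when dropping would discard too much data.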