What is the Coefficient of Determination?
After finding the best-fit linear regression line for our data, the next question is: how well does that line actually represent or explain our data?
The measure that answers this question is the Coefficient of Determination, denoted as $r^2$ (read: r-squared).
Simply put, $r^2$ tells us the proportion (or percentage) of the variation (the ups and downs in values) in the dependent variable (Y) that can be explained by the variation in the independent variable (X) using our linear regression model.
Coefficient of Determination from a Scatter Diagram
The value of $r^2$ is closely related to how tightly the data points cluster around the regression line:
- High $r^2$ (approaching 1 or 100%)
  [Figure: High $r^2$. Data points are very close to the regression line.]
  See how the data points in this diagram are tightly packed around the regression line? This indicates a high $r^2$ value (for example, around 0.95 or 95%). It means that most of the variation in Y values can be explained well by the regression line (or by variable X).
- Low $r^2$ (approaching 0 or 0%)
  [Figure: Low $r^2$. Data points are scattered far from the regression line.]
  Compare this with the high $r^2$ case. Here the points are more spread out from the regression line (the residual lines are longer). This indicates a low $r^2$ value (for example, around 0.40 or 40%). It means that this regression line is not very good at explaining the variation in Y values; only a small portion of the variation in Y can be explained by X through this model.
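To make the link between scatter and $r^2$ concrete, here is a minimal sketch in plain Python. It uses the standard residual-based definition, $r^2 = 1 - SS_{res} / SS_{tot}$, which is not spelled out in this text but gives the same value as the other routes for a simple linear regression. The function names `fit_line` and `r_squared`, the underlying line y = 3 + 2x, and both noise levels are assumptions chosen only for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys):
    """1 - (residual variation / total variation) around the fitted line."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = list(range(20))

# Same underlying line, two made-up noise levels: short vs. long residuals.
ys_tight = [3 + 2 * x + rng.gauss(0, 1) for x in xs]
ys_loose = [3 + 2 * x + rng.gauss(0, 15) for x in xs]

r2_tight = r_squared(xs, ys_tight)
r2_loose = r_squared(xs, ys_loose)
print(f"tight cluster: r^2 = {r2_tight:.2f}")
print(f"loose cluster: r^2 = {r2_loose:.2f}")
```

The only thing that differs between the two data sets is the size of the residuals, so the drop in $r^2$ for the second one comes entirely from the points sitting farther from the line.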
Calculating the Coefficient of Determination
The easiest way to calculate $r^2$ is by squaring the Correlation Coefficient ($r$) that we learned about earlier.
So, if you've already calculated the value of $r$, just square it!
Since the value of $r$ is always between -1 and +1 ($-1 \le r \le 1$), the value of $r^2$ will always be between 0 and 1.
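The squaring step can be sketched in a few lines of Python. The helper name `coefficient_of_determination` and the sample $r$ values are hypothetical, picked only to show how the range -1 to +1 maps onto 0 to 1.

```python
def coefficient_of_determination(r):
    """Square a correlation coefficient r, which must lie in [-1, 1]."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    return r * r


# Hypothetical r values, chosen only to show the range of outcomes.
for r in (-1.0, -0.9, -0.5, 0.0, 0.5, 0.9, 1.0):
    print(f"r = {r:+.1f}  ->  r^2 = {coefficient_of_determination(r):.2f}")
```

Note that a negative and a positive $r$ of the same size give the same $r^2$, so $r^2$ alone says nothing about the direction of the relationship.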
Mathematically (using Sum of Squares):
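The value of $r^2$ can also be calculated directly using the Sum of Squares values used to calculate $r$:
$$r^2 = \frac{(SS_{xy})^2}{SS_{xx} \cdot SS_{yy}}$$
where $SS_{xy} = \sum (x - \bar{x})(y - \bar{y})$, $SS_{xx} = \sum (x - \bar{x})^2$, and $SS_{yy} = \sum (y - \bar{y})^2$.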
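As a rough check of the sum-of-squares route, the sketch below computes the three sums of squared deviations for a tiny made-up data set and confirms that the result matches squaring $r$. The names `ss_xx`, `ss_yy`, and `ss_xy` are assumed labels for $\sum (x - \bar{x})^2$, $\sum (y - \bar{y})^2$, and $\sum (x - \bar{x})(y - \bar{y})$; the five data points are invented for illustration.

```python
import math


def sums_of_squares(xs, ys):
    """Return (ss_xx, ss_yy, ss_xy) as sums of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return ss_xx, ss_yy, ss_xy


# Made-up data: five (x, y) pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

ss_xx, ss_yy, ss_xy = sums_of_squares(xs, ys)

# Correlation coefficient, then r^2 by both routes.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r2_from_r = r ** 2
r2_direct = ss_xy ** 2 / (ss_xx * ss_yy)

print(f"SS_xx = {ss_xx}, SS_yy = {ss_yy}, SS_xy = {ss_xy}")
print(f"r = {r:.4f}")
print(f"r^2 (squaring r) = {r2_from_r:.2f}")
print(f"r^2 (direct)     = {r2_direct:.2f}")
```

For this data set the sums are 10, 6, and 6, so the direct route gives 36 / 60 = 0.60, which is the same value obtained by squaring $r$.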
Interpretation as a Percentage
The value of $r^2$ is often converted into a percentage (by multiplying by 100) for easier interpretation.
- If $r^2 = 0.81$, it means that 81% of the total variation in variable Y can be explained by the variation in variable X through the linear regression model.
- The remaining variation ($1 - r^2 = 0.19$, or 19% in this example) is attributable to other factors not included in the model (other variables, or random error).
The higher the percentage of $r^2$, the better our linear regression model is at explaining the relationship between X and Y.
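The percentage reading can be sketched as a small helper. The function name `explained_and_unexplained` is hypothetical, and the input 0.81 is simply the example value used in this section.

```python
def explained_and_unexplained(r_squared):
    """Return (explained %, unexplained %) for an r^2 value in [0, 1]."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r^2 must be between 0 and 1")
    explained = r_squared * 100
    return explained, 100 - explained


explained, unexplained = explained_and_unexplained(0.81)
print(f"explained by X: {explained:.0f}%, other factors: {unexplained:.0f}%")
# prints "explained by X: 81%, other factors: 19%"
```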