Coefficient of Determination
A widely used measure of fit for a regression model is the following ratio of sums of squares.
Definition 4.1. The coefficient of determination is
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}. \tag{4.14}$$
The coefficient is often used to judge the adequacy of a regression model. Subsequently, we will
see that in the case where $X$ and $Y$ are jointly distributed random variables, $R^2$ is the square of
the correlation coefficient between $X$ and $Y$. Also, $0 \le R^2 \le 1$. We often refer loosely to $R^2$ as
the amount of variability in the data explained or accounted for by the regression model. For the
oxygen purity regression model, we have $R^2 = SS_R/SS_T = 152.13/173.38 = 0.877$; that is, the
model accounts for 87.7% of the variability in the data.
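The two forms in Equation 4.14 can be checked numerically. The following is a minimal Python sketch, assuming a straight-line model fitted by ordinary least squares. The five data points and the helper names `fit_line` and `sums_of_squares` are illustrative assumptions, not taken from the text; only the oxygen purity sums of squares (152.13 and 173.38) come from the example above.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


def sums_of_squares(x, y):
    """Return (SS_R, SS_E, SS_T) for the fitted straight line."""
    b0, b1 = fit_line(x, y)
    ybar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    ss_t = sum((yi - ybar) ** 2 for yi in y)                 # total
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error (residual)
    ss_r = sum((fi - ybar) ** 2 for fi in fitted)            # regression
    return ss_r, ss_e, ss_t


# Illustrative data (an assumption, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_r, ss_e, ss_t = sums_of_squares(x, y)
r2_from_ssr = ss_r / ss_t        # first form of Equation 4.14
r2_from_sse = 1.0 - ss_e / ss_t  # second form of Equation 4.14

# Sums of squares quoted in the text for the oxygen purity model.
r2_oxygen = 152.13 / 173.38

print(r2_from_ssr, r2_from_sse, round(r2_oxygen, 3))
```

The two forms agree because, for a least-squares fit with an intercept, $SS_T = SS_R + SS_E$.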
The statistic $R^2$ should be used with caution, because it is always possible to make $R^2$ unity
by simply adding enough terms to the model. For example, we can obtain a "perfect" fit to $n$ data
points with a polynomial of degree $n - 1$. In addition, $R^2$ never decreases when we add a variable
to the model (and in practice it almost always increases), but this does not necessarily imply that
the new model is superior to the old one. Unless the error sum of squares in the new model is
reduced by an amount at least equal to the original error mean square, the new model will have a
larger error mean square than the old one, because of the loss of one error degree of freedom. In
that case, judged by error mean square, the new model will actually be worse than the old one.
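Both cautions in this paragraph can be illustrated with a short Python sketch. It rests on two assumptions that are not from the text: the data points are invented, and the degree $n - 1$ polynomial is evaluated through Lagrange interpolation, which coincides with the least-squares fit of that degree when the $x$ values are distinct. The helper `error_mean_squares` and its inputs (an old error sum of squares of 20.0 on 10 degrees of freedom) are likewise hypothetical.

```python
def lagrange_eval(xs, ys, t):
    """Evaluate the degree n-1 polynomial through the n points (xs, ys) at t."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


def r_squared(ys, fitted):
    """R^2 = 1 - SS_E / SS_T for observed ys and fitted values."""
    ybar = sum(ys) / len(ys)
    ss_t = sum((yi - ybar) ** 2 for yi in ys)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(ys, fitted))
    return 1.0 - ss_e / ss_t


def error_mean_squares(ss_e_old, df_old, reduction):
    """Error mean squares before and after adding one term.

    Adding a term lowers SS_E by `reduction` and costs one error
    degree of freedom.
    """
    ms_old = ss_e_old / df_old
    ms_new = (ss_e_old - reduction) / (df_old - 1)
    return ms_old, ms_new


# Five invented, clearly non-polynomial-looking points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.3, 1.1, 4.8, 3.0, 6.5]

# A degree-4 polynomial reproduces all five points, so SS_E = 0 and R^2 = 1.
fitted = [lagrange_eval(xs, ys, t) for t in xs]
r2_perfect = r_squared(ys, fitted)

# Hypothetical old model: SS_E = 20.0 on 10 error degrees of freedom (MS_E = 2.0).
ms_old, ms_small_gain = error_mean_squares(20.0, 10, reduction=1.0)  # reduction < MS_E
_, ms_large_gain = error_mean_squares(20.0, 10, reduction=3.0)       # reduction > MS_E

print(r2_perfect, ms_old, ms_small_gain, ms_large_gain)
```

In the hypothetical numbers, a reduction of 1.0 in the error sum of squares is smaller than the old error mean square of 2.0, so the error mean square rises even though $R^2$ goes up; a reduction of 3.0 is large enough to lower it.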
There are several misconceptions about $R^2$. In general, $R^2$ does not measure the magnitude of
the slope of the regression line. A large value of $R^2$ does not imply a steep slope. Furthermore, $R^2$
does not measure the appropriateness of the model, since it can be artificially inflated by adding
higher-order polynomial terms in $x$ to the model. Even if $y$ and $x$ are related in a nonlinear fashion,
$R^2$ will often be large. Finally, even though $R^2$ is large, this does not necessarily imply that the
regression model will provide accurate predictions of future observations.
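Two of these misconceptions are easy to see numerically. The sketch below is an illustration under stated assumptions: the data are constructed (an exact quadratic $y = x^2$ on $x = 1, \dots, 10$, and the same response rescaled by 0.001), and `slope_and_r_squared` is a hypothetical helper that fits a straight line by least squares.

```python
def slope_and_r_squared(x, y):
    """Least-squares slope and R^2 for the straight-line model y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_t = sum((yi - ybar) ** 2 for yi in y)
    b1 = sxy / sxx
    ss_r = b1 * sxy  # regression sum of squares for a straight-line fit
    return b1, ss_r / ss_t


x = [float(i) for i in range(1, 11)]

# An exactly quadratic (nonlinear) relationship, fitted with a straight line.
y_quadratic = [xi ** 2 for xi in x]
slope_quad, r2_quad = slope_and_r_squared(x, y_quadratic)

# The same response rescaled: the slope shrinks a thousandfold, R^2 does not move.
y_shallow = [0.001 * yi for yi in y_quadratic]
slope_shallow, r2_shallow = slope_and_r_squared(x, y_shallow)

print(slope_quad, r2_quad)        # R^2 is about 0.95 despite the wrong model form
print(slope_shallow, r2_shallow)  # nearly flat line, same R^2
```

The straight line is the wrong model for the quadratic data, yet $R^2$ is about 0.95, and rescaling the response changes the slope without changing $R^2$.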
Our development of regression analysis has assumed that x is a mathematical variable,
measured with negligible error, and that Y is a random variable. Many applications of regression
analysis involve situations in which both X and Y are random variables. In these situations, it is
usually assumed that the observations $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, are jointly distributed random
variables obtained from the distribution $f(x, y)$.
For example, suppose we wish to develop a regression model relating the shear strength of
spot welds to the weld diameter. In this example, weld diameter cannot be controlled. We would
randomly select $n$ spot welds and observe a diameter ($X_i$) and a shear strength ($Y_i$) for each.
Therefore, $(X_i, Y_i)$ are jointly distributed random variables.
We assume that the joint distribution of $X_i$ and $Y_i$ is the bivariate normal distribution, and $\mu_Y$
and $\sigma_Y^2$ are the mean and variance of $Y$, $\mu_X$ and $\sigma_X^2$ are the mean and variance of $X$, and $\rho$ is the
correlation coefficient between $Y$ and $X$. Recall that the correlation coefficient is defined as
$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \tag{4.15}$$
where $\sigma_{XY}$ is the covariance between $Y$ and $X$.
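Equation 4.15 defines a population quantity. As a hedged illustration, the Python sketch below computes a sample analogue of $\rho$, replacing the covariance and standard deviations with their sample counterparts, and checks that its square matches the $R^2$ of a least-squares straight line, as stated earlier for jointly distributed $X$ and $Y$. The weld diameters and shear strengths are invented, unitless numbers, and the function names are assumptions made for this example.

```python
import math


def sample_correlation(x, y):
    """Sample analogue of rho = sigma_XY / (sigma_X * sigma_Y)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    # The 1/(n-1) factors in the covariance and variances cancel.
    return sxy / math.sqrt(sxx * syy)


def r_squared_line(x, y):
    """R^2 = 1 - SS_E / SS_T, computed from the residuals of a fitted line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_t = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_e / ss_t


# Invented observations (X_i, Y_i): weld diameter and shear strength.
diameter = [4.8, 5.1, 5.5, 5.9, 6.2, 6.6]
strength = [2.9, 3.4, 3.6, 4.3, 4.4, 5.0]

r = sample_correlation(diameter, strength)
r2 = r_squared_line(diameter, strength)

print(r, r * r, r2)  # r * r and r2 agree up to rounding error
```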
The conditional distribution of Y for a given value of X = x is
$$f_{Y|x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_{Y|x}} \exp\left[-\frac{1}{2}\left(\frac{y - \beta_0 - \beta_1 x}{\sigma_{Y|x}}\right)^2\right] \tag{4.16}$$