Coefficient of Determination


A widely used measure for a regression model is the following ratio of sums of squares.


               Definition 4.1. The coefficient of determination is


    R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}.                    (4.14)
The coefficient is often used to judge the adequacy of a regression model. Subsequently, we will see that in the case where X and Y are jointly distributed random variables, R^2 is the square of the correlation coefficient between X and Y, and 0 ≤ R^2 ≤ 1. We often refer loosely to R^2 as the amount of variability in the data explained or accounted for by the regression model. For the oxygen purity regression model, we have R^2 = SS_R/SS_T = 152.13/173.38 = 0.877; that is, the model accounts for 87.7% of the variability in the data.
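As a minimal numerical sketch, the oxygen purity value of R^2 quoted above can be reproduced directly from the two sums of squares given in the text; both forms of equation (4.14) agree.

```python
# Computing R^2 for the oxygen purity model from the sums of
# squares quoted in the text (SS_R and SS_T).
SS_R = 152.13            # regression sum of squares
SS_T = 173.38            # total sum of squares
SS_E = SS_T - SS_R       # error sum of squares, since SS_T = SS_R + SS_E

R2 = SS_R / SS_T         # first form of (4.14)
R2_alt = 1 - SS_E / SS_T # second, equivalent form of (4.14)

print(round(R2, 3))      # 0.877
```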
    The statistic R^2 should be used with caution, because it is always possible to make R^2 unity by simply adding enough terms to the model. For example, we can obtain a "perfect" fit to n data points with a polynomial of degree n − 1. In addition, R^2 will always increase if we add a variable to the model, but this does not necessarily imply that the new model is superior to the old one. Unless the error sum of squares in the new model is reduced by an amount equal to the original error mean square, the new model will have a larger error mean square than the old one, because of the loss of one error degree of freedom. Thus, the new model will actually be worse than the old one.
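The overfitting caveat can be illustrated numerically. In this sketch (the data are made up, and NumPy is assumed available), a degree n − 1 polynomial is fit through n points of pure noise; it interpolates them exactly, so R^2 is driven to one even though there is no real relationship at all.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(6.0)                # n = 6 arbitrary x values
y = rng.normal(size=6)            # pure noise: no true relationship with x

coeffs = np.polyfit(x, y, deg=5)  # polynomial of degree n - 1 = 5
y_hat = np.polyval(coeffs, x)     # interpolates all 6 points exactly

SS_E = np.sum((y - y_hat) ** 2)   # essentially zero
SS_T = np.sum((y - y.mean()) ** 2)
R2 = 1 - SS_E / SS_T
print(R2)                         # essentially 1 (up to floating-point rounding)
```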
    There are several misconceptions about R^2. In general, R^2 does not measure the magnitude of the slope of the regression line. A large value of R^2 does not imply a steep slope. Furthermore, R^2 does not measure the appropriateness of the model, since it can be artificially inflated by adding higher-order polynomial terms in x to the model. Even if y and x are related in a nonlinear fashion, R^2 will often be large. Finally, even though R^2 is large, this does not necessarily imply that the regression model will provide accurate predictions of future observations.
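The point that a nonlinear relationship can still yield a large R^2 is easy to demonstrate. In this sketch (hypothetical data, NumPy assumed), y is an exact quadratic function of x, yet a straight-line fit still reports R^2 above 0.9.

```python
import numpy as np

x = np.linspace(1, 10, 20)
y = x ** 2                         # y is clearly nonlinear in x

slope, intercept = np.polyfit(x, y, deg=1)   # straight-line (wrong) model
y_hat = slope * x + intercept

R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(R2, 3))                # large, despite the wrong model form
```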
    Our development of regression analysis has assumed that x is a mathematical variable, measured with negligible error, and that Y is a random variable. Many applications of regression analysis involve situations in which both X and Y are random variables. In these situations, it is usually assumed that the observations (X_i, Y_i), i = 1, 2, ..., n, are jointly distributed random variables obtained from the distribution f(x, y).
    For example, suppose we wish to develop a regression model relating the shear strength of spot welds to the weld diameter. In this example, weld diameter cannot be controlled. We would randomly select n spot welds and observe a diameter (X_i) and a shear strength (Y_i) for each. Therefore (X_i, Y_i) are jointly distributed random variables.

    We assume that the joint distribution of X_i and Y_i is the bivariate normal distribution, where μ_Y and σ_Y^2 are the mean and variance of Y, μ_X and σ_X^2 are the mean and variance of X, and ρ is the correlation coefficient between Y and X. Recall that the correlation coefficient is defined as

    \rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}                        (4.15)

where σ_XY is the covariance between Y and X.
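Equation (4.15) has a direct sample analogue: the sample covariance divided by the product of the sample standard deviations. The sketch below uses made-up paired data (loosely in the spirit of the weld diameter and shear strength example; the numbers are invented) and checks the result against NumPy's built-in correlation.

```python
import numpy as np

# Hypothetical paired observations (X_i, Y_i); the values are made up.
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])   # e.g. weld diameters
y = np.array([1.8, 2.9, 4.2, 4.8, 6.1])   # e.g. shear strengths

cov_xy = np.cov(x, y, ddof=1)[0, 1]       # sample covariance, sigma_XY
rho = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # equation (4.15)

print(round(rho, 4))                      # strong positive correlation
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`, which applies equation (4.15) internally.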
    The conditional distribution of Y for a given value of X = x is

    f_{Y|x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_{Y|x}} \exp\left[ -\frac{1}{2} \left( \frac{y - \beta_0 - \beta_1 x}{\sigma_{Y|x}} \right)^2 \right]   (4.16)