###### G. Multiple Regression

# Hypothesis Testing

Now, assume that our analysis shows that both education level and number of years of work have a significant effect on yearly income. Another question we might want to answer is how good a job these two factors do in accounting for differences in income. Although both are significantly related to income, they may account for a substantial amount of the variance or only a fraction of it; that is, they may explain most of the differences in Springfield residents' incomes or only a small part.

The fact that both factors are significantly related to the outcome does not necessarily imply that they explain a substantial portion of the variance in the outcome. Or, in more formal statistical terms, it doesn't mean that our model fits the data well.

This may appear surprising, but it is rather a logical result of the way the coefficients are calculated. To see how this is so, recall first that it is possible to decompose the total variance in Y as:

$$\sum_{i=1}^{n}(y_{i}-\bar{y})^{2} = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2} + \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2}$$

The numerator in [6] actually represents the amount of unexplained variance left in Y by the regression line. It is usually called the Sum of Squared Errors (SSE). One can easily picture a situation where the observations are so dispersed on the scatter plot that, although high values of X are associated with high values of Y, the fitted line represents a poor model. In this case we therefore expect SSE to be rather high. Intuitively, the amount of explained variance (SSR) is given by:

$$\text{SSR} = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2} = \text{SST} - \text{SSE}$$

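The decomposition above can be checked numerically. The sketch below fits a simple least-squares line to a small invented dataset (the education and income values are hypothetical, not from the text) and verifies that the total sum of squares splits exactly into SSE plus SSR:

```python
import numpy as np

# Hypothetical toy data: years of education vs. yearly income in $1000s.
x = np.array([10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([28, 33, 31, 40, 45, 43, 52, 60], dtype=float)

# Ordinary least-squares fit with an intercept.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
sse = np.sum((y - y_hat) ** 2)          # unexplained: Sum of Squared Errors
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained: regression sum of squares

# For a least-squares fit with an intercept, SST = SSE + SSR holds exactly
# (up to floating-point rounding).
print(round(sst - (sse + ssr), 6))
```

The identity holds only for least-squares fits that include an intercept; with other fitting criteria the cross term in the decomposition does not vanish.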
Using these two quantities, it is finally possible to construct a statistic that measures the goodness of fit of the model:

$$R^{2} = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$

R^{2} takes values between 0 and 1. Values close to 1 indicate that the model fits the data well, because the factors in the model account for a large share of the total variance (SSR is large relative to SST); models that explain less of the variance have a lower R^{2}.
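Putting the pieces together, the sketch below estimates a multiple regression with two predictors and computes R^{2} as 1 − SSE/SST. The education, work-experience, and income figures are invented for illustration; they are not data from the text:

```python
import numpy as np

# Hypothetical data for eight residents (all values invented).
education = np.array([10, 12, 12, 14, 16, 16, 18, 20], dtype=float)  # years of schooling
experience = np.array([2, 5, 10, 3, 8, 15, 6, 12], dtype=float)      # years of work
income = np.array([25, 32, 38, 36, 45, 55, 50, 62], dtype=float)     # yearly income, $1000s

# Design matrix with an intercept column, then OLS via least squares.
X = np.column_stack([np.ones_like(education), education, experience])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
income_hat = X @ beta

sst = np.sum((income - income.mean()) ** 2)
sse = np.sum((income - income_hat) ** 2)
r2 = 1 - sse / sst      # R^2 = SSR / SST = 1 - SSE / SST
print(round(r2, 3))
```

Because these fabricated incomes rise with both predictors, the fitted model leaves little unexplained variance and R^{2} comes out close to 1; with noisier data the same significant coefficients could coexist with a much lower R^{2}, which is exactly the point made above.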