# The Regression Model

It's easier to explain the ideas of multiple regression by starting with the relationship between some outcome and one factor. Suppose, for example, we want to know something about how education level affects income, with a particular interest in whether there are significant financial rewards to higher education. To study this issue, we have collected information on both education level and income from every resident of the town of Springfield (the Friends of Springfield Association undertook the task of collecting detailed information from each of the town's 8,175 residents). The outcome in this example (or the dependent variable) is annual income, measured in thousands of dollars, while the independent variable is number of years of school.

One quick way to gain an initial understanding of the relationship between education and income (or any two variables) is to plot them. Plotting these two variables produces the following graph:

The graph shows clear evidence of a linear pattern between the two variables: higher values of X (years of school) are associated with higher values of Y (annual income), and vice versa. From this graph, it seems that a linear model does a good job of describing the relationship between annual income and number of years of school. (There are more sophisticated methods of evaluating linearity, but examining a graph such as this one is often sufficient.)

We can express the relationship between income and education through the following formula:

$$\text{Income} = b_0 + b_1 \times \text{Education} \qquad [1]$$

That is, equation [1] says that income can be expressed as a linear function of the number of years of education: some multiple of education level (b1) added to some constant value (b0). Regression is a tool that allows us to take income and education data from some sample and use these data to estimate b0 and b1. These values are then used to create predicted values of the outcome, with the observed or "true" value from the data designated as $y$ and the predicted value as $\hat{y}$.
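To make the idea of a predicted value concrete, here is a minimal Python sketch of the linear form in equation [1]. The function name `predict` and the coefficient values are hypothetical, chosen only for illustration; they are not estimates from the Springfield data.

```python
def predict(b0, b1, x):
    """Predicted outcome for input x under the line y_hat = b0 + b1 * x."""
    return b0 + b1 * x


# Hypothetical coefficients, for illustration only.
b0, b1 = 10.0, 2.0

# Predicted outcome for several hypothetical values of the independent variable.
for x in (8, 12, 16):
    print(x, predict(b0, b1, x))
```

Each predicted value can then be compared with the observed value for the same x.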

## A Note on Linearity

It is important to make certain that a linear relationship exists between the variables before running a regression model. A linear regression of this form is not an appropriate method of analysis for non-linear relationships, such as that shown in the graph below comparing female life expectancy with the number of doctors per million persons in the population.
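Besides inspecting a plot, one common diagnostic is to fit a straight line anyway and look at the errors it leaves behind. The sketch below is not the method used in this text. It uses made-up data (a logarithmic curve, not the life-expectancy data) and a hypothetical helper `line_residuals`, and it fits the line with the standard least-squares formulas for one predictor. If the true relationship is curved, the leftover errors show a systematic pattern instead of scattering around zero.

```python
import math


def line_residuals(xs, ys):
    """Fit a least-squares line to (xs, ys) and return observed minus predicted."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


xs = list(range(1, 10))

# Hypothetical curved relationship: y rises quickly, then levels off.
curved = [math.log(x) for x in xs]
# Hypothetical exactly linear relationship, for comparison.
straight = [3.0 + 0.5 * x for x in xs]

curved_res = line_residuals(xs, curved)
straight_res = line_residuals(xs, straight)

# Curved data: errors are negative at both ends and positive in the middle.
print([round(r, 3) for r in curved_res])
# Linear data: errors are all (numerically) zero.
print([round(r, 3) for r in straight_res])
```

A run of same-signed errors at the ends and the opposite sign in the middle suggests the straight line is missing curvature in the data.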

So we can express income as a linear function of education, and regression is a technique for establishing the parameters b0 and b1 in the linear equation [1]. But there are many possible ways to estimate these parameters, and many possible values for b0 and b1. Which values are chosen, and why?

In "ordinary least squares" (OLS) regression analysis, the values of b0 and b1 in equation [1] are selected to minimize the sum of the squared errors (e), where each error is defined as the difference between the observed value and the predicted one:

$$\sum_{i} e_i^2 = \sum_{i} \left(y_i - \hat{y}_i\right)^2 \qquad [2]$$

where $\hat{y}_i$ are the values of y predicted from equation [1]. It is this criterion that gives the method its name, Ordinary Least Squares or OLS regression. The line that generates the smallest value for the sum of squared terms in equation [2] is the regression line.
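For a single independent variable, the least-squares values can be computed directly with the standard textbook formulas: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies these formulas to a handful of made-up (education, income) pairs. The data, function names, and variable names are all hypothetical and are not the Springfield data.

```python
def fit_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(b0, b1, xs, ys):
    """Sum over all observations of (observed - predicted) squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: years of school and annual income in thousands of dollars.
years = [8, 10, 12, 14, 16]
income = [30.0, 41.0, 44.0, 55.0, 60.0]

b0, b1 = fit_ols(years, income)
best = sum_squared_errors(b0, b1, years, income)

# Any other line, such as this slightly nudged one, gives a larger sum.
nudged = sum_squared_errors(b0 + 1.0, b1 - 0.1, years, income)

print(b0, b1)
print(best, nudged)
```

Comparing `best` with `nudged` illustrates the defining property of the regression line: moving the intercept or slope away from the fitted values increases the sum of squared errors.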

In equation [1], the value b1 measures the change in Y associated with a one-unit increase in X. b1 is also referred to as the regression coefficient for X: it is the average amount the dependent variable increases when the independent variable increases by one unit and, in a model with several independent variables, the others are held constant. So when the independent variable increases by 1, how much does the dependent variable increase? On average, it increases by b1 units.
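This reading of b1 follows directly from the linear form of equation [1]. As a short check, writing $\hat{y}(x)$ for the predicted value at a given x (notation introduced here only for this step):

$$\hat{y}(x+1) - \hat{y}(x) = \bigl(b_0 + b_1 (x+1)\bigr) - \bigl(b_0 + b_1 x\bigr) = b_1$$

The intercept b0 cancels, so the predicted difference between two observations one unit apart in X is b1, whatever the starting value of x.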

## Example

Let's return to the questions laid out earlier about the relationship between education and income in the town of Springfield. Again, we have income and education data for all 8,175 of the town's residents. Using these data, we want to use ordinary least squares (OLS) regression to estimate a regression equation of the form:

$$\text{Income} = b_0 + b_1 \times \text{Education}$$

Using the data from Springfield, we get the following estimates:

We interpret the estimated value of b1 to mean that each additional year of schooling is associated with an additional \$4,000 of annual income.

This equation can also be used to generate a predicted value given a specified level of education. Say we want to know the predicted income of a Springfield resident who has twelve years of education. We can use the regression equation to do so.
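The calculation can be sketched in Python as follows. The slope of 4 (thousand dollars per year of school) comes from the interpretation given above; the estimated intercept is not reproduced in this text, so the sketch leaves b0 as an argument instead of assuming a value. The function and constant names are hypothetical.

```python
# Slope implied by the text: each additional year of school is associated
# with an additional $4,000 of income (income is measured in thousands).
B1_THOUSANDS_PER_YEAR = 4.0


def predicted_income(b0, years_of_school, b1=B1_THOUSANDS_PER_YEAR):
    """Predicted annual income in thousands of dollars: b0 + b1 * years."""
    return b0 + b1 * years_of_school


# Contribution of twelve years of education, before adding the intercept.
# Passing b0=0.0 here isolates the b1 * 12 term; it is NOT the estimated intercept.
education_term = predicted_income(0.0, 12)
print(education_term)
```

For a resident with twelve years of education, the predicted income is therefore the estimated intercept plus 48 (thousand dollars).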