# Advanced Statistical Analysis:

*A Primer*

*Adam J. McKee, Ph.D.*

*This content is released as a draft version for comment by the scholarly community. Please do not distribute as is.*

## Section 2.1: Simple Regression

**Simple linear regression** is a statistical method that allows us to summarize and study relationships between two quantitative variables that are measured on the interval or ratio scale (continuous level variables). One element of the pair is denoted as X and is called by several different synonyms: the **predictor variable**, the **explanatory variable**, or the **independent variable (IV)**. The other variable in the pair, denoted Y, also has several synonyms: the **response variable**, the **outcome variable**, and the **dependent variable** are commonly seen. I personally use IV and DV most often, so that is the shorthand you will see most often in this text. You will, however, be exposed to those other terms, and it is important to commit them to memory and differentiate between them.

You may have snickered a little when you saw the terms “simple” and “regression” connected as the heading of this section. Back in the bad old days of hand calculations, it perhaps was something less than simple. In our modern world where computers are doing all the heavy lifting, it really is pretty simple. We use the term “simple” to denote the fact that we are using only one predictor variable (X). Later, we will learn about an expansion of the idea that allows us to answer research questions about several predictor variables, and we’ll denote them with subscripts (X₁, X₂, X₃, and so forth). For now, we will stick to the ordered pair version so that we can get some major concepts down before we complicate things with all those extra variables. So for now, we are not really doing multivariate statistics—we’re considering bivariate statistics for a bit.

**What This Section Isn’t**

We will be looking back to those repressed memories of your college algebra class for some guidance in this section. Remember how you can describe a line on a graph with a pretty simple equation? That idea will be at the heart of what we are doing. Also, we’ll need to recall the idea of a function. Remember all that *f*(*x*) stuff? I remember doing that in algebra class; the thing I remember most was that it was much, much easier than doing long division of polynomials. As a consolation prize, I will never ask you to do long division of polynomials. But that function stuff is critical.

The thing that makes what we’re doing here different from those lines you had to graph in college algebra (my kids also did this in sixth-grade math, so it’s not really that bad) is that in algebra, the relationships are usually *functional*. That statement also applies to physical science (or dare I say physics?) when we consider physical laws. Remember the distance-rate-time equation? *D = R × T*. For example, if you are driving 65 miles per hour for exactly 2 hours, you will have traveled 130 miles. Relationships like that are known as **deterministic** (or **functional**) **relationships**. Scientists in the hard sciences have those things all the time—the front flap of a physics book is plastered with equations that fit that description. Human behavior is usually much more complicated than the physical world. If a physicist wants to know how long it will take a rocket to get to the moon, she can simply plug the speed of the rocket and the distance to the moon into a functional equation to get the time it will take to get there. A criminologist can’t plug in a subject’s age and socioeconomic status to see if that person will be a criminal or not. Human behavior is much too complex to predict 100% of the time.

In social statistics, we’re cheating. We are not dealing with mathematical functions that give us 100% correct answers every time. What we are doing is taking some data that doesn’t really fit a straight line, and creating a function based on the closest thing we can get to a line—the line that best fits our data. What that means is that we lack the precision of the physicist. We are making approximations based on **models**. In practice, our model is an equation very similar to the one you used in algebra class to describe the line on a piece of graph paper. The fact that we are using imprecise models means that we can make predictions, but those predictions will not be 100% accurate, and the degree of accuracy will vary between individuals we plug into our prediction equations. If our model is a good one, we’ll get pretty close for most people most of the time.

To understand why we aren’t more precise (as a general rule) in making predictions, it is important to consider what a model is in the social scientific endeavor. When I was a child, I was fascinated by airplanes, especially fighter aircraft. I would buy (well, Mom would) models of these “warbirds,” and painstakingly assemble them, painting the parts as accurately as I could. When I was done with this process, I had a model plane that I hung from the ceiling. It had no engines, and could not fly. The bombs and missiles attached beneath it would not seek out and destroy enemy targets. They were, after all, only *models*. A model, by definition, provides us only an incomplete version of reality. In social scientific research (I argue), it is that incompleteness that makes models valuable. Social reality is far too complex to take in all at once. The human mind, while amazing, can only wrap around a few concepts at a time. The number of variables that have an impact (no matter how small) on human behavior approaches infinity. To really understand social phenomena, social scientists dissect (the meaning of *analysis*!) the social world and consider only a few pieces (variables) at a time. Lucky for us, not all social variables are created equal. Often, we find that our dependent variables (DVs) of interest can be mostly explained given only a handful of predictor variables (that is, IVs).

With **simple** regression analysis, we only look at the relationship between a single predictor variable and some outcome variable (Y). We glean a ton of information from that simple analysis. First of all, it can tell us if there really is a systematic, linear relationship between X and Y. If we assume that there was one according to some (bad) theory, then simple regression is a great way to invalidate the theory. It will also tell us the magnitude of the relationship between X and Y. If we determine that we can explain 0.012% of the variance (change) in Y given X, then we aren’t very impressed. In fact, with a small number like that, we’d probably chalk it up to sampling error and say that there is no relationship. Our analysis can also tell us about the **direction** of the relationship. In the regression context, direction refers to whether Y goes up or down when X goes up or down. If Y goes up when X goes up, then we say that the direction of the relationship is **positive** or **direct**. If Y goes down when X goes up, then we say that the direction of the relationship between X and Y is **negative** or **inverse**. We can also use some mathematical wizardry (well, the computer can) to generate a prediction equation based on the relationship that allows us to make future predictions about the value of Y given a known value of X.

### Linear

When we say that regression is linear, we mean that it is based on a linear equation. Such equations are so called because if you graph them, what you get is a straight line. That “regression” line can be established by a simple “regression” equation:

*Y = a + bX*

In this equation, *Y* is the dependent variable (DV), and *X* is the independent variable (IV). We call *a* the intercept and *b* the slope. The intercept is so called because it marks the point where the regression line crosses (intercepts) the vertical axis. Another way to look at it is as the value of *Y* when *X* is zero. The slope tells us how big a change we get in *Y* for every one-unit increase in *X*. Just like with the roof of a house, the larger the slope, the steeper the line.
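As a quick illustration, here is the regression equation written as a tiny Python function. The intercept and slope values are made up for the example, not taken from any real data.

```python
# A minimal sketch of the linear equation Y = a + bX using
# made-up (hypothetical) coefficient values.
a = 2.0   # intercept: the predicted Y when X is zero
b = 0.5   # slope: the change in Y for each one-unit increase in X

def predict(x):
    """Return the predicted Y for a given X."""
    return a + b * x

print(predict(0))   # 2.0 (the intercept)
print(predict(10))  # 7.0
```

Notice that plugging in *X* = 0 returns the intercept itself, just as the text describes.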

### A “Straight” Line?

A basic **assumption** of linear regression is that the relationship between *X* and *Y* is linear, meaning that a straight line does a reasonably accurate job of describing the relationship. If you remember the heartbreak of graphing equations in algebra class, you’ll remember that not every equation graphs as a straight line; some graphs have bends and curves.

Science, on the other hand, loves simple. We use terms like “parsimony” and “Occam’s razor” to describe the idea that, all things being equal, simpler is better. When we try to model reality with regression models, we sometimes run into the problem of variables that are not related in such a way that a graph of the relationship forms a straight line. Often, however, a straight line works. Unless you have a theoretical reason to believe a curved line should be modeled, it is best to assume that a straight line will work and test the assumption by inspecting the data. When considering a model to try (often called “model specification”), consider how the variables are related in your experience and according to theory. Take income and education, for example. The more education you have, the more money you will make. We can get data from the Department of Labor website and test this one. A line does do a pretty good job of describing the relationship. It’s not perfect, mind you—medical doctors make much more than PhDs, which isn’t at all fair! So there are differences, but nothing that can be modeled any better by putting bends in the regression line.

Now let’s consider the relationship between age and income. When you are an infant, you don’t make any money. Students don’t make a lot of money either. As you graduate from college, your income rises very sharply and continues for several years until you reach your career zenith (e.g., you’re a full professor with tenure now), and then it tapers down to 3% per year for a cost of living raise. When you retire, your income tends to plummet, and you are living on a fixed income. Every time you reach a new life stage, there is a bend in the line. In this case, a linear equation would do a pretty terrible job of modeling the behavior of income as a function of age.

### Data Considerations

To do regression analysis, you need a set of cases. By convention, individuals make up rows, and variables make up columns in a spreadsheet. Most of the time, social scientists have *people* as their “cases,” but pretty much anything you can measure will work. Economists often use units of time, for example, such as quarters and years.

Much has been said about how many cases you need to do a regression analysis. The fundamental mathematics of regression requires that you have at least as many cases as you have variables—you get an error message if you don’t. To get anything approaching reliable results, you need far more.

Variables, of course, have to be quantitative, meaning they have to be recorded as numbers. A basic assumption is that these variables are measured on the interval or ratio scale. Recall that something on an interval scale has the same magnitude of difference everywhere along the scale. Anything measured “per” something usually fits the description. Another way to conceptualize this assumption is to say that the variables must be continuous. Income measured in dollars is a good example; the number of possible values is nearly infinite. Perhaps the best way to understand continuous variables is to say that they are not discrete. A discrete variable is one that takes on only specific values. If you roll a six-sided die at the casino, you can roll a 1 or a 6, but you can never roll a 1.33. The categories, then, are *discrete*.

Researchers often fudge this requirement, especially when dealing with attitudinal data. Remember that famous researcher Likert? He came up with those scales where you rank your agreement with a statement on an ordinal scale from “strongly disagree” to “strongly agree.” This type of data is commonly used in regression, but the practice is not without its detractors. One way to deal with this is to use summative scales where several Likert items are added together to achieve a scale score with much more variation than would be found in a 1 to 4 or a 1 to 10 range. Technically, regression “needs to know” that the magnitude of the distance between scores is equal, and Likert-type scales don’t tell us that—that’s why they are really ordinal.

When reviewing for a quiz on scales of measurement with my undergraduate students, I always tell them that if a bar graph makes sense, then the data are discrete. If the graph would have too many bars and not make a lot of sense, then the data can be considered continuous. Think of IQ scores as an example. The range is not infinite, and you can’t get a fractional score. Nearly everyone on the planet will score between 70 and 130, and they will get a whole number score. By a strict definition, such data are discrete. But when we use the “bar graph test,” IQ test results pass and can be used in regression analysis without difficulty.

Recall that variables that don’t have any particular order at all are considered to be measured on the **nominal scale** (because they merely *name* something). Normally, you can’t use nominal level variables in regression analysis, just as you can’t use ordinal ones. This assumption is a big problem for researchers because important variables that we want to study are often nominal. Gender, measured as male and female, is a nominal level variable. So is “political party affiliation.” So is the distinction between the “experimental group” and the “control group.” This last distinction has led many would-be researchers to conclude that regression analysis cannot be used to analyze experimental data. That is an incorrect conclusion; the classic analyses of experimental designs are a special case of regression that automates the “dummy coding” process you can use to analyze this sort of data with regression. This is an important concept, and we will explore it in some detail later. Suffice it to say that you can do regression with nominal level data, but it takes some special preparation.
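To preview that special preparation, here is a minimal sketch of dummy coding in Python. The group labels are hypothetical; the point is simply that a nominal distinction becomes a 0/1 variable that regression can use.

```python
# A minimal sketch of "dummy coding": turning a nominal variable
# (group membership) into a 0/1 indicator variable that regression
# can treat as a number. The labels here are hypothetical.
groups = ["control", "experimental", "control", "experimental"]

# Code the experimental group as 1 and the control group as 0.
dummy = [1 if g == "experimental" else 0 for g in groups]
print(dummy)  # [0, 1, 0, 1]
```

With this coding, the slope for the dummy variable is simply the difference between the two group means, which is why experimental comparisons fit so naturally into the regression framework.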

### Ordinary Least Squares (OLS)

Ordinary Least Squares (OLS) is a computational method commonly used to get regression coefficients such as the intercept (*a*) and various slopes (*b*). There are other “fancy” methods, but they are not nearly as common—that’s what we mean by “ordinary.” The big idea is simple, but the actual computations can be quite complex. This computational complexity is why regression wasn’t used very much until the development of personal computers, after which the method exploded and came to dominate the social scientific research literature. (If you want to understand this math, see Fox [1997]. I freely admit to skipping several chapters in his seminal text).

Consider that we have an X and a Y variable in a spreadsheet. We generate a scatterplot and inspect the dots. They look something like a football, rising from the lower left-hand corner of the graph toward the upper right-hand corner. This gives us visual confirmation that the variables are systematically related (correlated) because there is a pattern. Had they been shaped like a basketball (no apparent trend), then we could conclude that they were not correlated. If the dots had all fallen in a nice, neat, straight line, then we could use that handy old equation from algebra class to describe the line and make predictions with it. Alas, the dots have not cooperated so fully. The best we can do is draw a line that splits the field of dots into two equal parts, like the seam on a football. That line—the regression line—is the best we can do at predicting future values of *Y* (with the information we have anyway). There are several different approaches to drawing this line. Some are visual—you could get a ruler and “eyeball it.”

Obviously, the ruler method isn’t very precise. The “line of best fit” can be established mathematically, using an equation to draw the line precisely where it is *most accurate* at making predictions. That leaves us with one major issue remaining: How do we draw the line such that it is most accurate in making predictions? The most common way is to flip the criterion around so that it is based on *reducing errors* rather than on being “more right.” When we have an observed value of *Y* and we predict that value (*Y’*), the difference between the two (Y – Y’) is an *error term*. Recall from basic statistics that sometimes statisticians like to square stuff for mathematical reasons. They like to square those error terms. The **least squares criterion** says that *we should select coefficients that make the sum of the squared prediction errors as small as possible*. Since the error terms are squared, they don’t consider the *direction* of the error, just its magnitude.
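As a sketch of the least squares idea in Python, the snippet below computes *a* and *b* using the standard closed-form OLS formulas (the slope is the covariance of X and Y divided by the variance of X). The data are made up purely for illustration.

```python
# A sketch of the least squares criterion with made-up data:
# choose a and b so that the sum of squared errors,
# sum((y - y')**2), is as small as possible. The closed-form
# OLS solution is used here.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: covariance of X and Y over the variance of X.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar  # intercept

print(round(a, 2), round(b, 2))  # 2.2 0.6
```

No other straight line drawn through these five points will produce a smaller sum of squared errors than the one defined by these two coefficients.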

### Judging Predictions

Least squares, by definition, always produces the best set of linear predictions for a given set of data based on the error reduction strategy we discussed above. As with all areas of life, sometimes the best you can get isn’t really that good. Being the objective, scientific type, social researchers want an objective, numerical way to judge just how good the predictions given by a particular equation are. The most common statistic used for this task is the **coefficient of determination**. Also called “R-squared,” it is symbolized *R*². If you’ve already had a statistics course, this statistic is interpreted in much the same way that we interpret *r*². We just have to remember that *R*² applies to a whole set of predictors, not just a single predictor like *r*². If there is just one independent variable in a model, then the value of *R*² will be the same as if you computed *r*².

To understand what *R*² tells us, it is helpful to think of a prediction situation where we have data on *Y*, but no predictor variables. When an equation has no independent variables, the least squares estimate for the intercept is simply the mean of *Y*. This means that if you don’t have any predictors (IVs), the best estimate of Y will be the mean of Y—not a very useful prediction!

The basic idea of *R*², then, is to compare the errors produced by the least squares equation with those produced by an equation that uses just the intercept (the mean of *Y*) as a predictor. The following formula can be used:

*R*² = 1 – [Sum of Squared Errors (regression) / Sum of Squared Errors (intercept only)]

We interpret *R*² as the proportion of the variance in *Y* accounted for by knowing the X value(s). We could say that it is “a proportion representing the reduction in the sum of squared prediction errors compared with only using the mean,” but that has no intuitive appeal, so we don’t. You can also think of it visually as the predictive power gained by knowing the slope of the line rather than drawing a flat line from the intercept across the graph.
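That comparison can be sketched in Python. The data and the fitted coefficients below are hypothetical; the point is the contrast between the errors from the regression line and the errors from using the mean of *Y* alone.

```python
# A sketch of R-squared using the formula in the text:
# R^2 = 1 - SSE(regression) / SSE(intercept only).
# The data and fitted coefficients are hypothetical.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # OLS intercept and slope for these data

y_bar = sum(ys) / len(ys)  # the intercept-only prediction

# Squared errors from the regression line vs. from the mean alone.
sse_reg = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sse_mean = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - sse_reg / sse_mean
print(round(r_squared, 2))  # 0.6
```

In words: knowing X shrinks the squared prediction errors by about 60% compared with guessing the mean of Y every time.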

### References and Further Reading

Allison, P. D. (1999). *Multiple regression: A primer*. Thousand Oaks, CA: Sage.

Fox, J. (1997). *Applied regression analysis, linear models, and related methods.* Thousand Oaks, CA: Sage.

File Created: 08/24/2018 Last Modified: 08/27/2019

This work is licensed under an **Open Educational Resource-Quality Master Source (OER-QMS) License**.