# Advanced Statistical Analysis:

*A Primer*

*Adam J. McKee, Ph.D.*

*This content is released as a draft version for comment by the scholarly community. Please do not distribute as is.*

## What is Regression Analysis?

In its most basic form (the one you may remember from an undergraduate statistics course), a **regression line** is a line drawn through a field of dots on a scatter plot. In this form, regression analysis is hard to differentiate from correlation. Locating the “regression line” precisely using mathematics rather than “eyeballing it” with a ruler is about as far as we go as undergraduates. Recall from your early studies of basic correlation coefficients (such as Pearson’s r) that correlations can be used to establish a relationship between two variables as well as indicate the strength (magnitude) of that relationship. Also, recall that Pearson’s useful r statistic has its limitations. Perhaps the most important (in the sense of restricting what the researcher can accomplish) are these: both X and Y must be measured at the interval/ratio level, only a single pair of variables can be considered, and the relationship between those two variables must be adequately represented by a straight line. Regression analysis is a powerful research tool because (with some effort) we can circumvent most of these problematic limitations of simpler correlations.

In an overarching sense, the most important task of regression analysis is to provide researchers with the ability to empirically verify theoretical concepts. Always keep in mind that quantitative social research begins with real-world observations that are stored as numbers. Many researchers view regression analysis as the most important tool for examining and evaluating those numerical representations of reality in a systematic way. One way to view regression analysis is as a technique for using independent variables (IVs) to explain the level of a dependent variable (DV) by deriving (with a computer!) coefficients for each independent variable. These coefficients are very similar to the *correlation coefficients* that you learned about as an undergraduate. (There are some important differences that we will delve into in later sections). Most of the time, researchers perform these regression tasks using sample data in order to glean the probable characteristics of populations. Because of this, regression can be classified among the *inferential statistics* and lends itself to hypothesis testing.

Perhaps the best way to understand regression analysis is to first understand what it is that researchers can *do* with it. There are many different purposes for using regression analysis in research, but most of them are related to one (or more) of several broad functions. The first major function of regression analysis is what has been referred to as “control modeling.” Recall that most research questions (and the resulting hypotheses that researchers develop) are about relationships between variables. *Crime* is an example of a research problem (a social phenomenon society wants to explain, predict, and ultimately control). Many criminologists have proposed that poverty is a proximate cause of crime. The theoretical waters become muddied when we consider that while poverty may well exert an influence on criminality, other variables are likely at play.

We know, for example, that gender plays a large role in criminality, especially violent crime (the species that society most wants to control). Specifically, males commit violent crimes at a much, much higher rate than do females. When testing the relationship between poverty and crime, a researcher might want to remove (control for) the effects of gender. In other words, regression analysis allows the researcher to subtract the “effect” of gender before examining the relationship between crime and poverty. The IV that the researcher is primarily concerned with is often referred to as the “variable of interest,” and the independent variables that the researcher will control for in the analysis are referred to as “control variables.”
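
The control-modeling idea above can be sketched in code: including the control variable in the model “subtracts” its effect, so the coefficient on the variable of interest reflects its association with the DV net of the control. The data and effect sizes here are hypothetical, invented solely for the illustration.

```python
# A hedged sketch of control modeling: estimating the effect of a
# variable of interest (poverty) while controlling for gender.
# All data and coefficients below are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
gender = rng.integers(0, 2, size=n)    # control variable (1 = male)
poverty = rng.normal(size=n)           # variable of interest
# Simulated crime measure: gender has a large effect, poverty a smaller one.
crime = 1.0 + 0.4 * poverty + 2.0 * gender + rng.normal(size=n)

# Including gender in the design matrix removes its effect, so the
# poverty coefficient is estimated net of gender.
X = np.column_stack([np.ones(n), poverty, gender])
coefs, *_ = np.linalg.lstsq(X, crime, rcond=None)
print(f"poverty (controlled) = {coefs[1]:.2f}, gender = {coefs[2]:.2f}")
```

Had gender been left out of the design matrix, any association between gender and poverty in the sample would have been folded into the poverty coefficient; including it keeps the estimate closer to poverty’s own contribution.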

You may remember from a basic statistics course that when more than one independent variable is considered at the same time (such as with a two-way ANOVA), the researcher must consider **interaction effects**. Often, researchers find that the interaction effects in the model being tested are more interesting than the original relationship under study. Recall that when there is an interaction effect between two (or more) variables, the effect of one IV on the DV depends on the level of the other IV. Regression provides several different approaches to examining these interaction effects. Often, techniques are used that consider the effects of an IV on different subgroups. We will delve deeper into these variations on the method in later sections.
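
In a regression framework, an interaction effect is commonly captured by adding the product of the two IVs as an extra term in the model. The sketch below simulates data in which the slope of one IV shifts across levels of another; all names and numbers are invented for the example.

```python
# A minimal sketch of an interaction term: the effect of x1 on y
# depends on the level of x2. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n)   # e.g., a binary subgroup indicator
# The slope of x1 is 0.5 when x2 == 0, but 0.5 + 1.0 when x2 == 1.
y = 0.5 * x1 + 1.0 * x1 * x2 + rng.normal(size=n)

# Adding the product term x1 * x2 lets the model estimate how the
# x1 slope shifts across levels of x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"x1 slope = {coefs[1]:.2f}, interaction = {coefs[3]:.2f}")
```

A nonzero interaction coefficient is the regression analogue of the interaction effects encountered in two-way ANOVA: the subgroups do not share a common slope.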

As previously mentioned, regression analysis is often categorized among the *inferential statistics*. This is because regression analysis lends itself to statistical hypothesis testing, which is the inferential process of estimating coefficients for variables in the population using sample data. Recall the basic assumption of social science that *hypotheses are only supported and never proven*. The logic is that new evidence may arise to disprove an accepted hypothesis. (This also works in the natural sciences—physicists thought Newton’s laws were essentially ‘proven’ until Einstein showed that they were merely a special case of a much broader set of rules). Regression analysis is such a powerful tool in part because it allows the researcher to test several different types of hypotheses. This helps one understand the logic of considering regression a multivariate technique—hypotheses about individual variables, sets of variables, and overall models can all be tested within a single analysis.
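
The simplest such hypothesis test asks whether a single coefficient differs from zero in the population. The sketch below computes the standard OLS t-statistic for a slope from simulated sample data; the data are invented, but the formula for the standard error is the conventional one.

```python
# A sketch of testing a hypothesis about a single coefficient
# (H0: the population slope is zero) using sample data.
# The data below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.6 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
sigma2 = resid @ resid / (n - X.shape[1])        # residual variance
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t_slope = coefs[1] / se[1]
print(f"slope = {coefs[1]:.2f}, t = {t_slope:.1f}")
# A |t| well above ~2 would lead us to reject H0 at conventional levels.
```

The same machinery extends to hypotheses about sets of variables (via F-tests on groups of coefficients) and about the overall model, which is what makes regression a genuinely multivariate tool.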

File Created: 08/24/2018 Last Modified: 08/24/2018

This work is licensed under an **Open Educational Resource-Quality Master Source (OER-QMS) License**.