*This content is released as a draft version for comment by the scholarly community. Please do not distribute as is.*

## Examining Data

If you’ve ever taken a computer science class, you probably remember the oft-repeated adage “Garbage In, Garbage Out.” This idea is critically important to the social scientific endeavor in general, and to regression analysis in particular. The quality of your research product depends directly on the quality of your data. (This is so important, in fact, that I recommend that you take a measurement class if at all possible before completing your graduate studies. You will learn all about topics like reliability and validity that make for excellent research.)

Before you can use (ordinary least squares) regression analysis, it is important to determine whether your X variables are related to your Y variables in a linear way. That is, does a straight line do a good job of describing the relationship? This basic assumption can be violated when there is no relationship between the variables (a terrible thing if you hypothesized such a relationship), or when there is a relationship that is better described by a curved line (or a complicated bendy line). Perhaps the easiest way to verify a linear relationship is to produce a simple scatterplot that shows X and Y for each ordered pair. If you do this in Excel, you can also generate a prediction equation and fit a line to those points. Note that in the social sciences, we usually call the line the “regression line.” In econometrics (and thus in Excel) it is called a “trend line.” It is the same thing.
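As a rough numerical companion to the scatterplot, the sketch below fits a least-squares line (the same kind of line Excel draws as a trend line) to two small sets of points and reports how much of the variance in Y each line accounts for. The data values and the function names `fit_line` and `r_squared` are hypothetical and chosen only for illustration; this is a minimal sketch, not a substitute for looking at the plot.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(xs, ys):
    """Share of the variance in ys that the fitted straight line accounts for."""
    intercept, slope = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: one straight-line relationship, one U-shaped relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]
curved_ys = [x ** 2 for x in xs]

linear_fit = fit_line(xs, linear_ys)
curved_fit = fit_line(xs, curved_ys)
linear_r2 = r_squared(xs, linear_ys)
curved_r2 = r_squared(xs, curved_ys)

print("linear data (intercept, slope):", linear_fit, "R^2:", linear_r2)
print("curved data (intercept, slope):", curved_fit, "R^2:", curved_r2)
```

In the curved case, Y is completely determined by X, yet the fitted line is flat and accounts for essentially none of the variance. A scatterplot would reveal the U shape immediately, which is why the visual check comes first.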

Another important concept in regression analysis is the idea of **multicollinearity** (aka collinearity). Remember that a major purpose of regression analysis is to partition the variance in the dependent variable (DV) and ascribe portions of it to the individual predictors. If there is a high correlation between the individual predictor (X) variables, then regression has no way to know how much “credit” to give each X in “causing” changes in Y. In other words, *multicollinearity* exists when two or more of the predictors in a regression model are moderately or highly correlated. Unfortunately, when it exists, it can wreak havoc on our analysis and thus limit the conclusions that the researcher can draw. Specifically, it can cause a host of problems, including:

- the estimated regression coefficient of any one variable depends on which other predictors are included in the model
- the precision of the estimated regression coefficients decreases as more predictors are added to the model
- the marginal contribution of any one predictor variable in reducing the error sum of squares depends on which other predictors are already in the model
- hypothesis tests may yield different conclusions depending on which predictors are in the model
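The first problem in the list can be seen in a small sketch. Here Y is built, by construction, as exactly X1 + X2, and X2 is nearly a copy of X1. All values and function names are hypothetical, and the two-predictor slopes come from the standard closed-form least-squares solution written out by hand so that the example needs only the standard library.

```python
from statistics import mean


def centered_sum(us, vs):
    """Sum of cross-products of deviations from the two means."""
    u_bar, v_bar = mean(us), mean(vs)
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))


def simple_slope(xs, ys):
    """Slope of the least-squares line with a single predictor."""
    return centered_sum(xs, ys) / centered_sum(xs, xs)


def two_predictor_slopes(x1, x2, ys):
    """Least-squares slopes (b1, b2) for a model with two predictors."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, ys), centered_sum(x2, ys)
    det = s11 * s22 - s12 ** 2  # shrinks toward zero as x1 and x2 become collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical predictors: x2 is x1 plus a small wobble, so the two are highly correlated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 7.8]
y = [a + b for a, b in zip(x1, x2)]  # by construction, each predictor has a true slope of 1

slope_x1_alone = simple_slope(x1, y)
slope_x1_with_x2, slope_x2_with_x1 = two_predictor_slopes(x1, x2, y)

print("slope for x1, x1 only in the model:", round(slope_x1_alone, 3))
print("slope for x1, x2 also in the model:", round(slope_x1_with_x2, 3))
```

With X1 alone in the model, its slope comes out close to 2, because X1 also takes the credit that belongs to X2. Once X2 is added, the slope for X1 falls to about 1. Neither number is a computational error; the estimate simply depends on which other predictors are in the model.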

Note that collinearity is not a binary option, but a continuum from “very little” to “a huge amount”; often there will be a small degree of relatedness between the different X variables, but we can safely ignore *small* relationships. It is when we observe a moderate to high degree of interrelatedness that we grow concerned. Simply put, a little bit of collinearity is okay, but if it’s a lot, then we have to do something about it.

One potential solution to this annoying problem is to design our research (specifically our measurements) such that our predictor variables are not related. Say, for example, we decide to conduct a study of college success using students’ GPA after eight semesters as our DV. If we operationalize “preparedness” by using both ACT and SAT scores, we will likely run into a wall of collinearity because ACT scores and SAT scores are likely to be highly related (because they purport to measure very similar things). The cautious researcher would anticipate this relationship “muddying the waters” and choose to use only one such measure (or use a more advanced technique that takes collinearity into account, such as structural equation modeling, or SEM).
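One simple way to screen for this kind of overlap before running a regression is to compute the correlation between every pair of predictors. The sketch below does so for invented scores from six students; the numbers, the third predictor (`study_hours`), and the 0.8 cutoff are all assumptions made for illustration, not established rules or real data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson_r(us, vs):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(us), mean(vs)
    cov = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    var_u = sum((u - u_bar) ** 2 for u in us)
    var_v = sum((v - v_bar) ** 2 for v in vs)
    return cov / sqrt(var_u * var_v)


# Hypothetical predictor values for six students (not real data).
predictors = {
    "act": [18, 21, 24, 27, 30, 33],
    "sat": [940, 1060, 1170, 1250, 1380, 1490],
    "study_hours": [12, 5, 9, 14, 6, 10],
}

CUTOFF = 0.8  # assumed threshold for "high"; choose your own for your field

correlations = {
    (a, b): pearson_r(predictors[a], predictors[b])
    for a, b in combinations(predictors, 2)
}
flagged = [pair for pair, r in correlations.items() if abs(r) >= CUTOFF]

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("pairs to worry about:", flagged)
```

In this made-up sample, only the ACT and SAT pair is flagged, which matches the reasoning above: the two tests measure very similar things, so keeping both in the model adds little information and a lot of overlap.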

Multicollinearity happens more often than not in observational studies. And, unfortunately, regression analyses most often take place on data obtained from observational studies. (Most social scientific studies are observational because of ethical constraints.) If you aren’t convinced, flip through the pages of the latest issue of your favorite journal. You will most likely find that “true experiments” are a rarity. It is for this reason that we need to fully appreciate the influence of multicollinearity on our regression analyses.

To complicate things even further, we have to consider two different types of collinearity. The first is **structural multicollinearity**. This type of collinearity is a mathematical artifact caused by creating new predictors from other predictors (e.g., by data transformations), such as creating the predictor X² from the predictor X. If you have to have a problem, then this is a good one. You caused it, so you can fix it by tweaking your data file.
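To make the structural case concrete, the sketch below correlates a predictor with its own square, then repeats the calculation after subtracting the mean from the predictor before squaring. Centering is one commonly used remedy and is offered here as an assumption; the text above does not prescribe a particular fix, and the data values are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(us, vs):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(us), mean(vs)
    cov = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    var_u = sum((u - u_bar) ** 2 for u in us)
    var_v = sum((v - v_bar) ** 2 for v in vs)
    return cov / sqrt(var_u * var_v)


# Hypothetical predictor measured on a 1-to-9 scale.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x_squared = [v ** 2 for v in x]

# Center the predictor first, then square the centered values.
x_bar = mean(x)
x_centered = [v - x_bar for v in x]
x_centered_squared = [v ** 2 for v in x_centered]

r_raw = pearson_r(x, x_squared)
r_centered = pearson_r(x_centered, x_centered_squared)

print("correlation of X with X squared:", round(r_raw, 3))
print("after centering X first:", round(r_centered, 3))
```

For this evenly spaced, symmetric set of values, centering removes the correlation between the linear and squared terms almost entirely. With skewed real data the reduction is usually smaller, so it is worth rechecking the correlation after the transformation rather than assuming the problem is gone.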

**Data-based multicollinearity**, on the other hand, is often a result of a poorly designed experiment, which can be very difficult to fix after the data are collected. This malady is also caused by reliance on purely observational data or by an inability to manipulate the system on which the data were collected. Because data-based multicollinearity is usually a design issue, you should give your variables careful consideration in the design phase of your study.

File Created: 08/24/2018 Last Modified: 08/24/2018

This work is licensed under an **Open Educational Resource-Quality Master Source (OER-QMS) License**.