# Correlation

Variables are, by definition, things that vary. Researchers are often interested in variables that vary together, such as when one variable causes another. For example, time spent studying for a statistics test will “cause” an increase in scores on the test. Thus, we expect time spent studying and test scores to vary together systematically. Specifically, as the variable “time” increases, the variable “score” also increases. In a relationship such as this, the reverse of the stated relationship is also true.

In our example, as the variable “time” decreases, so too will the variable “score” decrease. It is possible, even common, for one variable to get smaller as another gets larger. Say we are interested in studying the effects of drinking alcohol on driving ability. As the variable “beers consumed” goes up, we would expect the variable “driving ability” to go down.

For some research questions, you need to understand the *relationship* between two variables. For example, if an investor wants to understand the risk (as measured by volatility) of a portfolio of stocks, it is essential for the investor to grasp how closely the returns on the stocks track each other (move up and down in price together). Researchers can determine the relationship between two variables with two basic **measures of association**: *covariance* and *correlation*. A *measure of association* is a numerical value that reflects the tendency of two variables to move together in the same direction or in opposite directions.

The covariance is a measure of association, but like the variance, it is in a funky metric that we can’t intuitively understand: its units are the product of the two variables’ units, so its size depends on how each variable happens to be measured. Correlation is a closely related measure. It’s defined as a value between –1.00 and +1.00, so interpreting the correlation is easier than interpreting the covariance. If we are plugging the measure of association into another (usually more advanced) statistic, we often use the covariance for mathematical reasons; if we want to interpret the statistic directly, we report the correlation.
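To make the contrast concrete, here is a minimal sketch in plain Python. The data are made up for illustration, and the helper names `covariance` and `correlation` are our own. The sketch assumes the usual sample definitions (an n - 1 denominator for the covariance, and correlation as covariance divided by the two standard deviations). It measures study time once in hours and once in minutes.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance, using the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance rescaled by the standard deviations of both variables."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: study time and test scores for five students
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 60.0, 61.0, 70.0, 77.0]
minutes = [h * 60 for h in hours]  # same study time, different units

cov_hours = covariance(hours, scores)      # 15.0
cov_minutes = covariance(minutes, scores)  # 900.0
r_hours = correlation(hours, scores)       # about 0.981
r_minutes = correlation(minutes, scores)   # about 0.981

print(cov_hours, cov_minutes)
print(round(r_hours, 3), round(r_minutes, 3))
```

Switching from hours to minutes multiplies the covariance by 60, while the correlation stays the same. That is why the correlation is the easier of the two to interpret directly.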

When variables are systematically related in either of these ways (moving in the same direction or in opposite directions), then they are said to **covary**. Another word for covariation is **correlation**. The degree of this covariation is often measured with a statistic known as a **correlation coefficient**. A correlation coefficient, then, is a measure of the magnitude of the relationship between two variables.

The most commonly encountered correlation coefficient is **Pearson’s r** statistic (*r*). Pearson’s *r* ranges in value from -1.0 to 0.0 to +1.0. If the value is zero, then there is no correlation between the two variables. That is, the two variables do not vary together in any systematic way. (This is an oversimplification. In reality, it means that they do not vary together in any *linear* way, but that is a topic for later.)
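The point about linearity can be illustrated with a small sketch. The data are invented, and `pearson_r` is a hypothetical helper written here from the standard textbook formula. Y is completely determined by X, yet the relationship is curved and symmetric, so *r* comes out as zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from sums of deviation products (textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Y depends perfectly on X, but along a U-shaped curve rather than a line
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_curved = pearson_r(x, y)
print(r_curved)  # zero: no linear relationship, despite perfect dependence
```

A zero *r* here does not mean the variables are unrelated. It only means that a straight line does not capture how they move together.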

If the value of *r* is 1.0, then the correlation is said to be perfect. That means that whenever one variable (X) goes up a specified amount, then the other variable (Y) also goes up a specified amount. When this is the case, the value of Y can be predicted with perfect accuracy by knowing X. Thus, a valuable use of correlation (and later regression) is predicting one thing by knowing another.
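As a rough illustration of a perfect correlation, the sketch below builds Y from X with an arbitrary straight-line rule (Y = 3 + 2X, chosen only for this example) and checks that *r* equals 1.0. The helper `pearson_r` is our own and uses the standard textbook formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from sums of deviation products (textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [3 + 2 * xi for xi in x]  # every 1-unit step in X is a 2-unit step in Y

r_perfect = pearson_r(x, y)

# Size of each step in Y as X goes up by one unit
steps_y = [b - a for a, b in zip(y, y[1:])]

print(r_perfect)  # 1.0
print(steps_y)    # [2, 2, 2, 2]
```

Because each step in X always comes with the same step in Y, knowing X is enough to recover Y exactly for these data.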

When both variables increase or decrease together, then the relationship is said to be **positive**. If one goes up and the other goes down in value, then the relationship is said to be **negative**. By convention, if a relationship is negative, a negative sign is placed before the value of *r*. Thus, a value of -1.0 represents a perfect negative correlation. If there is no negative sign, then we are to assume that the correlation is positive. Note that a correlation coefficient is not a proportion—it cannot be viewed as a fraction of something.

Don't let the signs confuse you; negative signs tell us the direction of the correlation and say nothing about its magnitude or strength.
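The sketch below compares a negative and a positive correlation on made-up numbers: beers consumed against a hypothetical driving-ability score, and hours studied against test scores. The helper `pearson_r` is our own, written from the standard formula. The sign gives the direction, and the absolute value gives the strength.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from sums of deviation products (textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for illustration only
beers = [0, 1, 2, 3, 4]
driving_ability = [95, 88, 80, 69, 60]  # falls steadily as beers go up

hours = [1, 2, 3, 4, 5]
scores = [60, 52, 70, 61, 77]           # rises with hours, but noisily

r_beers = pearson_r(beers, driving_ability)  # strongly negative
r_study = pearson_r(hours, scores)           # moderately positive

print(round(r_beers, 2), round(r_study, 2))
print(abs(r_beers) > abs(r_study))  # True: the negative one is stronger
```

In this made-up example the negative correlation is the stronger of the two, even though a negative number is "smaller" than a positive one.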

Last Modified: 06/03/2021