Statistical inference is closely tied to what has been called Null Hypothesis Significance Testing (NHST). While controversial in elite research circles, these sorts of tests are ubiquitous in the literature across the social science disciplines. In the first course in statistics, most students learn to think of the decision to reject or fail to reject *the* null hypothesis as the primary objective. In the world of multivariate statistics, there may be several such evaluations rather than a single one. Recall how the two-way ANOVA works; there are really three hypotheses: one for each of the two factors, plus one for the interaction effect. That is the sort of idea we are considering here.

Let’s think back to our previous discussions for a few minutes and reconsider exactly what all this hypothesis testing stuff is about. The first thing that we must recall is that all this stuff is worthless if you have census data. We only need inferential statistics when we want to make inferences about a population using *sample* data. The problem with samples is that they sometimes don’t do a very good job of representing the population like we want them to do. You probably remember from a research methods class that bad sampling techniques can lead to samples that do a terrible job of reflecting the characteristics of the population being studied. If that is the case, then the results of our study will be useless.

That is, in essence, a research design issue, and a reminder of our mantra “garbage in, garbage out.” The other source of bad samples that don’t reflect the population we are trying to study is random chance—called **sampling error**. Recall that in this context “error” doesn’t mean we did anything bad; there is no pejorative connotation. Anything statisticians can’t explain is called “error.” In every sample, we even expect a little error. Because randomly occurring errors tend to be normally distributed, we can do a pretty good job of estimating how much error is likely in a given sample.
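To make the idea of predictable sampling error concrete, here is a minimal simulation sketch. It draws many random samples from a made-up normal population and compares the spread of the sample means with the theoretical standard error. The population values (mean 100, standard deviation 15), the sample size, and the function name `simulate_sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_sample_means(pop_mean, pop_sd, sample_size, n_samples, seed=42):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


# Hypothetical population and design values, invented for this example.
POP_MEAN, POP_SD = 100.0, 15.0
SAMPLE_SIZE = 25
N_SAMPLES = 2000

sample_means = simulate_sample_means(POP_MEAN, POP_SD, SAMPLE_SIZE, N_SAMPLES)

# How much the sample means actually bounce around from sample to sample.
observed_spread = statistics.stdev(sample_means)

# What theory says that spread should be: the standard error of the mean.
expected_spread = POP_SD / SAMPLE_SIZE ** 0.5

print(f"SD of the sample means:     {observed_spread:.2f}")
print(f"Theoretical standard error: {expected_spread:.2f}")
```

No single sample mean lands exactly on the population mean, but the size of the typical miss should come out close to the standard error. That regularity is what lets us attach probabilities to sampling error.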

Your statistics professor probably beat you over the head with *mean differences* for months, causing some degree of trauma and a deep-seated misconception that hypotheses and null hypotheses are all about means. We need to expand that idea. Keep in mind that any characteristic of a population (called a **parameter**) can be estimated using sample data. Not only can we estimate means, but we can extend that idea to standard deviations, variances, covariances, and ultimately regression coefficients. Statistical hypothesis tests help us determine whether the probability of obtaining a particular parameter estimate by chance (due to sampling error) is small enough to be disregarded.

Recall that researchers specify the chance they are willing to take of being wrong in disregarding chance by setting an **alpha level**. Conventionally, alpha is set at .05 or .01 (as an artifact of statistical table construction; today you are strongly advised to report *actual* probabilities). This means that the researcher is specifying that she is willing to accept a 5% chance or a 1% chance of wrongly rejecting chance (randomly generated sampling error) when it is, in fact, the cause of the observed parameter estimate.
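The meaning of alpha can be checked with a small simulation. The sketch below runs thousands of pretend studies in which the null hypothesis is true by construction, then counts how often a test rejects it anyway. To keep it short, it assumes a two-tailed *z*-test with a known population standard deviation of 1, which is a simplification rather than the test you would usually run. The function name `type_one_error_rate` and all the settings are hypothetical.

```python
import random
from statistics import NormalDist


def type_one_error_rate(alpha, n=30, n_studies=4000, seed=1):
    """Share of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    rejections = 0
    for _ in range(n_studies):
        # The null is true here: every score comes from a population with mean 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sample_mean = sum(sample) / n
        # z = (sample mean - hypothesized mean) / standard error, with sd = 1.
        z = sample_mean / (1.0 / n ** 0.5)
        # Two-tailed probability of a z at least this extreme under the null.
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_studies


rate_05 = type_one_error_rate(0.05)
rate_01 = type_one_error_rate(0.01)
print(f"Rejection rate at alpha = .05: {rate_05:.3f}")
print(f"Rejection rate at alpha = .01: {rate_01:.3f}")
```

Because nothing but sampling error is operating in these simulated studies, every rejection is a mistake. The rejection rates should land near .05 and .01, which is exactly the risk the researcher agreed to accept.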

The above statements were necessarily vague because those “parameters” can be anything you wish to calculate. In *t*-tests, we learned that the parameter of interest was the difference between means. If we saw such a difference in the sample, we were left to wonder whether it really existed in the population, or whether we just had a wonky sample that made it *appear* that the treatment worked. That’s what our hypothesis test does, and *all it does*. To go beyond excluding wonky samples, you have to consider effect size and how meaningful a particular effect size is in your particular research context.

It is beyond the scope of this little book to examine thoroughly, but know that NHST is highly controversial in more sophisticated research circles. Even so, these tests have been the bread and butter of social and behavioral researchers for a century, and the literature is full of them. The modern researcher will nearly always find herself in the middle of this debate. My suggestion? Straddle the fence by considering both effect size and power in your design.

### Testing the Regression of Y on X

Recall that most statistical significance tests end up being a *variance ratio*. I call these “signal to noise” ratios; they are proxies for explained variance over unexplained variance. In regression, we can compute a quantity known as the **regression sum of squares** (SS_{reg}). That is the “signal” (the explained variance), just like the numerator in the *F* equation you used in a one-way ANOVA calculation as an undergraduate. You can also compute a **residual sum of squares** (SS_{res}), which is the “noise.” In regression, a residual is the difference between the *predicted* value of Y and the *actual* value of Y in the sample data.

If X does a great job of predicting Y, then residuals will be very small. If X doesn’t do a very good job of predicting Y, then the residuals will be quite large. So, if we consider the ratio of the regression sum of squares over the residual sum of squares, we get a “signal to noise” ratio just like with an F test. When we adjust that for the degrees of freedom for each quantity, it actually becomes an *F* test, and we evaluate it just like we do the results of a one-way ANOVA test. The null hypothesis for these ANOVA results is that there is no relationship between X and Y. In its simplest form, this can be thought of as r = 0.00.
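As a rough sketch of the ratio just described, the following Python computes the two sums of squares and the resulting *F* for a simple regression with one predictor. The data values and the function name `regression_f` are invented for illustration; they do not come from any real study.

```python
def regression_f(x, y):
    """Return sums of squares, degrees of freedom, and F for Y regressed on X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Least-squares slope (b) and intercept (a).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    predicted = [a + b * xi for xi in x]

    # "Signal": variation in the predicted values around the mean of Y.
    ss_reg = sum((p - mean_y) ** 2 for p in predicted)
    # "Noise": variation of the actual values around the predicted values.
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, predicted))

    df_reg = 1      # one predictor
    df_res = n - 2  # n minus the two estimated quantities (a and b)

    # Each sum of squares is divided by its degrees of freedom before the ratio.
    f = (ss_reg / df_reg) / (ss_res / df_res)
    return {"ss_reg": ss_reg, "ss_res": ss_res,
            "df_reg": df_reg, "df_res": df_res, "f": f}


# Hypothetical scores, made up for this example only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]

result = regression_f(x, y)
print(f"SS_reg = {result['ss_reg']:.3f}, SS_res = {result['ss_res']:.3f}")
print(f"F({result['df_reg']}, {result['df_res']}) = {result['f']:.2f}")
```

The sketch stops at the *F* value. Turning it into a probability requires the *F* distribution, which your statistics package supplies in its ANOVA table.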

### Testing R^{2}

Recall that r^{2} indicates the proportion of variance of the DV accounted for by the IV. We can infer from this simple relationship that 1 – r^{2} is the proportion of variance of the DV not accounted for by the IV. Pedhazur (1997) demonstrates that this latter quantity is also the proportion of error variance. He goes on to provide an equation for *F* based on values of R^{2} rather than the sums of squares (p. 29). This version of the computation is much more intuitive because it suggests that it is R^{2} that is being tested. In this context, we can infer that the null hypothesis is that R^{2} = 0.00, or that the specified model (combination of X values) has no power to explain the variance in Y. In computer-generated regression outputs, there will commonly be an ANOVA table. This is the test produced in that table; it is a significance test of the overall regression model as represented by R^{2}.
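The R^{2} version of the test can be sketched in a few lines. The function below uses the standard textbook form of the ratio, in which R^{2} is divided by the number of predictors (k) and 1 – R^{2} is divided by N – k – 1. I am assuming this matches the equation Pedhazur gives, although his notation may differ. The function name `f_from_r_squared` and the example numbers are hypothetical.

```python
def f_from_r_squared(r_squared, n, k):
    """Compute F for the overall model from R-squared.

    r_squared: proportion of variance in Y explained by the model
    n:         number of cases in the sample
    k:         number of predictors (X variables) in the model
    """
    df_reg = k          # degrees of freedom for the "signal"
    df_res = n - k - 1  # degrees of freedom for the "noise"
    f = (r_squared / df_reg) / ((1.0 - r_squared) / df_res)
    return f, df_reg, df_res


# Made-up example: three predictors explain 30% of the variance in 100 cases.
f_value, df1, df2 = f_from_r_squared(0.30, n=100, k=3)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

Notice that the same R^{2} produces a larger *F* as the sample grows, because the unexplained proportion is spread over more degrees of freedom. That is one reason a statistically significant model can still explain very little.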

### Testing Regression Coefficients

As a multivariate tool, multiple regression produces an overall test of the specified model, and it also produces individual coefficients for each X variable in the model. Like other statistics, the regression coefficient, *b*, has a standard error associated with it. This is the “noise” component that allows us to construct a statistical significance test using the “signal to noise ratio” format that we’ve previously considered. The null hypothesis is essentially that *b* = 0.0. (The same test can also evaluate whether *b* differs significantly from any other hypothesized value, but this is rarely seen in the literature.) Most researchers find that the *t*-distribution is better for this purpose, so regression coefficients (*b*) are usually evaluated with a *t*-statistic that has an associated probability. If that probability falls below your alpha level, then you reject the null hypothesis.
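Here is a minimal sketch of that “signal to noise” ratio for a single coefficient. It covers only the simple case of one predictor, where the standard error of *b* is the square root of the residual mean square divided by the sum of squared deviations of X. With several predictors the standard errors also depend on how the predictors relate to one another, so this is not a general implementation. The data and the function name `coefficient_t` are invented for illustration.

```python
import math


def coefficient_t(x, y, hypothesized_b=0.0):
    """Return b, its standard error, t, and df for a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # the "signal": the estimated slope
    a = mean_y - b * mean_x  # intercept

    # Residual mean square: unexplained variation per degree of freedom.
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    df_res = n - 2
    ms_res = ss_res / df_res

    # The "noise": the standard error of b (one-predictor case only).
    se_b = math.sqrt(ms_res / sxx)

    # t compares the distance of b from the hypothesized value to its noise.
    t = (b - hypothesized_b) / se_b
    return {"b": b, "se_b": se_b, "t": t, "df": df_res}


# Hypothetical scores, made up for this example only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]

out = coefficient_t(x, y)
print(f"b = {out['b']:.3f}, SE = {out['se_b']:.3f}, "
      f"t({out['df']}) = {out['t']:.2f}")
```

With a single predictor, squaring this *t* gives the same value as the overall *F* in the ANOVA table, which makes a handy check on your output. The probability attached to *t* comes from the *t*-distribution with the residual degrees of freedom, and your statistics package reports it for you.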

Last Modified: 02/14/2019