t-Tests with Effect Size

Finding that a difference in a sample likely represents a difference in the population fails to answer one important question: How big is the difference? Rejecting the null hypothesis tells us only that there is very likely a difference between the means in the population. To understand the size of the difference, we must look at the actual means. A potential problem with examining the mean differences is the matter of scale. If one researcher is using a 100-point scale to measure intelligence, and another is using a 200-point scale, then the means are not comparable, nor are the mean differences. We are comparing apples and oranges.

One way to get around this problem is to standardize the differences by computing a measure of effect size. The most commonly reported measure of effect size for t-tests is a statistic known as Cohen’s d. Researchers use several variants of the formula, but the most basic method of computing d is to divide the difference between the two means by the standard deviation of the pretest scores. This expresses the difference between the means in standard deviation units. Recall that, in a normal distribution, nearly all scores fall between -3.00 and +3.00 standard deviations of the mean, so a difference of even one standard deviation is substantial. The takeaway is simple: d is a measure of the difference between two means, expressed in standard deviation units.
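The basic calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cohens_d and the scores are invented for the example, and the sample standard deviation (n - 1 in the denominator) is assumed for the pretest scores.

```python
from statistics import mean, stdev


def cohens_d(pretest, posttest):
    """Difference between the two means divided by the SD of the pretest scores."""
    sd_pre = stdev(pretest)  # assumption: sample SD (n - 1 denominator)
    if sd_pre == 0:
        raise ValueError("pretest scores have zero variance; d is undefined")
    return (mean(posttest) - mean(pretest)) / sd_pre


# Hypothetical scores, invented for illustration only.
pretest = [10, 12, 14, 16, 18]
posttest = [12, 14, 16, 18, 20]
d = cohens_d(pretest, posttest)

# The same data recorded on a scale twice as long: the raw mean difference
# doubles, but d stays the same because the standard deviation doubles too.
d_doubled = cohens_d([2 * x for x in pretest], [2 * x for x in posttest])

print(round(d, 3), round(d_doubled, 3))
```

Because the raw difference and the standard deviation change together when the scale changes, the two values of d agree, which is why d sidesteps the apples-and-oranges problem of comparing raw mean differences.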

Although the matter is hotly debated among researchers, a rough convention for describing effect sizes has developed. An effect size of less than 0.1 is considered “trivial,” an effect size between 0.1 and 0.3 is considered “small,” an effect size between 0.3 and 0.5 is considered “moderate,” and an effect size greater than 0.5 is considered “large.” When d is close to zero, there is essentially no effect. Note that, as with correlations, the sign of d indicates direction, not magnitude. Thus, an effect size of -0.5 is equal in magnitude to an effect size of +0.5.
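As a small sketch, the labels above can be written as a lookup on the absolute value of d. The function name describe_effect is hypothetical, and the handling of values that fall exactly on a cutoff (0.1, 0.3, 0.5) is an assumption, since the convention as stated does not say which label a boundary value receives.

```python
def describe_effect(d):
    """Label an effect size using the cutoffs stated in this section."""
    size = abs(d)  # the sign shows direction only, so classify on magnitude
    if size < 0.1:
        return "trivial"
    if size < 0.3:
        return "small"
    if size <= 0.5:  # assumption: exactly 0.5 is treated as "moderate"
        return "moderate"
    return "large"


for value in (0.05, -0.2, 0.4, -0.5, 0.5, 0.8):
    print(value, describe_effect(value))
```

Taking the absolute value first is what makes -0.5 and +0.5 receive the same label.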

Another commonly reported measure of effect size is r, which is interpreted in the same way as Pearson’s r. It is a correlation, and r² can be interpreted as the proportion of variance that the two variables share. As we will demonstrate below, if you have already done a t-test, you can compute r from its results even when the data violate the assumptions of Pearson’s r.
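One widely used conversion, which this section does not spell out and which is offered here only as a sketch, takes the t statistic and its degrees of freedom and computes r as the square root of t² / (t² + df). The function name r_from_t and the example values of t and df are invented for illustration.

```python
import math


def r_from_t(t, df):
    """Effect size r from a t statistic: sqrt(t^2 / (t^2 + df))."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return math.sqrt(t ** 2 / (t ** 2 + df))


# Hypothetical t-test result, invented for illustration only.
r = r_from_t(t=2.0, df=16)
print(round(r, 3), round(r ** 2, 3))  # r and the proportion of shared variance
```

Note that this version always returns a non-negative r, because t is squared; the direction of the effect has to be read from the sign of t or from the means themselves.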

It is important to understand that effect size and statistical significance are distinct concepts. When a social scientist conducts an experiment, it is hypothesized that the treatment will have an effect. Statistical significance says that, if the treatment had no effect, differences as large as those observed would occur by chance less than a specified percentage of the time (alpha, commonly 5%). That is, the observed differences between the means of the groups are very unlikely to be caused by random chance in the assignment process. The logic of hypothesis testing dictates that if chance is not to blame, then it must be the treatment causing the observed differences. This gives the researcher strong support that the treatment worked. Statistical significance is silent on the issue of how well or to what degree the treatment worked (although more “powerful” effects lend power to null hypothesis significance tests). In contrast, a measure of effect size like Cohen’s d addresses exactly that question: it tells us how big the effect is, in units of the sample standard deviation.
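The distinction can be illustrated with a short sketch. It assumes a different design from the pretest example: two independent groups of equal size n, with d computed from the pooled standard deviation. Under those assumptions the t statistic equals d multiplied by the square root of n / 2, so the same small effect yields a larger t as the groups grow. The function name t_for_fixed_d and the chosen values are hypothetical.

```python
import math


def t_for_fixed_d(d, n_per_group):
    """t for two equal-sized independent groups, given d from the pooled SD."""
    return d * math.sqrt(n_per_group / 2)


# Hold the effect size constant at d = 0.2 and vary only the sample size.
for n in (10, 50, 200, 1000):
    print(n, round(t_for_fixed_d(0.2, n), 2))
```

With large samples, a t of roughly 2 or more is significant at the .05 level in a two-tailed test, so the same d of 0.2 moves from clearly non-significant to clearly significant purely because the sample grew. The size of the effect never changed.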

What all this suggests is a two-part question that the researcher must ask: First, does the effect exist? We answer this question with a statistical significance test. If we reject the null hypothesis and accept the idea that the effect exists, then we must ask a second question: How big (magnitude) is the effect? We answer this by computing a measure of effect size.


Last Modified:  06/03/2021
