Section 5.4: Assumptions

Fundamentals of Social Statistics by Adam J. McKee

In statistics, the validity of a test statistic depends on several assumptions being met. Assumptions are conditions about the data that a particular statistic takes for granted. We must verify the assumptions associated with a particular statistic whenever we use it. Some statistics are considered robust against violations of certain assumptions.

This means that even when the assumption is violated, we can still have a high degree of confidence in the results. As a consumer of research, it is important to understand that if the assumptions of a statistical test are violated (and the test is not robust to that violation), then we cannot trust the statistical outcome or any discussion points the author makes about those results.

Assumptions are characteristics of the data that must be present for the results of a statistical test to be accurate.

Common Assumptions

All samples are randomly selected

Most statistical procedures cannot account for systematic bias. Randomly selected samples eliminate such bias and improve the validity of inferences made from statistical test results.

All samples are drawn from a normally distributed population

Most researchers do not fret over the normality requirement when comparing group means, because the effect of non-normality on p-values is usually very small. When a distribution of scores is not normal because of an outlier, however, the problem can be important to consider. Extreme scores have an outsized impact on the mean, as well as on variability and correlation. Recall what we said about the effects of extreme scores on the mean in previous sections. If an extreme score means that you should not use the mean, then a statistical test of mean differences makes no sense either. The brief sketch below illustrates the point.
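For readers who want to see this concretely, here is a minimal Python sketch (using NumPy, with made-up scores) showing how a single outlier drags the mean while barely moving the median:

```python
import numpy as np

# Ten hypothetical exam scores, plus a version with one extreme outlier added.
scores = np.array([72, 75, 78, 80, 81, 83, 85, 88, 90, 92])
scores_with_outlier = np.append(scores, 400)  # one wild score

print(np.mean(scores))                 # 82.4
print(np.mean(scores_with_outlier))    # ~111.3 -- dragged far upward
print(np.median(scores))               # 82.0
print(np.median(scores_with_outlier))  # 83.0 -- barely moves
```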

All samples are independent of each other

This assumption means that the scores in Group A are not systematically related to the scores in Group B. If you use the same person for multiple measures of the variable you are studying, then that person's scores will be correlated. Therefore, if we use a pretest-posttest type of design, we violate the assumption of independent samples. This means we have to use special statistical tests designed for correlated scores, often referred to as repeated measures tests; the sketch below contrasts the two kinds of tests. Random selection and random assignment to groups are usually considered sufficient to meet this assumption. Statistical tests are sometimes called "independent samples" tests if they carry this assumption.
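As a hedged illustration, this Python sketch (using SciPy, with simulated pretest-posttest scores) contrasts an independent-samples t-test with a repeated measures test on the same paired data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical pretest and posttest scores for the SAME ten people:
# the two sets of scores are correlated, so the samples are not independent.
pretest = rng.normal(70, 10, size=10)
posttest = pretest + rng.normal(5, 3, size=10)  # each person improves a bit

# Wrong tool: an independent-samples t-test ignores the pairing.
t_ind, p_ind = stats.ttest_ind(pretest, posttest)

# Right tool: a repeated measures (paired) t-test respects the pairing.
t_rel, p_rel = stats.ttest_rel(pretest, posttest)

print(f"independent-samples p = {p_ind:.4f}")
print(f"paired (repeated measures) p = {p_rel:.4f}")  # typically much smaller
```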

All populations have a common variance

This assumption is often referred to as the homogeneity of variance requirement. It applies only to some test statistics; the most common tests that carry this assumption belong to the ANOVA family. Data that meet the requirement have a special name: homoscedastic (pronounced 'hoe-moe-skuh-DAS-tik'). Data that violate this assumption (e.g., the two variances are not equal) are referred to as heteroscedastic. If you keep the treatment group and the control group around the same size (equal Ns), then violations of this assumption matter far less. Unequal variances combined with widely different sample sizes, however, will taint your results. One common check is shown below.
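One widely used way to check this assumption is Levene's test, available in SciPy. The sketch below uses simulated data, so the exact numbers are illustrative only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 5, size=30)   # smaller spread
group_b = rng.normal(50, 15, size=30)  # much larger spread

# Levene's test: the null hypothesis is that the variances are equal.
stat, p = stats.levene(group_a, group_b)
print(f"Levene W = {stat:.2f}, p = {p:.4f}")
# A small p-value suggests heteroscedasticity -- the assumption is violated.
```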

The Central Limit Theorem

The Central Limit Theorem (CLT) holds a paramount position in the realm of inferential statistics, serving as the foundation for many analytical techniques. The theorem asserts a compelling proposition: when one takes a sufficiently large sample from any population, irrespective of that population’s distribution shape, the distribution of the sample mean will gravitate towards a normal, or Gaussian, distribution. This revelation is not just of theoretical significance but has profound implications for the practice of statistical hypothesis testing.

One of the central aspects of hypothesis testing is the set of underlying assumptions that accompany each test. These assumptions, if met, ensure the validity of the conclusions derived from the test. One common assumption in many parametric tests, such as the t-test or ANOVA, is the normality of data. In real-world scenarios, especially in the social sciences, it’s not always guaranteed that data from populations will be perfectly normally distributed. This is where the power of the CLT comes into play.

Given the CLT’s assertions, even if the population data is not normally distributed, statisticians can feel confident that the means of their samples, as long as the samples are sufficiently large, will approximate a normal distribution. This alleviates concerns about violating the normality assumption in various hypothesis tests. In other words, the CLT provides a safety net, allowing researchers to utilize powerful parametric tests without incessantly worrying about the distribution of their original population.

Moreover, the CLT underscores the importance of understanding and checking the assumptions of statistical tests. While the theorem offers a level of protection against deviations from normality, it’s predicated on the condition of having a “sufficiently large” sample. This brings attention to another critical assumption in hypothesis testing: the sample size. The definition of “sufficient” can vary based on the original population distribution’s shape and spread. For some heavily skewed distributions, larger samples might be required to leverage the CLT effectively.

Significance in Statistical Analysis

The importance of the CLT cannot be overstated. Many statistical tests and procedures hinge on the assumption that data is normally distributed. While individual datasets or populations may not always follow this assumption, the CLT assures statisticians that the means of samples drawn from any population will have a distribution that tends towards normality. This greatly expands the applicability of various statistical tools.

Implications for Sample Size

One of the key stipulations of the CLT is the idea of a “sufficiently large” sample size. In many cases, a sample size of 30 or more is considered sufficiently large for the CLT to hold. However, this is a general guideline, and in certain situations, larger samples might be required to ensure the approximation of a normal distribution, especially if the original population distribution is particularly skewed or contains significant outliers.

Illustrating the CLT with Random Samples

To visualize the power and reliability of the CLT, one can conduct a simple experiment. By randomly drawing many samples from a non-normally distributed population (for instance, an exponential or a binomial distribution) and plotting the means of those samples, you'll notice a remarkable phenomenon: the distribution of the sample means will increasingly resemble a bell curve as the size of each sample increases, and the picture becomes smoother as you draw more samples.
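The sketch below is a minimal Python simulation using NumPy (the sample sizes and replication count are arbitrary choices). It draws repeated samples from a sharply skewed exponential population and summarizes the distribution of the sample means:

```python
import numpy as np

rng = np.random.default_rng(123)

# Draw 5,000 samples of size n from a skewed exponential population,
# then look at the distribution of the 5,000 sample means.
for n in (2, 30, 200):
    means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    # For an exponential population with scale 1: mean = 1, sd = 1,
    # so the CLT predicts the sample means cluster near 1 with sd 1/sqrt(n).
    print(f"n={n:4d}  mean of means={means.mean():.3f}  "
          f"sd of means={means.std():.3f}  predicted sd={1/np.sqrt(n):.3f}")
```

Even though the parent population is heavily skewed, the sample means cluster symmetrically around the population mean once n is moderately large, and their spread matches the CLT's 1/√n prediction.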

Sampling Distribution Variability

Another crucial aspect of the CLT pertains to the variability of the sampling distribution. The standard deviation of this distribution, termed the "standard error," is inversely proportional to the square root of the sample size: SE = σ / √n, where σ is the population standard deviation and n is the sample size. This means that as sample size increases, the spread or variability of the sampling distribution decreases, leading to more precise estimates of the population parameter.
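A quick simulation makes the 1/√n relationship visible. This Python sketch (with an arbitrary population standard deviation of 15, chosen only for illustration) compares the empirical standard error to the theoretical formula:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 15.0  # assumed population standard deviation

for n in (25, 100, 400):
    # Empirical standard error: the sd of many simulated sample means.
    means = rng.normal(100, sigma, size=(10000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical SE={means.std():.2f}  "
          f"theoretical SE={sigma/np.sqrt(n):.2f}")
# Quadrupling the sample size cuts the standard error in half.
```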

Central Limit Theorem and Skewed Data

Even if the original population distribution is heavily skewed or has a distinctive non-normal shape, the CLT’s magic ensures that the sampling distribution of the mean becomes increasingly bell-shaped with larger samples. However, for such markedly skewed distributions, a larger sample size might be necessary to achieve a near-normal distribution of the sample mean.

Applications in Real-World Research

In practical research scenarios, especially in the social sciences, populations might not always adhere to a normal distribution. Data can often be skewed due to various factors. Here, the CLT provides solace to researchers, as they can be confident that, with a decent sample size, the means of their samples will follow a normal distribution, allowing them to conduct parametric tests and draw valid conclusions.

Limitations and Considerations

While the Central Limit Theorem offers a robust theoretical foundation, it’s essential to approach its application judiciously. Not all statistical tests rely on the sampling distribution of the mean; some might be based on other statistics, for which the CLT might not apply. Additionally, while the CLT addresses the distribution of the sample mean, it does not change the distribution of the original data. Hence, data transformations or non-parametric tests might be needed in certain analytical scenarios.
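For instance, with small, skewed samples a researcher might transform the data toward symmetry or reach for a non-parametric test. The following SciPy sketch (on simulated, hypothetical reaction times) shows both options:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(99)

# Two small, skewed samples (hypothetical reaction times, in seconds).
group_a = rng.exponential(scale=1.0, size=12)
group_b = rng.exponential(scale=1.5, size=12)

# Option 1: transform the data toward symmetry, then use a parametric test.
t, p_t = stats.ttest_ind(np.log(group_a), np.log(group_b))

# Option 2: use a non-parametric test that makes no normality assumption.
u, p_u = stats.mannwhitneyu(group_a, group_b)

print(f"t-test on log-transformed data: p = {p_t:.4f}")
print(f"Mann-Whitney U test: p = {p_u:.4f}")
```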

Beyond the Mean: Other Statistics

While the CLT is most commonly associated with the sampling distribution of the mean, it’s worth noting that it can be applied to other sample statistics as well. For instance, sums, proportions, and variances, under certain conditions and given large enough sample sizes, will also have sampling distributions that approximate normality.
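As a brief illustration of the proportion case, this Python sketch (with an arbitrary true proportion of 0.30) shows that simulated sample proportions match the normal approximation, which predicts mean p and standard deviation √(p(1−p)/n):

```python
import numpy as np

rng = np.random.default_rng(5)
p_true, n = 0.30, 100  # assumed population proportion and sample size

# 10,000 sample proportions, each from a sample of n yes/no responses.
props = rng.binomial(n, p_true, size=10000) / n

print(f"mean of proportions = {props.mean():.3f} (predicted {p_true})")
print(f"sd of proportions   = {props.std():.3f} "
      f"(predicted {np.sqrt(p_true * (1 - p_true) / n):.3f})")
```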

In the landscape of inferential statistics, few theorems hold the stature and applicability of the Central Limit Theorem. By ensuring that sample means from even non-normally distributed populations tend towards a normal distribution, the CLT lays the groundwork for numerous statistical tests and methodologies, underscoring its pivotal role in data-driven research and decision-making.

Key Terms

Hypothesis, Sample, Population, Generalization, Inference, Test Statistic, Research Hypothesis, Null Hypothesis, p-values, Alpha Level, Type I Error, Type II Error, Power, Assumptions, One-tailed Test, Two-tailed Test

