Advanced Statistical Analysis:
Adam J. McKee, Ph.D.
This content is released as a draft version for comment by the scholarly community. Please do not distribute as is.
Statistics and Social Science
As undergraduates in the Social Sciences, we’ve all taken an introductory-level statistics course that focuses on descriptive and bivariate techniques. Such courses discuss the logic of inferential statistics, and basic computations are often performed using simple t-test and ANOVA designs. The astute student is often frustrated by such courses because the content doesn’t align with what they see in the professional literature. Journal articles present descriptive statistics as a matter of course, but quickly move on to more advanced methods. Because the social world is complex, simple models that use only a couple of variables rarely do an adequate job of explaining real-world observations. While some commentators (most notably statistics and research methods students) have accused social scientists of needlessly complicating matters, there is an undeniable complexity to human behavior, and explaining it requires considering many variables. This course is designed to bridge the gap between the very basic methods that foster an understanding of the fundamentals and logic of social research and the application of those methods in real-world research that seeks to answer questions about the social world and solve social problems.
This course is designed to provide both a conceptual and practical understanding of multivariate (meaning more than two variables) statistical techniques to students who do not need or want a deeper understanding of the mathematical processes by which results are derived. In other words, we will not be doing matrix algebra and iterative processes with pencils and notebook paper. My focus in this course has a decided bent toward intelligent consumerism. When you successfully complete this course, you will have an understanding of the techniques most commonly used in social research as presented in peer-reviewed journal articles. We’ll consider what researchers mean when they talk about “partitioning variance” and how to interpret the results of that endeavor, but we will rely on computer scientists and mathematicians to provide us with the software necessary to achieve these results.
The inner workings of statistical software packages such as SPSS and AMOS are beyond the scope of this course (and, quite frankly, beyond my personal comprehension). As social scientists, all we need in order to conduct our research is to ensure that quality data go into the software and that we accurately interpret the results it provides. I’ve spent many years in the ivory tower, and I’ve never seen a graduate student in the social sciences asked to perform matrix algebra operations during comprehensive examinations or during a dissertation defense. In short, this class is based on concepts, and as social scientists, you should be very comfortable with such abstractions. So, as we move forward, take a deep breath, keep calm, and don’t panic.
This course is designed to accommodate students from a wide variety of disciplines and from diverse backgrounds and levels of preparation for an advanced statistics course. Education, criminal justice, sociology, and the balance of the social and behavioral sciences may seem very different in the problems that they solve and how they solve them, but the differences in how they solve them are smaller than one might suppose. We all adhere to the scientific method, and that overarching method circumscribes what research methods we may use. These limitations extend into the realm of quantitative data analysis.
What we find on closer examination is that most social scientists stick to a very narrow group of research methods, and this translates to a narrow range of statistical techniques that we all can use. In fact, a major theme of this course is to point out an identical underlying logic to all of the various permutations of data analysis that we can perform. We will come to see that all of the seemingly different statistical methods out there are just variations on the same basic techniques. The permutations are necessary because different researchers have different kinds of data, and technical (math) issues mean that different statistical techniques must be used. The differences are usually not theoretical or conceptual.
From the “intelligent consumerism” perspective, our objective is to impart the knowledge and skills you need to critically assess data analyses presented in the professional literature. Another requirement common to graduate programs across all disciplines is the ability to conduct a quality literature review. It is crucial for you to understand what statistical methods were used, and to appropriately interpret the results of each study, so that you can successfully synthesize the existing body of knowledge in your own research. This skill is critical in publishing research in peer-reviewed journals, and the literature review is usually the longest and most time-consuming part of writing theses and dissertations. This means that these skills are critical to you as a graduate student, and will remain important throughout your career. Even practitioners who do not conduct research must be intelligent consumers of research if they are to incorporate the latest findings into their practice.
It follows from the above discussion that our major objectives can be easily enumerated. After finishing this course, you will be able to:
- Articulate the conceptual similarities and differences between commonly used multivariate statistical techniques.
- Interpret the published quantitative research in the field’s peer-reviewed, professional literature.
- Choose and utilize an appropriate multivariate statistical technique in conducting original quantitative research (e.g., dissertations and theses).
This third objective provides us with a sort of template for studying the various methods that we’ll cover. I hope that when you have finished the course, you will be able to apply your general knowledge of multivariate analysis and our basic method of inquiry to teach yourself any of the advanced methods that are beyond the scope of this course. When considering any multivariate technique, I suggest you consider the following five questions:
- What are the limitations of the technique?
- What are the basic assumptions of the technique?
- What are the basic steps necessary to move from raw data to interpretable results?
- How are the results interpreted?
- How should the results be presented for publication?
As already discussed, this course will not utilize any high-level math skills. We’ll defer to the computer scientists and mathematicians who design the amazing software for which we social scientists are eternally grateful. There are many, many different statistical packages in the software market, and there are many programs that can perform statistical calculations among their other functions. Because the statistics-specific packages are very expensive, we will focus on the “how to” of common spreadsheet programs that can perform at least some of the techniques that we are interested in using.
For other techniques that can only be performed with expensive, specialized software, we will take a more conceptual approach, focusing on assumptions, common pitfalls, and interpretation. My anecdotal observations suggest that the most common statistical method used across the disciplines is regression analysis (OLS). This method is very versatile and can be computed using common software such as Microsoft’s Excel and Google’s Sheets. While you may need to purchase specialized software if you decide to use a different advanced technique in your dissertation research, we will not be using any of those in this course. You will be required to use either Excel or Sheets, so familiarity with those is important.
About Barriers to Understanding
I have taught statistics more than any other course in the curriculum, largely because of the sheer demand for the course and the habit of undergraduates dropping the course the first time they are exposed to dreaded numbers. Many students report that the more statistics they are exposed to, the more overwhelming it all seems. There is simply too much to “keep straight.” I argue that this is not the fault of statistics as a subject of study, but rather a fault of how social science professors and textbook authors have organized the subject. The big problem, I maintain, is not with the subject matter but with the pedagogy.
I am trained in criminal justice and justice administration; I (like most other PhDs) am a subject matter expert, steeped in research methods and statistics. I have not, however, been formally trained in pedagogy. My teaching is a product of historical accident, reading about best practices on my own, and trial by fire. I lack the theoretical expertise to properly explain my thesis, so I ask your kind indulgence as I muddle through what I think about this. The major thrust of my argument is this: Learning takes place best within hierarchical frameworks. To fall back on a tired analogy, students can’t be bothered to learn about individual trees when they don’t know what the forest looks like or know why they care about that forest in the first place. All too often, statistics is presented as a collection of trees, and the tree concept is seldom related back to the forest concept.
Things move along pretty well in a basic statistics course while we are learning to organize and summarize data with descriptive statistics. We struggle a bit when hypothesis tests about mean differences are introduced. We see the immediate problem as twofold: We have to remember how to compute t and F, and we have to remember how to determine which t or F we need. The fundamental reasoning behind doing such a thing escapes us as we deal with those more pressing concerns. Each type of hypothesis test is a tree, and we need to know a lot about each tree. Then we move on to the seemingly unrelated topic of correlation, replete with scatterplots and coefficients. More trees. It wasn’t until my epiphany late in graduate school that all of those trees started to form a forest; the forest made sense, and fit well with my understanding of the social scientific endeavor. All of those trees sure got in the way!
The forest of statistics is pretty simple. The key to understanding it is to rethink what we are doing in terms of the social scientific endeavor. We have questions about the social world. Those questions usually center on how and why things happen. This really means that we want answers to causal questions about changes in behaviors. These changes can be within or between individuals. At the core of statistical analysis is the need to describe the variance in the variable we are interested in (the DV) and chop up that variance into pieces based on what caused that particular variance. Analysis of Variance is a great name! It describes exactly what we are attempting to do. It is unfortunate that this name was bestowed on a small cross-section of statistical methods because, in the end, analyzing (partitioning) variance is the goal of all statistics.
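To make “chopping up variance” concrete, here is a minimal sketch in Python. It is an illustration only: the scores and group names are invented, and the function name `partition_variance` is my own assumption rather than part of any statistics package. It splits the total sum of squares in a DV into a piece associated with group membership and a piece left over within groups.

```python
from statistics import mean

# Hypothetical DV scores for two groups (invented for illustration only).
groups = {
    "control": [4.0, 5.0, 6.0, 5.0],
    "treatment": [7.0, 8.0, 9.0, 8.0],
}

def partition_variance(groups):
    """Split the total sum of squares into between- and within-group parts."""
    all_scores = [y for scores in groups.values() for y in scores]
    grand_mean = mean(all_scores)
    # Total variability of the DV around the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
    # Variability associated with group membership (the IV).
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
    # Variability left over inside each group.
    ss_within = sum((y - mean(s)) ** 2 for s in groups.values() for y in s)
    return ss_total, ss_between, ss_within

ss_total, ss_between, ss_within = partition_variance(groups)
share_explained = ss_between / ss_total  # often reported as eta squared
print(f"total={ss_total:.2f} between={ss_between:.2f} within={ss_within:.2f}")
print(f"share associated with group membership: {share_explained:.2f}")
```

The two pieces always add back up to the total, which is the sense in which the variance has been partitioned. Whether the between-group piece reflects a cause is a question of research design, not of the arithmetic.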
It does not help that somewhat cryptic hypothesis tests (e.g., the variants of t) do not readily show that partitioning variance is what we are really doing. Those are about mean differences, right? In a strict sense that may be true, but it does nothing for our understanding of the larger questions. When we step back from the particulars of computing t and stating our results in terms of the mean, we can return to our original question and remember that we really want to know whether variance in the dependent variable is actually covariance with the treatment. In other words, is the variance in the DV caused by the variance (manipulation) of the IV?
It would be much more useful (in my opinion) to understand that regression analysis is a streamlined version of what mathematicians and statisticians refer to as the General Linear Model (GLM) and can do everything that t-tests and ANOVAs can do, with a number of added benefits. The only real reason those “mean difference tests” exist is that, in a bygone age, they were very easy to compute with paper and pencil. (They are arguably still simpler when we have dependent data, such as with repeated measures designs.) In the modern age of technology and ubiquitous computing power, courses that focus on hand computations of irrelevant subsets of the GLM are effete. It makes sense to teach those methods as special cases of the GLM. Since they are still found in the professional literature, it would be a disservice to students to abandon them entirely; as the endgame of undergraduate statistics courses, however, they cause harm.
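The claim that regression can do what a t-test does can be checked with a small sketch. The Python below assumes invented scores, and its function names (`pooled_t`, `dummy_regression_t`) are my own. It computes the classic equal-variance independent-samples t, then regresses the DV on a 0/1 dummy variable for group membership and computes the t statistic for the slope.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical DV scores (invented for illustration only).
control = [4.0, 5.0, 6.0, 5.0, 7.0]
treatment = [7.0, 8.0, 9.0, 8.0, 6.0]

def pooled_t(a, b):
    """Classic independent-samples t statistic, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

def dummy_regression_t(a, b):
    """OLS of the DV on a 0/1 group dummy; returns the slope and its t statistic."""
    x = [0.0] * len(a) + [1.0] * len(b)  # 0 = control, 1 = treatment
    y = list(a) + list(b)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, then the standard error of the slope.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(sse / (len(y) - 2) / sxx)
    return slope, slope / se_slope

t_classic = pooled_t(control, treatment)
slope, t_regression = dummy_regression_t(control, treatment)
print(f"t from the t-test:      {t_classic:.4f}")
print(f"t from the regression:  {t_regression:.4f}")
print(f"slope (mean difference): {slope:.4f}")
```

Under these assumptions the slope equals the difference between the two group means, and the two t statistics agree apart from floating-point rounding. The equivalence shown here is for the equal-variance, independent-groups case only.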
While we are delving deeply into my forest analogy, let us move the camera up to about 35,000 feet and look at the entire picture. I assume in writing this text that you are familiar with basic correlations. I’ve presented correlation (and all of the mean difference tests) as simplified “special cases” of regression analysis. We start with correlations because they are simpler to understand and there aren’t nearly as many things that you can change to accommodate different research questions and different data. Consider this: Correlation is to regression analysis as regression analysis is to Structural Equation Modeling (SEM). Aside from certain nonparametric methods and nested data methods (HLM), SEMs are the general case of which every other statistic you’ve ever heard of is merely a special case. That’s right; with SEMs you can do t-tests, ANOVA tests, ANCOVA, regression analysis, and so forth. This isn’t very surprising when you remember that all of these tests chop up variance into pieces and try to explain the source and size of each piece.
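The first half of that analogy, correlation as a special case of regression, can be sketched in a few lines of Python. The data and the function name `correlation_and_regression` are invented for illustration. With a single predictor, the squared Pearson correlation equals the regression R-squared, and the regression slope computed on standardized scores equals the correlation itself.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired observations (invented for illustration only).
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
exam_score = [52.0, 55.0, 61.0, 60.0, 68.0, 70.0]

def correlation_and_regression(x, y):
    """Return Pearson r, the regression R-squared, and the standardized slope."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / sqrt(sxx * syy)  # Pearson correlation
    slope = sxy / sxx          # OLS slope in raw units
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy  # share of DV variance explained
    standardized_slope = slope * sqrt(sxx / syy)  # slope in z-score units
    return r, r_squared, standardized_slope

r, r_squared, standardized_slope = correlation_and_regression(hours_studied, exam_score)
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, regression R-squared = {r_squared:.4f}")
print(f"standardized slope = {standardized_slope:.4f}")
```

This covers only the simplest case of one predictor and one outcome. The SEM half of the analogy requires specialized software and is not attempted here.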
I hope that this section has shone a light on a few important matters that will aid you in establishing a mental framework on which to hang the information presented in the balance of the text. I also hope that it has justified my seeming obsession with regression analysis. Every other type of parametric test is essentially a special case of regression, so, given modern computing power, there is simply no need to resort to those archaic methods you probably spent a lot of time fretting over as an undergraduate. It seems better to go straight for the most versatile and most powerful member of the family of analytical techniques. That, you probably observed, is Structural Equation Modeling, not regression analysis. You would be correct if that were your observation. SEMs, however, require a massive amount of computing power, very specialized and expensive software, and a lot of information from the user that regression handles by making assumptions about the data. Regression analysis provides an easy-to-use yet powerful and very versatile analytical tool that is available free of charge to the masses. It is understood by almost everyone in the ivory tower, and is acceptable to journal editors everywhere. As our proxy for the GLM, it is the key to understanding advanced data analysis on a fundamental level.
File Created: 08/09/2018 Last Modified: 08/09/2019
This work is licensed under an Open Educational Resource-Quality Master Source (OER-QMS) License.