Transforming Data

Fundamentals of Social Statistics

 

Recall the idea of a best-fit line. The best-fit line is most commonly drawn (such as in Excel) using a set of equations that assume the line will be straight. Because of this mathematical constraint, a basic assumption of ordinary least squares regression is that the relationship between X and Y is best described by a straight line. Sometimes, data that are not well described by a straight line can be “tweaked” to fit one using transformations. Data transformations can also make data “cleaner” by making skewed distributions more symmetrical.

More formally, then, transformation is the replacement of a variable by a function of that variable, for example, replacing a variable X with the square root of X or the logarithm of X. The purpose of such a replacement is to change the shape of a distribution or relationship.

A common transformation is the log transformation. Taking the log of raw data points pulls the scores in the dataset closer together, and it compresses large values much more than small ones. Because of the “magic of compounding,” a straight line often does a poor job of describing quantities that grow by a percentage over time, such as the population of a particular place (or organization) or the money in your retirement account. If we take the log of the population value, a straight line often does a good job of describing the data, because a constant percentage gain becomes a constant step on the log scale.
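The effect of logging compound growth can be checked with a small sketch. The town, its starting population, and the 5% growth rate are all made up for illustration, and `fit_line` is a hypothetical helper written here rather than a library function.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical town: 1,000 residents growing 5% per year for 30 years.
years = list(range(31))
population = [1000 * 1.05 ** t for t in years]

# Straight line fitted to the raw population values.
_, _, r2_raw = fit_line(years, population)

# Straight line fitted to the logged population values.
log_population = [math.log(p) for p in population]
_, slope_log, r2_log = fit_line(years, log_population)

# On the log scale the slope is log(1 + growth rate), so we can recover the rate.
growth_rate = math.exp(slope_log) - 1

print(f"R^2 raw: {r2_raw:.4f}  R^2 logged: {r2_log:.4f}  growth rate: {growth_rate:.4f}")
```

With perfectly compounding data like this, the logged series falls exactly on a line; real populations will only do so approximately.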

A very common reason for conducting a transformation is simply convenience. In other words, a transformed scale may be as natural (that is, as intuitive to interpret) as the original scale and more convenient for a specific purpose (e.g., percentages rather than the original data). One common example in statistics is standardization, whereby scores are adjusted for differing level and spread. Standardized values have a mean of zero and a spread (usually expressed in standard deviation units) of 1, and they have no units. Therefore, standardization is useful for comparing variables expressed in different scales. Most commonly, a standard score is calculated by subtracting the variable’s mean from each score and dividing by its standard deviation (SD).
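As a minimal sketch of standardization, the following converts made-up heights and incomes to z-scores. The data and the `standardize` helper are hypothetical, and the sample SD (n - 1 denominator) is an assumption; some texts divide by the population SD instead.

```python
import statistics

def standardize(values):
    """Convert raw scores to z-scores: (score - mean) / SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (an assumption)
    return [(v - mean) / sd for v in values]

# Made-up measurements for the same five people on very different scales.
heights_cm = [150.0, 160.0, 165.0, 172.0, 188.0]
heights_in = [h / 2.54 for h in heights_cm]  # same heights, different unit
incomes_usd = [21000.0, 35000.0, 42000.0, 58000.0, 120000.0]

z_cm = standardize(heights_cm)
z_in = standardize(heights_in)
z_income = standardize(incomes_usd)

for label, z in [("height (cm)", z_cm), ("height (in)", z_in), ("income", z_income)]:
    print(label, [round(v, 2) for v in z])
```

The centimeter and inch versions print the same z-scores, which is what "no units" means in practice, and heights can now be compared with incomes on a common scale.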

It is important to note that standardization makes no difference to the shape of a distribution. That is, a skewed distribution of scores will remain skewed when standardized. Most transformations made for convenience’s sake are convenient because they remove confusion caused by scales and provide a “pure” number that has an intuitive interpretation. This is the same reason educators most often report grades to students as a percentage rather than a number of correct answers: we all know intuitively what a grade of 85% means. Standardized scores work the same way for researchers, who have been trained to think in the metric of z-scores. Educators would be just as comfortable if you converted those scores to GPA units, and psychologists who do a lot of testing are equally familiar with T-scores.
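The claim that standardizing leaves shape alone can be checked numerically. This sketch uses made-up scores and a hypothetical `skewness` helper (the moment coefficient of skewness), and it assumes the usual T-score convention of a mean of 50 and an SD of 10.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: positive means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, clearly right-skewed raw scores.
raw = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]

mean = statistics.mean(raw)
sd = statistics.pstdev(raw)

z_scores = [(v - mean) / sd for v in raw]   # mean 0, SD 1
t_scores = [50 + 10 * z for z in z_scores]  # mean 50, SD 10 (assumed convention)

skew_raw = skewness(raw)
skew_z = skewness(z_scores)
skew_t = skewness(t_scores)

print(f"skewness raw: {skew_raw:.3f}  z: {skew_z:.3f}  T: {skew_t:.3f}")
```

All three skewness values agree, because z-scores and T-scores only shift and rescale the original numbers.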

Another important reason that researchers often transform data is to reduce skewness.  A distribution that is symmetrical (or nearly so) is often easier to “handle” and interpret than a skewed distribution. More specifically, a normal or Gaussian distribution is often desirable because it is a fundamental assumption of many statistical methods.  The assumptions of statistical tests are critically important to the researcher because the results of statistical tests are only accurate if the assumptions are met.  To reduce right skewness, take roots or logarithms or reciprocals (roots are generally regarded as the weakest transformation method).  To reduce left skewness, take squares or cubes or higher powers.
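A rough sketch of both directions follows. The data are artificial and chosen so the effect is easy to see, and `skewness` is a hypothetical helper. Note that logarithms require strictly positive values, and square roots require non-negative ones.

```python
import math

def skewness(values):
    """Moment coefficient of skewness: positive = right skew, negative = left skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Artificial right-skewed data: each value doubles the previous one.
right_skewed = [1, 2, 4, 8, 16, 32, 64]
skew_raw = skewness(right_skewed)
skew_sqrt = skewness([math.sqrt(v) for v in right_skewed])  # milder correction
skew_log = skewness([math.log(v) for v in right_skewed])    # stronger correction

# Artificial left-skewed data: square roots of evenly spaced values.
left_skewed = [math.sqrt(k) for k in range(1, 10)]
skew_left = skewness(left_skewed)
skew_squared = skewness([v ** 2 for v in left_skewed])      # squaring undoes the skew

print(f"right skew  raw: {skew_raw:.3f}  sqrt: {skew_sqrt:.3f}  log: {skew_log:.3f}")
print(f"left skew   raw: {skew_left:.3f}  squared: {skew_squared:.3f}")
```

Here the square root removes only part of the right skew while the log removes all of it, which illustrates why roots are considered the weaker choice. Real data will rarely come out this cleanly.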

Another reason that researchers employ transformations is to “equalize spreads.”  The condition in which each data set or subset has about the same spread or variability is called homoscedasticity.  This condition of “similar spreadoutness” is desirable because it is an important assumption of many statistical tests.  The opposite condition is called heteroscedasticity.  Heteroscedastic data are a problem because they violate the homoscedasticity requirement of many statistical tests.
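One situation where a transformation equalizes spreads is when the spread of each group grows in proportion to its level. The sketch below uses three made-up groups built that way; the group names and values are purely illustrative.

```python
import math
import statistics

# Made-up groups whose spread grows in proportion to their level:
# each group is the previous one multiplied by 10.
groups = {
    "small": [8.0, 10.0, 12.5],
    "medium": [80.0, 100.0, 125.0],
    "large": [800.0, 1000.0, 1250.0],
}

# Standard deviation of each group on the raw scale.
raw_sd = {name: statistics.stdev(vals) for name, vals in groups.items()}

# Standard deviation of each group after a base-10 log transformation.
log_sd = {
    name: statistics.stdev([math.log10(v) for v in vals])
    for name, vals in groups.items()
}

for name in groups:
    print(f"{name:>6}  raw SD: {raw_sd[name]:8.2f}  log SD: {log_sd[name]:.4f}")
```

On the raw scale the standard deviations differ a hundredfold between the smallest and largest groups; on the log scale they are equal. Data whose spread does not scale with level may need a different transformation, or none.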

When looking at relationships between variables, it is often far easier to think about patterns that are approximately linear than about patterns that are highly curved. Linearity, then, is vitally important when using linear regression, which amounts to fitting a straight line to data points. For example, a scatterplot of the logarithms of a series of values against time usually has the property that periods with constant rates of change plot as straight lines.  The raw data, on the other hand, will tend to be curved.

Creating additive relationships is another reason that researchers transform data.  Relationships are often easier to analyze when they are additive rather than multiplicative (or of some other form).  For example, our basic and beloved simple regression equation Y’ = a + bX is much easier to interpret than a multiplicative model such as Y’ = aX^b.  Taking logarithms of both sides turns the latter into log Y’ = log a + b log X, which is a straight line in log X.  In other words, it may be better to use a data transformation to eliminate the curve than to model the curve explicitly.
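To see how a multiplicative model becomes additive, the sketch below generates exact data from a power model with a = 3 and b = 0.5 (both made up), logs both variables, and fits a straight line. The `least_squares` helper is hypothetical and written here for illustration.

```python
import math

def least_squares(xs, ys):
    """Simple regression of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data that follow a power model exactly: Y = 3 * X ** 0.5.
x = [1, 2, 4, 8, 16, 32]
y = [3 * v ** 0.5 for v in x]

# Log both sides: log Y = log a + b * log X, an additive straight-line model.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
intercept, b_hat = least_squares(log_x, log_y)

# The intercept is log a, so exponentiate to return to the original scale.
a_hat = math.exp(intercept)

print(f"estimated a: {a_hat:.3f}  estimated b: {b_hat:.3f}")
```

The fitted slope is the exponent of the power model and the exponentiated intercept is its multiplier, so the familiar a + bX machinery does all the work.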

In practice, a transformation often works, serendipitously, to accomplish several of these purposes at once, particularly to reduce skewness, to produce nearly equal variances, and to produce a nearly linear (or additive) relationship.  It is important to note that this fortuitous state of affairs is not guaranteed.  Often, the best approach to choosing a transformation is to plot the raw scores and see whether a straight line does a good job.  If not, then you can try various transformations to see if the fit improves.  The main criterion in choosing a transformation is what works with the data.  It is important to consider two other questions:

  1. What makes sense to researchers and consumers of research in the field?
  2. Can the researcher keep dimensions and units simple and convenient?

If possible, prefer measurement scales that are easy to think about. Often, however, somewhat complicated units are a sacrifice that has to be made so that statistical assumptions can be met. When lack of linearity is the problem, an alternative idea is to model the curve, but this only works well if there are only a couple of “bends” in the line.
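The try-it-and-see approach to choosing a transformation can be sketched as a small comparison. Everything here is hypothetical: the data, the candidate list, and the `r_squared` helper. Only X is transformed, so the R-squared values all describe the same Y and can be compared directly; comparing fits across different transformations of Y is less straightforward.

```python
import math

def r_squared(xs, ys):
    """Squared correlation between xs and ys (R^2 of a straight-line fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Made-up outcome that rises quickly and then levels off, with a small
# alternating wobble standing in for noise.
x = list(range(1, 21))
y = [2 + 3 * math.log(v) + (0.05 if v % 2 == 0 else -0.05) for v in x]

# Candidate transformations of X to try.
candidates = {
    "raw": lambda v: float(v),
    "sqrt": math.sqrt,
    "log": math.log,
}

# Fit a straight line after each transformation and record how well it does.
fit = {name: r_squared([f(v) for v in x], y) for name, f in candidates.items()}
best = max(fit, key=fit.get)

for name, value in fit.items():
    print(f"{name:>4}: R^2 = {value:.4f}")
print("best straight-line fit:", best)
```

A higher R-squared is only one criterion. The two questions above, about what makes sense in the field and whether the units stay manageable, still apply to whichever transformation wins.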



Last Modified:  02/14/2019
