Summary Statistics

In statistics, we want to gather data about the real world and then analyze it. What we don't want is to have to recite a long-winded spiel whenever someone asks us for our results. We want a few compact ways to summarize our data. It's fitting, then, that we have summary statistics to help us do that.

In The Center Ring

The first two summary stats we'll work with are old acquaintances of ours. They've agreed to help us out, as long as we forget about the twenty bucks they owe us. The first one is the mean, aka the average, aka the fair cookie giver. Wait, we almost forgot one of its nicknames: x̄. It's pronounced "x-bar," and we don't really get it either. Whenever the topic is statistics, the mean refuses to go by any other name.

The mean's rival is the median. They both measure the "central tendency" of a dataset. That's a fancy way of saying "where the center is." You might think that we don't need different ways to find the center, that we could just point to the middle and go, "There it is." It's more complicated than that. Not rocket surgery levels of complicated, though.

The difference between the two is that the mean has a tendency to wander off towards any far-out data points. If the data is nice and symmetric, the mean and median agree with each other.

1, 2, 3, 4, 5

We've arranged this highly realistic dataset from lowest to highest, so it's easy to see that the median is 3. Taking the average (sorry, we mean x̄) also gives us a 3 for the data's center.

1, 2, 3, 5, 5

Now our data is slightly skewed to the side. Just a smidge. The median is still sitting at 3, but now the mean has bumped up to 3.2. The higher mean tells us that our data points are a little larger than before.

1, 2, 3, 4, 1000

Okay, that is a bona fide outlier. It is way out there, and it has dragged the mean all the way up to 202. So which one is the center of the dataset, the mean or the median? We hate to go all Zen on you, but neither of them is the center; each of them is a center.

They both tell us something about the middle of the data, but they define the middle in different ways. The mean is a lot easier to calculate, and is usually pretty informative, but when there are huge outliers or lots of skew, the median has its time to shine.
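If you'd like to check the three examples above without doing the arithmetic by hand, here is a minimal sketch using Python's built-in statistics module. The dataset labels are our own invention, not official terminology.

```python
from statistics import mean, median

# The three example datasets from this section.
# The labels are made up for this sketch.
datasets = {
    "symmetric": [1, 2, 3, 4, 5],
    "slightly skewed": [1, 2, 3, 5, 5],
    "with outlier": [1, 2, 3, 4, 1000],
}

# Pair up the mean and the median of each dataset.
centers = {
    label: (mean(data), median(data))
    for label, data in datasets.items()
}

for label, (avg, mid) in centers.items():
    print(f"{label}: mean = {avg}, median = {mid}")
```

The median should stay parked at 3 in all three cases, while the mean drifts from 3 to 3.2 to 202 as the biggest value gets pulled farther out.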

Spread It Around

The last example shows that knowing the middle of a dataset isn't enough. What's missing is some idea of how spread out the data are. There's a summary statistic for that, and we call it the standard deviation. It tells us how much the data tend to deviate (i.e., differ) from the mean. Are they peas in a pod, or peas in an airplane hangar?

Sample Problem

Find the standard deviation of this dataset: {1, 2, 3, 4, 5}.

Let's take this one step at a time. The formula for the standard deviation, s, looks more intimidating than it actually is. It's really a fuzzy wuzzy teddy bear, so don't start screaming when you see it. That would just be rude.

s = √[ Σ(xᵢ – x̄)² / (n – 1) ]

Let's break this down into some smaller, easier-to-handle parts. The smallest, easiest-to-handle part is the mean. Reverse spoilers, we already gave it to you, and it's 3. But we're going to show the calculations anyway, because we can.

x̄ = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3

Now to deal with (xᵢ – x̄). xᵢ refers to the ith point in our dataset. For us, 1 is the first point, 2 is the second, and so on. That's so convenient we forgot to laugh. Anyway, we have to take every data point and subtract the mean from it. We start with n data points, and we'll end with n (xᵢ – x̄) values.

x    (x – x̄)
1    -2
2    -1
3    0
4    1
5    2

Our formula has a Σ in it, which means we'll add all our points up. We can't do that yet, though. We'd get a big fat 0, and that would do us no good. Instead, we'll square each of our n values, then add them up.

x    (x – x̄)    (x – x̄)²
1    -2         4
2    -1         1
3    0          0
4    1          1
5    2          4
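Here is a small sketch of why we bother squaring. It rebuilds both columns of the table for our dataset and adds each one up. The variable names (x_bar, deviations, squared) are just our picks for this illustration.

```python
data = [1, 2, 3, 4, 5]

# The mean, x-bar.
x_bar = sum(data) / len(data)

# Middle column of the table: each point minus the mean.
deviations = [x - x_bar for x in data]

# Last column of the table: each deviation, squared.
squared = [d ** 2 for d in deviations]

# The raw deviations cancel each other out...
print(sum(deviations))

# ...but the squared ones can't, because none of them is negative.
print(sum(squared))
```

The negatives and positives in the middle column wipe each other out, which is the big fat 0 we were worried about. Squaring makes every entry zero or positive, so the total actually says something about spread.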

Adding our squared deviations (y'know, that last column there) gets us a 10. Now we divide it by n – 1, or 4. It's almost like we're taking the average of our squared deviations, but not quite. The last step is to take the square root, for a grand total of s = 1.58.
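The whole recipe fits in a few lines of Python. This is a minimal sketch, assuming the n – 1 version of the formula used in this problem; sample_std is a name we made up, and we compare it against statistics.stdev from the standard library, which also divides by n – 1.

```python
import math
import statistics


def sample_std(data):
    """Standard deviation with n - 1 in the denominator."""
    n = len(data)
    x_bar = sum(data) / n                         # step 1: the mean
    sum_sq = sum((x - x_bar) ** 2 for x in data)  # steps 2-4: subtract, square, add
    return math.sqrt(sum_sq / (n - 1))            # steps 5-6: divide, square root


s = sample_std([1, 2, 3, 4, 5])
print(round(s, 2))

# The built-in should land on the same value.
print(round(statistics.stdev([1, 2, 3, 4, 5]), 2))
```

Both lines should print 1.58, matching the hand calculation: 10 divided by 4 is 2.5, and the square root of 2.5 is about 1.58.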

The standard deviation is how we measure the spread of the data. It's kind of like the data's average distance to the mean but better, because we divided by n – 1. It's like putting flames and racing stripes on a car: we don't know why they work, they just do. Some points will be closer than 1.58, some will be farther, but 1.58 is a typical distance from the mean. (The plain average of the distances is 1.2; squaring first gives the far-out points extra weight.)

Just like the mean, the standard deviation has issues with outliers. If we went to find the standard deviation of {1, 2, 3, 4, 1000}, we would be subtracting 202 from every number then squaring the result. The final number would be big. Huge. Colossal. Embiggened beyond all reason.

In these kinds of cases, the standard deviation isn't very helpful in describing the data. We're better off giving up completely. Or using the quartiles. One of those.
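To put a number on "colossal," here is a quick sketch comparing the tidy dataset with the outlier one, again leaning on Python's statistics.stdev. The names tidy and wild are ours.

```python
from statistics import stdev

tidy = [1, 2, 3, 4, 5]
wild = [1, 2, 3, 4, 1000]  # same data, but with the bona fide outlier

s_tidy = stdev(tidy)
s_wild = stdev(wild)

print(f"tidy: s = {s_tidy:.2f}")
print(f"wild: s = {s_wild:.2f}")
```

One rogue data point pushes s from about 1.58 to roughly 446, even though four of the five values haven't moved at all. That number says more about the outlier than about the rest of the data.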