It's time we sat down and had The Talk, about where data comes from. When someone is very interested in a topic, the stork comes along and gives them a little bundle of data for them to analyze as their very own. The End.
Not buying it, huh? Well, you're probably old enough to learn the truth now. It's a story of the Populations and the Samples. No birds or bees involved.
If we want to collect some data, the first order of business is to decide what exactly we want to study. Going out and measuring things at random is a good way to get lost, beaten up, or arrested. What is our target, the thing that we want to learn about?
A population is all of the individuals that we could collect data from. This is a real Humpty Dumpty definition: the words mean exactly what we want them to mean, no more, no less. When we say "all" of the individuals, we do mean every single one, even if they would be impossible to find or measure. A population is more of an idea than something we actually work with.
And when we say "individuals," we don't necessarily mean people. If we were studying hummingbird calls, then all the hummingbirds would be our population. Or, if we were only interested in the length of the bird calls, then it would be the bird calls themselves that would be our population. We can think about the individuals of a population as being "items of interest." By the way, you should come see our band, Items of Interest, this Saturday.
Some examples of populations are:
Obviously, it can be hard, or even impossible, to study every individual in a population. That's why we won't even try. Instead, we'll take a sample, a subset of the total population, and study that instead. This is actually the whole point of statistics: to be able to use a sample to draw conclusions about the population as a whole. And you thought it was all about mathematicians trying to trick people into paying attention to them.
When we sample a population, we're trying to learn about some parameter of the population as a whole. For instance, we might be curious about the average GPA of the students who read Shmoop. That average, taken over every single Shmooper, is the parameter. We could ask 1000 Shmoopers, a sample of the whole, about their GPA. The average GPA of our sample can then be used as an estimate of the parameter. We think that the estimate and the GPA would both be pretty good.
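Here's a minimal sketch of the GPA idea in Python. Everything in it is made up for illustration: the population of 50,000 GPAs, the bell-shaped spread around 3.0, and the seed are all assumptions, not real Shmoop data. It just shows a sample mean standing in for a population mean.

```python
import random
import statistics

# Hypothetical population: 50,000 made-up GPAs, clipped to the 0.0-4.0 scale.
rng = random.Random(42)
population = [min(4.0, max(0.0, rng.gauss(3.0, 0.5))) for _ in range(50_000)]

# The parameter: the mean GPA of the entire population.
# In real life we usually can't compute this, which is why we sample.
mu = statistics.mean(population)

# The estimate: the mean GPA of a random sample of 1000 individuals.
sample = rng.sample(population, 1000)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.3f}")
print(f"sample mean (estimate):      {x_bar:.3f}")
```

The two printed numbers should land close together, even though the sample is only a small slice of the population.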
We can use all kinds of measurements as parameters and estimates. We can find the sample mean and use it to estimate the population mean, like in our GPA example. Yes, we can talk about multiple means at the same time. This gets confusing, so we have different symbols for the sample mean vs. the population mean: x̄ vs. μ. Oh hey, we've seen these two before.
Other measurements we can use as parameters and estimates are proportions, the median, and the standard deviation. In each case, we use the values found from a sample to create an estimate for the population as a whole.
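As a rough sketch, the same kind of calculation works for these other measurements. The helper name `summarize`, the 3.5 cutoff for the proportion, and the four sample GPAs are all invented for illustration. Note that `statistics.stdev` is the sample standard deviation (it divides by n - 1).

```python
import statistics

def summarize(sample, cutoff=3.5):
    """Compute several estimates from one sample of GPAs."""
    at_or_above = sum(1 for gpa in sample if gpa >= cutoff)
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "std_dev": statistics.stdev(sample),       # sample standard deviation
        "proportion": at_or_above / len(sample),   # share at or above the cutoff
    }

# A tiny made-up sample, just to show the mechanics.
sample = [2.0, 3.0, 3.6, 4.0]
estimates = summarize(sample)

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Each value is an estimate of the matching parameter in the population the sample came from. Four GPAs is far too few for a real study, of course.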
Not every sample is going to be a good sample. If we only ask the chess club members for their GPA, we're going to get a biased picture if we try to use that as an estimate. Nothing personal, chess people, but you're not very representative of the class as a whole.
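The way to get an unbiased estimate is to create the sample using random sampling. We're not talking monkey ninja pirate zombie types of random, though. There are two things we have to do to get a random sample: give every individual in the population an equal chance of being picked, and make each pick independent of the others.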
It's random like rolling dice, where every face has an equal chance of being rolled, and every roll of the die is independent of the ones before and after it. As long as our sample is large enough, the results will tend to be representative of the population as a whole.
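To see the difference between a random sample and a chess-club-only sample, here's a small simulation. The class of 1000 students, the 50 chess club members, and their higher GPAs are all made-up assumptions for the sake of the example. `random.Random.sample` picks without replacement and gives every student the same chance of being chosen.

```python
import random
import statistics

rng = random.Random(1)

def clipped_gpa(mean, spread):
    """Draw one made-up GPA and keep it on the 0.0-4.0 scale."""
    return min(4.0, max(0.0, rng.gauss(mean, spread)))

# Hypothetical class: 950 regular students plus 50 chess club members
# whose (invented) GPAs run higher than everyone else's.
students = [("regular", clipped_gpa(2.9, 0.5)) for _ in range(950)]
students += [("chess", clipped_gpa(3.6, 0.3)) for _ in range(50)]

# The parameter we want: the mean GPA of the whole class.
mu = statistics.mean(gpa for _, gpa in students)

# Random sample: every student has an equal chance of being picked.
random_sample = rng.sample(students, 100)
random_estimate = statistics.mean(gpa for _, gpa in random_sample)

# Biased sample: we only asked the chess club.
chess_estimate = statistics.mean(gpa for club, gpa in students if club == "chess")

print(f"whole class:     {mu:.2f}")
print(f"random sample:   {random_estimate:.2f}")
print(f"chess club only: {chess_estimate:.2f}")
```

In this setup the random sample lands much closer to the class average than the chess club does.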
Actually getting a random sample can be tough, though. If you're sampling wild flowers (maybe because you have a hot date and forgot to get a gift, you dog), it would be tempting to pull over to the side of the road and grab a whole clump of flowers. However, all the flowers away from the road are less likely to be picked, and flowers growing together are more likely to be picked. If your date wanted a random sample of flowers, and why wouldn't they, they're going to be disappointed.
Who loves a recap? We do.
What's the best way to collect the data for a sample? A butterfly net probably isn't the way to go, unless we're actually sampling butterflies. Putting up a flyer asking the data to come to us won't work. Well, if we need volunteers for a medical study, then it might work. Really, there are a lot of ways to collect data, and which one we use will depend on what we're collecting and why.
We've already talked about sample surveys a bit. That's when we collect a sample from a population for the purpose of estimating a parameter. Are your classmates planning a revolt against school? Run a poll and find out. It might be best if you stay home that day.
Have we mentioned how important random sampling is for sample surveys? It's not (just) that we're getting senile in our old age. Random sampling is so important that we'll mention it twice, and look old doing it. If a sample survey isn't a random sample, then we can't use it to estimate a population parameter. It's especially sad because that's its only job.
Sometimes, describing a population isn't enough. We want to test some crazy idea we have, like "Can pigs fly after taking an 8-week online course?" That's when we run an experiment.
To start, we randomly put individuals into two groups. Many experiments use a control group that receives no treatment and an experimental group that does. One set of pigs will take the online pilot's course, while the other pigs will take a course on something else. Maybe we'll teach those pigs some African History. Then we'll record the results of the two groups and compare them.
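Randomly putting individuals into two groups can be sketched as a shuffle followed by a split. The function name `assign_groups`, the 20 pigs, and the seed are hypothetical. This is one simple way to do it, not the only one.

```python
import random

def assign_groups(individuals, seed=None):
    """Shuffle the individuals, then split them into two groups."""
    rng = random.Random(seed)
    shuffled = list(individuals)   # copy, so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    control = shuffled[:half]        # takes the unrelated course
    experimental = shuffled[half:]   # takes the pilot's course
    return control, experimental

# Hypothetical roster of 20 pigs.
pigs = [f"pig_{i}" for i in range(1, 21)]
control, experimental = assign_groups(pigs, seed=7)

print("control:     ", control)
print("experimental:", experimental)
```

Because chance decides who goes where, the two groups should be similar on average in everything except the course they take.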
Experiments are great because they let us test cause-and-effect relationships. Does taking an online class on flying turn pigs into better pilots? It would be hard to figure this out without an experiment. It will probably be hard to figure out with an experiment, too. We're going to break a lot of planes before we're done.
Something that can't be done in our pig experiment, but that shows up in a lot of experiments, is double-blinding. When running the pig experiment, we the researchers knew which pigs were in the control group and which were in the experimental group. The pigs knew which group they were in as well. At least, we think they knew.
An example of a double-blind experiment is a drug trial where neither the doctors nor the patients know who is getting the real drug and who is getting a placebo. A doctor might act differently with someone in the control group than with someone in the experimental group. Or people might stop taking their pills if they know they're getting a placebo. In the land of the double blind, the doctor's assistants are king: that's because they're the ones who keep track of which individuals are in each group. Double-blinding helps reduce unconscious bias.
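One way to picture the assistant's job is as two lookup tables: a secret key saying who is in which group, and a set of kit codes that is all the doctors and patients get to see. This is only a sketch under assumptions. The name `blind_trial`, the `KIT-` code format, and the patient IDs are invented, and real trials use far more careful procedures.

```python
import random

def blind_trial(patient_ids, seed=None):
    """Return (kit_codes, key): the public codes and the secret key."""
    rng = random.Random(seed)
    patients = list(patient_ids)

    # Randomly split the patients into the two groups.
    shuffled = patients[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    # Secret key: only the assistant sees this.
    key = {pid: "drug" for pid in shuffled[:half]}
    key.update({pid: "placebo" for pid in shuffled[half:]})

    # Kit codes: shuffled separately, so a code reveals nothing about the group.
    codes = list(range(1000, 1000 + len(patients)))
    rng.shuffle(codes)
    kit_codes = {pid: f"KIT-{code}" for pid, code in zip(patients, codes)}

    return kit_codes, key

# Hypothetical patient IDs.
patients = [f"patient_{i}" for i in range(1, 9)]
kit_codes, key = blind_trial(patients, seed=3)

print(kit_codes)  # what doctors and patients see
```

The key stays locked away until the results are in, at which point the assistant uses it to sort the outcomes by group.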
There are times when we can't collect a random sample. On a Saturday morning, for instance. We'd rather just sleep in. Other times we even have actual reasons for why we can't. If we're studying the health effects of smoking, we can't ethically assign some people to a smoking group. That might have flown in the 1930s, but not today.
In those cases, we can conduct an observational study instead. We'll take people who already belong to the group we're interested in, and we'll compare them to a control group. While the control group can be randomly selected, the group we're interested in can't be. Not to sound like a broken record, but no random sampling: no dice.
This means our sample probably won't be a representative sample. That's why we avoid observational studies like the plague, unless we have no choice. Then we'll only do it if we drink plenty of OJ and wash our hands obsessively, to avoid catching the observational study. Or the plague. One of those.