# High School: Statistics and Probability

### Interpreting Categorical and Quantitative Data HSS-ID.A.2

2. Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.

One of the main reasons for collecting data is so it can be compared to other data. Sounds like a dream come true, doesn't it?

Comparing data allows us to make big statements. After all, how can we be sure that using Shmoop increases test scores if we don't have two sets of data, pre-Shmoop and post-Shmoop, to compare?

Rather than comparing entire data sets, however, we can summarize the data and compare these summaries. That way, rather than comparing long and seemingly never-ending lists, we can compare two very basic factors that tell us a lot about the data: the center and spread of the data.

The center of the data is exactly what it sounds like: a representation of the middle of the data, or a typical value. It gives us a good first guess as to where on the number line the data will fall. Students should know the two types of centers of data: mean and median. The mean, or average, is the sum of all the data points divided by the number of data points, while the median is the middle value of the sorted data, splitting the data set into a lower half and an upper half.

Students should know that the center of data can give us a good sense of the data set overall. For instance, we'll know that the heights of buildings are more closely represented by an average of 100 feet than by an average of 100,000 feet. Still, the center of data doesn't tell us the whole story. Let's say we have the following two sets of data:

Set 1: 4, 5, 6, 4, 6, 5
Set 2: 1, 9, 2, 8, 0, 10

Both of these data sets have an average of 5, but the first set only has values between 4 and 6, while the second set has values between 0 and 10, a much wider range. This "wideness" or "breadth" of the data is represented by the spread of the data, and that's the second aspect students should consider when summarizing data.
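Here's a quick way to check those two claims, sketched in Python with the standard-library `statistics` module. The variable names are just illustrative.

```python
from statistics import mean

set_1 = [4, 5, 6, 4, 6, 5]
set_2 = [1, 9, 2, 8, 0, 10]

# The centers agree: both means are 5.
mean_1 = mean(set_1)
mean_2 = mean(set_2)

# The distance from the smallest to the largest value does not agree.
range_1 = max(set_1) - min(set_1)
range_2 = max(set_2) - min(set_2)

print(mean_1, mean_2)
print(range_1, range_2)  # 2 10
```

Same center, very different spread, which is exactly why one number isn't enough to summarize a data set.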

Students should know how to use the interquartile range and standard deviation to describe the spread of data. The interquartile range (IQR) is the range that spans the middle fifty percent of the data. To determine the IQR, the lower (Q1) and upper (Q3) quartiles need to be determined. Once that is done, IQR = Q3 – Q1.

The standard deviation, denoted by σ, measures how far the data points typically fall from the mean of the data set. If you could simultaneously move away from the mean in both directions, then by the time you had traveled the distance of one standard deviation each way, about 68% of the data would be between you and your clone (in a normal distribution, anyway).
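For the curious, that 68% figure can be checked with Python's built-in `statistics.NormalDist` (available in Python 3.8 and later). This is only a sketch for an idealized normal distribution with mean 0 and standard deviation 1; real data will only approximate it.

```python
from statistics import NormalDist

# An idealized bell curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution lying within one standard deviation of the mean.
share_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(f"{share_within_one_sd:.1%}")  # 68.3%
```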

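Students should know that the mean has the formula

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and the standard deviation has the formula

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}$$

where n is the number of data points and each x_i is one data point. Dividing by n – 1 gives the sample standard deviation, which is the version that matches the drill answers below.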
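The two formulas can also be written out as a pair of short Python functions. Treat this as a sketch: the function names are made up for illustration, and the choice to divide by n – 1 (the sample standard deviation) is an assumption based on the drill answers further down. Dividing by n instead would give the population standard deviation.

```python
from math import sqrt

def mean(values):
    """Sum of the data points divided by the number of data points."""
    return sum(values) / len(values)

def sample_std_dev(values):
    """Square root of the summed squared deviations over n - 1 (assumed divisor)."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (len(values) - 1))

set_1 = [4, 5, 6, 4, 6, 5]
print(mean(set_1))                      # 5.0
print(round(sample_std_dev(set_1), 2))  # 0.89
```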

Often, the mean and standard deviation are used together and the median and interquartile range are used together. Students should know that the mean and standard deviation are most frequently used when the distribution of data follows a bell curve (normal distribution), while the median and interquartile range are the better choice when the data are skewed or contain outliers.

Students should understand that the larger the IQR or standard deviation, the larger the spread of the data. If students are struggling with why this is so, show them mathematically using the formulas: a bigger IQR means the quartiles are further apart, and a bigger standard deviation means the data points sit further from the mean. Now, rather than comparing tables of dozens or even hundreds of numbers, we just need to compare two numbers for each data set: a center and a spread.
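One way to show this with numbers is to look at the squared deviations from the mean, the pieces that get added up inside the standard deviation formula. Here's a small Python sketch using Set 1 and Set 2 from earlier; the helper name is hypothetical.

```python
set_1 = [4, 5, 6, 4, 6, 5]
set_2 = [1, 9, 2, 8, 0, 10]

def squared_deviations(values):
    """Squared distance of each data point from the mean of its set."""
    center = sum(values) / len(values)
    return [(x - center) ** 2 for x in values]

print(squared_deviations(set_1))  # [1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
print(squared_deviations(set_2))  # [16.0, 16.0, 9.0, 9.0, 25.0, 25.0]

# The totals feed straight into the standard deviation.
print(sum(squared_deviations(set_1)), sum(squared_deviations(set_2)))  # 4.0 100.0
```

Set 2's points sit much further from the mean, so its total is 25 times larger and its standard deviation comes out larger too.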

Here's a resource teachers can use to help explain the normal distribution curve.

#### Drills

1. The mean and median of data sets are calculated in order to do which of the following?

Compare the center location of the data

We calculate the mean and median to get an estimate of where on the number line the data set will fall. The spread is compared using the IQR and standard deviation, the maximum is only good for comparing maximums, and there is no "best" data set. (Well, most of the time, anyway.)

2. The interquartile range of a set of 18 data points will contain how many of the data points?

9

The IQR contains half of the data, so half of the 18 data points is 9 data points. Those 9 data points are between the lower and upper quartiles. The other 9 are outside: some between the minimum and the lower quartile, and the rest between the upper quartile and the maximum.

3. The time it takes you to get to school in the morning follows a normal distribution. The following table lists the number of minutes it took you to drive to school. On average, how long does it take you to drive to school?

24

We're looking for the average, or mean, as our center of data. We can calculate the mean by adding up all the values of the data points and dividing by the number of data points. If we do the arithmetic properly, we should end up with x̄ = 240/10 = 24. It takes you an average of 24 minutes to drive to school.
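The arithmetic can be double-checked in a couple of lines of Python. The ten drive times below are taken from the sorted list given in the median drill, so their order here is an assumption; it doesn't affect the sum.

```python
# Drive times in minutes, as listed (sorted) in the median drill.
drive_times = [16, 20, 21, 22, 24, 25, 25, 27, 28, 32]

total = sum(drive_times)
average = total / len(drive_times)

print(total, len(drive_times), average)  # 240 10 24.0
```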

4. The time it takes you to get to school in the morning follows a normal distribution. The following table lists the number of minutes it took you to drive to school. What is the standard deviation for your last 10 drives to school?

4.5

How awesome would it be if the standard deviation were 0? That would mean your drives to school would take the exact same amount of time, no matter what. Major skills. Instead, it's fairly common that it might take you 4.5 minutes longer or shorter. We can calculate that using the standard deviation formula.
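Here's a sketch of that calculation using Python's `statistics` module, with the drive times taken from the sorted list in the median drill. It computes both common versions of the standard deviation; the 4.5 answer lines up with the sample version, which divides by n – 1.

```python
from statistics import pstdev, stdev

# Drive times in minutes, as listed (sorted) in the median drill.
drive_times = [16, 20, 21, 22, 24, 25, 25, 27, 28, 32]

sample_sd = stdev(drive_times)       # divides the squared deviations by n - 1
population_sd = pstdev(drive_times)  # divides the squared deviations by n

print(round(sample_sd, 1))      # 4.5
print(round(population_sd, 1))  # 4.3
```

The squared deviations from the mean of 24 add up to 184, so the sample version is the square root of 184/9, or about 4.5.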

5. The time it takes you to get to school in the morning follows a normal distribution. The following table lists the number of minutes it took you to drive to school. What is the median of this data?

24.5

To find the median, we should arrange the data in order of smallest to greatest: 16, 20, 21, 22, 24, 25, 25, 27, 28, 32. The median is simply the number in the middle of the entire set. Since we have an even number of data points, it's the average of the fifth and sixth terms (24 and 25). Halfway in between them is 24.5, which is our answer.
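The same steps look like this in Python. It's a minimal sketch: `statistics.median` handles the "average the two middle values" rule for an even-sized data set on its own, and the slice is only there to show which two values it uses.

```python
from statistics import median

# Sorting first mirrors the by-hand method.
drive_times = sorted([16, 20, 21, 22, 24, 25, 25, 27, 28, 32])

# With 10 values, the middle pair is the fifth and sixth values.
middle_pair = drive_times[4:6]
middle_value = median(drive_times)

print(middle_pair)   # [24, 25]
print(middle_value)  # 24.5
```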

6. The time it takes you to get to school in the morning follows a normal distribution. The following table lists the number of minutes it took you to drive to school. What is the interquartile range of this data?

7

To find the IQR, we need to find the lower and upper quartiles, Q1 and Q3. If we arrange our data in increasing order, it'll be way easier to do this: 16, 20, 21, 22, 24, 25, 25, 27, 28, 32. Since we have 10 data points, our Q1 sits at position 2.5 (halfway between the 2nd and 3rd values), and Q3 sits at position 8.5 (halfway between the 8th and 9th values, the same distance in from the top). In other words, Q1 = 20.5 and Q3 = 27.5. Then, we know that IQR = Q3 – Q1 = 27.5 – 20.5 = 7. That means the middle half of our data spans an interval of 7 minutes.
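A short Python sketch of this calculation follows. It assumes the quartile placement that produces the drill's numbers: Q1 halfway between the 2nd and 3rd sorted values, and Q3 halfway between the 8th and 9th. Quartile conventions vary, so library functions such as `statistics.quantiles` may give slightly different values for the same data.

```python
# Drive times in minutes, sorted from smallest to largest.
drive_times = sorted([16, 20, 21, 22, 24, 25, 25, 27, 28, 32])

# Assumed placement: Q1 is midway between the 2nd and 3rd values from the bottom,
# and Q3 is midway between the 2nd and 3rd values from the top (the 9th and 8th).
q1 = (drive_times[1] + drive_times[2]) / 2
q3 = (drive_times[-3] + drive_times[-2]) / 2
iqr = q3 - q1

print(q1, q3, iqr)  # 20.5 27.5 7.0
```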

7. Below are the scores for two different sections of a vocabulary quiz. Given that the distribution of scores follows a normal distribution, what is a measure of the center of data for each of these sections?

Section 1:

Section 2:

Section 1: 15.2; Section 2: 12.9

The mean for Section 1 is 15.2 and the mean for Section 2 is 12.9. Notice that the problem description tells us that the scores follow a normal distribution; therefore, we use the mean as the measure of the center of each data set. In (B) and (C), one of the two values is the median instead of the mean. In most cases, it's best to use the same type of center (either mean or median) when comparing two sets of data.

8. Below are the scores for two different sections of a vocabulary quiz. Given that the distribution of scores follows a normal distribution, which section had a greater spread in the data?

Section 1:

Section 2:

Section 2 had a greater standard deviation (σ = 3.33) than Section 1 (σ = 3.01). This means that in addition to having a lower mean, Section 2's scores were more spread out around that mean. Again, since this was a normal distribution, the standard deviation was used as opposed to the IQR.

9. Billy-Joe Bob and Bobby-Joe Bill are having a contest to see whose chickens provide more eggs. Over the course of 10 days, the farmers each count and record the number of eggs they collect. The data sets do not follow a normal distribution. Which farmer has a greater median number of eggs?

Billy-Joe Bob: 28, 21, 8, 15, 6, 18, 16, 30, 25, 17

Bobby-Joe Bill: 27, 28, 15, 28, 28, 23, 20, 8, 14, 8

Bobby-Joe Bill

If we arrange the data for each farmer in order and average the two middle values, we get 17.5 as the median for Billy-Joe Bob and 21.5 as the median for Bobby-Joe Bill. That's 21.5 eggs collected for Bobby-Joe Bill compared to Billy-Joe Bob's measly 17.5. If we ever end up on a farm, sign us up for Bobby-Joe's chicken rearing methods.
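Here's a quick Python check of both medians, using the egg counts exactly as given. The variable names are just for illustration.

```python
from statistics import median

billy_joe_bob = [28, 21, 8, 15, 6, 18, 16, 30, 25, 17]
bobby_joe_bill = [27, 28, 15, 28, 28, 23, 20, 8, 14, 8]

# median() sorts the data and averages the two middle values for us.
bob_median = median(billy_joe_bob)
bill_median = median(bobby_joe_bill)

print(bob_median, bill_median)  # 17.5 21.5
```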

10. It turns out in his quest to be the best, Bobby-Joe Bill embellished the results from his egg collecting efforts by adding 5 eggs to the days that had fewer than 20 eggs. What was the median number of eggs he really collected? The false data is given below.

Billy-Joe Bob: 28, 21, 8, 15, 6, 18, 16, 30, 25, 17

Bobby-Joe Bill: 27, 28, 15, 28, 28, 23, 20, 8, 14, 8

16.5

You can't even trust farmers these days. The correct data set should have looked like this: 27, 28, 10, 28, 28, 18, 15, 3, 9, 3. The correct median is 16.5, so the answer is (C). And Billy-Joe Bob should have won the contest.
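To double-check the detective work, here's a sketch in Python that runs the embellishment forward. It starts from the corrected counts above, adds 5 eggs to every day with fewer than 20, and confirms the result matches the numbers Bobby-Joe Bill reported. The variable names are hypothetical.

```python
from statistics import median

reported = [27, 28, 15, 28, 28, 23, 20, 8, 14, 8]
actual = [27, 28, 10, 28, 28, 18, 15, 3, 9, 3]

# Re-apply the embellishment: add 5 to any day that truly had fewer than 20 eggs.
embellished = [eggs + 5 if eggs < 20 else eggs for eggs in actual]
matches_report = embellished == reported

actual_median = median(actual)

print(matches_report)  # True
print(actual_median)   # 16.5
```

With an honest median of 16.5, Bobby-Joe Bill falls just short of Billy-Joe Bob's 17.5.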