# High School: Statistics and Probability

### Interpreting Categorical and Quantitative Data HSS-ID.A.3

3. Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).

Students should already know that in order to describe and compare data sets, we need to define the center and the spread of the data. These summaries are good first steps, but there is one more feature of the data that can give another clue and help us compare data sets: the shape of the data.

Students should know that mound-shaped distributions can be either symmetric or skewed, either left or right, but a normal distribution is a mound-shaped, symmetric curve. Students should also know the relation of mean and median to symmetric or skewed curves. If the mean is greater than the median, the data is skewed right. If the mean is less than the median, the data is skewed left. If the mean and median are equal, then the data is symmetric.
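The mean-versus-median rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name `skew_direction` and the three small data sets are made up for this example, and the "symmetric" label follows the rule as stated here rather than a formal test of symmetry.

```python
from math import isclose
from statistics import mean, median

def skew_direction(values):
    """Classify a data set by comparing its mean and median (rule of thumb)."""
    m, med = mean(values), median(values)
    if isclose(m, med):
        return "symmetric"
    # A long tail on the right pulls the mean above the median, and vice versa.
    return "skewed right" if m > med else "skewed left"

# Hypothetical data sets, chosen only to show each case.
print(skew_direction([1, 2, 2, 3, 12]))     # mean 4, median 2
print(skew_direction([1, 10, 11, 11, 12]))  # mean 9, median 11
print(skew_direction([1, 2, 3, 4, 5]))      # mean 3, median 3
```

A histogram is still the better tool for judging shape; the comparison only hints at which side the tail is on.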

Students should realize that the shape of the data helps us find and identify outliers. An outlier is something that sticks out from the rest of the data, like an egg with two yolks. It's a data point that makes you furrow your brow and wonder if you measured wrong. Formally, an outlier is a data point that has an "extreme value" when compared with the rest of the data set.

Mathematically speaking, an outlier is defined as any point that falls more than 1.5 times the IQR below the lower quartile or more than 1.5 times the IQR above the upper quartile. To visualize what this means, we can use a box plot with the data below. First, we sort the data from smallest to largest to find the lower quartile (Q1), median, and upper quartile (Q3).

Data: 37, 37, 38, 38, 40, 40, 42, 42, 42, 62

The median is 40.
Q1 = 38
Q3 = 42
Therefore, IQR = Q3 – Q1 = 42 – 38 = 4.

The box plot then looks like this:

If IQR = 4, then the lower limit on outliers is Q1 – 1.5 × IQR = 38 – 1.5 × 4 = 32 and the upper limit on outliers is Q3 + 1.5 × IQR = 42 + 1.5 × 4 = 48. We can add these as vertical lines in the box plot.
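The same calculation can be sketched in Python. The helper names `quartiles` and `outlier_limits` are invented for this example, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is the convention that reproduces Q1 = 38 and Q3 = 42. Other conventions, such as the default in Python's `statistics.quantiles`, can give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the middle position
    upper_half = ordered[(n + 1) // 2 :]  # values above the middle position
    return median(lower_half), median(ordered), median(upper_half)

def outlier_limits(values):
    """Return (lower limit, upper limit) from the 1.5 x IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [37, 37, 38, 38, 40, 40, 42, 42, 42, 62]
q1, med, q3 = quartiles(data)
low, high = outlier_limits(data)
outliers = [x for x in data if x < low or x > high]

print("Q1, median, Q3:", q1, med, q3)
print("limits:", low, high)
print("outliers:", outliers)
```

Any value outside the two limits is flagged, so the final list contains every point that would be drawn individually on the box plot.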

We can see that 62 is an outlier because it lies above the upper limit of 48. When there is an outlier on one side of the data set, we can chop the whisker off at the limit and then record the outliers as individual data points. So, the final box plot for this data set would look like this:

Students should understand that removing this outlier changes the mean significantly (from 41.8 to about 39.6), but not the median (which stays at 40). The absence or presence of outliers may make either the mean or the median more representative of the center of the data, and students should be able to choose which is preferable depending on the data. They should also be able to identify outliers by calculating the limits based on the IQR, and give reasonable explanations for why outliers might exist within a particular context.
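A quick check of that claim, using the example data from above. This is a minimal sketch; the variable names are chosen for illustration only.

```python
from statistics import mean, median

data = [37, 37, 38, 38, 40, 40, 42, 42, 42, 62]
without_outlier = [x for x in data if x != 62]  # drop the single outlier

# The mean shifts noticeably once 62 is removed...
print(round(mean(data), 2), round(mean(without_outlier), 2))

# ...while the median stays where it was.
print(median(data), median(without_outlier))
```

The mean uses every value, so one extreme point drags it upward. The median depends only on the middle of the sorted list, which is why it holds steady.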

Here's a video resource that teachers can use to help explain the normal distribution curve.

#### Drills

1. Given the following histogram, how can we describe the shape of the data?

Skewed Right

The data is skewed. In this case, the tail of the data is to the right of the location with the greatest frequency, so it is skewed to the right. When identifying skewed data, make sure to consider where the tail of the data is, not the location with the greatest frequency.

2. Given the following histogram, how can we describe the shape of the data?

Symmetric

Even though the right side and the left side of the histogram aren't identical, the data is still called symmetric. It's centered around a central point in the middle and there is no noticeable tail.

3. The mean of a data set is 12 and the median is 12. What are the possible shapes for this data set?

Both (A) and (B)

We do not know very much about this data except that the median and the mean are equal. What does that mean? It suggests that the data is not skewed, so we can knock (C) off the list of possible answers, and that our data is symmetric. We don't know enough about the data to take (A) off the list of potential shapes, so we'll keep it in so as not to hurt its feelings.

4. The mean of a data set is 12 and the median is 10. What shape is the data?

Skewed Right

When the mean and median are not equal or very close to equal, the data can't be considered symmetric, so let's get rid of (D) right off the bat. We're dealing with statistics, not geometry, so (A) is a no go as well. That leaves us with a data set that is skewed, but which way? Right or wrong—er, uh, left? When the mean is greater than the median, the data is skewed right, so (C) is the correct answer.

5. The median price of a home in your neighborhood is \$212,000. One fourth of the homes are less than \$200,000 and one fourth are greater than \$238,000. Which of the following home prices would be considered an unusually good deal?

Anything less than \$143,000

Let's assume that getting an "unusually good deal" means you are buying a home with a price that is an outlier. We are looking for a deal, so the outlier would be to the left of the rest of the data (as in, cheaper). From the problem statement, Q1 = \$200,000 and Q3 = \$238,000, so IQR = \$238,000 – \$200,000 = \$38,000. The lower limit on outliers is measured from the lower quartile, not the median, so to be an outlier, and a good deal, the home would need to cost less than \$200,000 – 1.5 × \$38,000 = \$143,000.

6. Given the data points 18, 14, 12, 14, 11, 11, 19, 20, 16, and 11, which values would be considered outliers?

Outliers must be less than 0.5 or greater than 28.5

To determine the upper and lower bounds for outliers, we first need to determine the IQR, which requires finding the upper and lower quartiles. So many steps! But we can do it. After arranging our data in order, we find that the lower quartile is 11 and the upper quartile is 18. This means the IQR, or the range that contains the middle half of the data, is 7. The lower bound on outliers is the lower quartile minus 1.5 times the IQR, or 11 – 1.5 × 7 = 0.5. So it has to be (D). To verify this, we can determine the upper bound on outliers as the upper quartile plus 1.5 times the IQR, or 18 + 1.5 × 7 = 28.5.
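For anyone who wants to check the steps, here is one way to walk through them in Python. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, as in the explanation above; the variable names are illustrative.

```python
from statistics import median

data = [18, 14, 12, 14, 11, 11, 19, 20, 16, 11]

# Step 1: arrange the data in order.
ordered = sorted(data)

# Step 2: split into halves and take the median of each half.
half = len(ordered) // 2
q1 = median(ordered[:half])
q3 = median(ordered[len(ordered) - half:])

# Step 3: the IQR and the two bounds on outliers.
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Step 4: list any values that fall outside the bounds.
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(q1, q3, iqr, lower_bound, upper_bound, outliers)
```

The smallest value is 11 and the largest is 20, both comfortably inside the bounds, so the list of outliers comes back empty for this data set.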

7. The following data represents the highest temperatures on June 21 for the last 10 years. Should any of these temperatures be considered outliers?

Temperature (°F): 90, 94, 89, 92, 89, 103, 77, 92, 97, 90

77°F and 103°F

To determine if 103 should be considered an outlier (and we think it should be), first we need to determine Q1 and Q3. After organizing our data, we can see that Q1 = 89°F and Q3 = 94°F. So, IQR = 5. Therefore, the lower bound on outliers is Q1 – 1.5 × IQR = 89 – 1.5 × 5 = 81.5°F. The upper bound on outliers is Q3 + 1.5 × IQR = 94 + 1.5 × 5 = 101.5°F. So, anything less than 81.5°F and anything greater than 101.5°F are outliers. Indeed, 103°F is an outlier, and 77°F is also an outlier.

8. The following data represents the highest temperatures on June 21 for the last 10 years. Next year, another temperature will be added to the data. Which temperature would not be an outlier?

Temperature (°F): 90, 94, 89, 92, 89, 103, 77, 92, 97, 90

95°F

We already know from the previous problem that anything less than 81.5°F and anything greater than 101.5°F are outliers. As nice as 81°F sounds, it's still an outlier, as is 79°F. Since 104°F is above 101.5°F, it's also an outlier. The only temperature that isn't an outlier is the not-too-hot-but-still-considerably-warm 95°F.
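The comparison can be sketched as a simple filter. Like the explanation, this reuses the bounds of 81.5°F and 101.5°F from the previous problem rather than recomputing them with the new point included. The constant and variable names are made up for this example.

```python
# Bounds on outliers carried over from the previous problem (°F).
LOWER_BOUND = 81.5
UPPER_BOUND = 101.5

# The four temperatures discussed in the explanation.
candidates = [79, 81, 95, 104]

# Keep only the temperatures that fall inside the bounds.
not_outliers = [t for t in candidates if LOWER_BOUND <= t <= UPPER_BOUND]
print(not_outliers)
```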

9. Is this data symmetric and mound shaped?

 25 25 26 26 27 27 28 28 29 29

The data is symmetric, but not mound shaped

To determine this, we can find the mean and median for this data set. If they are equal, the data is symmetric. In this case, the mean and median are both equal to 27, so the data is symmetric. That eliminates (B) and (C). Looking at the data points, it's clear that there are two 25's, two 26's, two 27's, two 28's, and two 29's. The frequency doesn't increase or decrease. It's constant, which is not mound-like at all. That means (D) is the right answer.
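Both checks can be sketched in Python: comparing the mean and median for symmetry, and counting how often each value appears to see whether the frequencies rise and fall like a mound. The names `counts` and `is_flat` are illustrative.

```python
from collections import Counter
from statistics import mean, median

data = [25, 25, 26, 26, 27, 27, 28, 28, 29, 29]

# Symmetry check from the explanation: are the mean and median equal?
same_center = mean(data) == median(data)

# Shape check: tally how many times each value occurs.
counts = Counter(data)
is_flat = len(set(counts.values())) == 1  # True when every value is equally common

print(same_center, dict(counts), is_flat)
```

Every value occurs exactly twice, so the frequencies form a flat line rather than a mound.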

10. Which of the following points, when added to the data set, will create a data set that is skewed left?

 25 25 26 26 27 27 28 28 29 29