# High School: Statistics and Probability

### Interpreting Categorical and Quantitative Data HSS-ID.A.4

4. Use the mean and standard deviation of a data set to fit it to a normal distribution and to estimate population percentages. Recognize that there are data sets for which such a procedure is not appropriate. Use calculators, spreadsheets, and tables to estimate areas under the normal curve.

Students should already know that the distribution of data can take many forms. It can be symmetric, skewed, distributed uniformly, or follow a normal distribution, also known as a bell curve (think Liberty Bell, not jingle bell), also known as a Gaussian distribution. They don't have to know why a normal distribution has so many different names, although it couldn't hurt.

Students should know that we can describe normal distributions as frequency distributions by expressing the data points as percents instead of true values. For example, a cookie factory can produce and package 20,000 boxes of cookies in a month. Each box of cookies is supposed to weigh 22 ounces, but no cookie is perfect (although we'll argue that every cookie is perfect). The following histogram with 10 bins shows the actual weight of the cookie boxes in a month's worth of production. The data has a mean of 22 and a standard deviation of 1.0.

When given this data in the form of a table, students should be able to find the percentages or probabilities for each value. This results in a relative frequency distribution, where the y-axis of the histogram is between 0 and 1 (or 0 and 100%) and the sum of all the percentages is equal to 1 (or 100%).
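As a rough illustration of turning counts into a relative frequency distribution, here is a minimal Python sketch. The bin counts below are hypothetical, made up to total 20,000 boxes; they are not the factory data from the histogram, and the function name `relative_frequencies` is ours.

```python
# Hypothetical bin counts for 20,000 cookie boxes, keyed by weight range in ounces.
# These numbers are invented for illustration; they are not the source's data.
counts = {
    "17-18": 1, "18-19": 26, "19-20": 428, "20-21": 2718, "21-22": 6827,
    "22-23": 6827, "23-24": 2718, "24-25": 428, "25-26": 26, "26-27": 1,
}


def relative_frequencies(bin_counts):
    """Convert raw counts per bin into fractions of the total."""
    total = sum(bin_counts.values())
    return {label: count / total for label, count in bin_counts.items()}


rel = relative_frequencies(counts)
for label, share in rel.items():
    print(f"{label} oz: {share:.2%}")

# The fractions should add up to 1 (that is, 100%).
print("Sum of relative frequencies:", round(sum(rel.values()), 10))
```

Each value is a count divided by the total, so every bar lands between 0 and 1 and the bars together account for all the boxes.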

We can fit a bell curve to this distribution.

When a normal curve is represented as a continuous line, like the line in the figure above, it is called a continuous distribution. The area under the curve of the continuous distribution is always equal to 1.0 (just as the percentages in the table above add up to 100%).

Students should know when it makes sense to talk about many values in terms of a continuous distribution. The weight of the cookie box, for instance, could be anywhere from 17 to 27 ounces (or less, or more), and the weight does not need to fall on an integer value. It could be 21.87 or 22.9 or 0 or 82,729. Now that's a lot of cookie.

Assuming a normal distribution, students should be able to approximate the shape of the continuous distribution given the average and the standard deviation. As the standard deviation increases, the bell shape begins to flatten out because a greater standard deviation suggests the data is spread out more from the mean.
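To see the flattening in numbers, the sketch below evaluates the standard normal density formula at the mean for a few standard deviations. The mean of 22 comes from the cookie example; the three σ values and the helper name `normal_pdf` are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu = 22.0  # mean box weight from the cookie example
for sigma in (0.5, 1.0, 2.0):  # illustrative standard deviations
    peak = normal_pdf(mu, mu, sigma)
    print(f"sigma = {sigma}: peak height = {peak:.3f}")
```

The peak height shrinks as σ grows. Because the total area has to stay at 1, a wider curve must also be a shorter one.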

Students should also know that about 68% of the data will fall in between the points of inflection (which are exactly ±σ away from the mean). If we increase the distance to 2 standard deviations from the mean (±2σ), we will capture about 95% of the data, and moving three standard deviations away captures about 99.7% of the data. This is called the empirical rule.
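The empirical rule percentages can be checked with a short sketch. It relies on the standard identity that the share of a normal distribution within k standard deviations of the mean is erf(k/√2); the helper name `share_within` is our own.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.4f}")
```

The results round to 68%, 95%, and 99.7%, which is where the rule's familiar numbers come from.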

The Z-score is the number of standard deviations a data point is away from the mean. It's a useful way to normalize all normal distributions. (And you thought they couldn't get more normal.) Students should be able to calculate a Z-score using the following formula.

$$Z = \frac{x - \mu}{\sigma}$$

Here, μ is the true mean, σ is the standard deviation, and x is the data point in question. If we pick a cookie box that weighs 25 ounces and we know that μ = 22 and σ = 1.0, we can determine the Z-score as:

$$Z = \frac{25 - 22}{1.0} = 3$$

The weight of this box is 3 standard deviations from the mean. If we know that 99.7% of the data lie within three standard deviations of the mean, what does that suggest about this box of cookies? It probably means we scored cookie box gold!
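The Z-score calculation for the 25-ounce box can be sketched in a few lines of Python. The function name `z_score` and the check on σ are our additions; the numbers come from the cookie example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mu) / sigma


# Cookie box example: x = 25 ounces, mean 22, standard deviation 1.0.
print(z_score(25, 22, 1.0))  # 3.0
```

A positive result means the value sits above the mean, and a negative result means it sits below.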

Students should be able to find the area under a portion of the curve (say, our chances of finding a cookie box that weighs 25 ounces or more), using the Z-score and a table to do it. More than that, they should understand that this area represents the probability that a random data point will fall within the described region.

Remind students that σ and Z are different, and to be careful about which table they're using to find the area under the curve (some tables are cumulative starting from -∞, and others start at the mean). They could also be reminded that the area under the entire curve is always 1.
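The two table conventions can be compared with a small sketch. It computes the cumulative area from the error function instead of a printed table, and the function names are hypothetical. Both conventions should give the same chance of finding a box that weighs 25 ounces or more.

```python
import math


def cdf_from_minus_infinity(z):
    """Area under the standard normal curve from -infinity up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def area_from_mean(z):
    """Area under the standard normal curve between the mean (Z = 0) and z."""
    return cdf_from_minus_infinity(z) - 0.5


z = (25 - 22) / 1.0  # Z-score of a 25-ounce box

# Cumulative-style table: subtract the area to the left of z from 1.
p_at_least_25 = 1.0 - cdf_from_minus_infinity(z)

# Mean-style table: the right half holds 0.5, so subtract the area from 0 to z.
same_answer = 0.5 - area_from_mean(z)

print(f"P(weight >= 25 oz) = {p_at_least_25:.5f}")
print(f"Using the from-the-mean table: {same_answer:.5f}")
```

Mixing up the two conventions shifts the answer by exactly 0.5, which is an easy error to spot if students remember that the whole curve has area 1.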

Common sense is never over-rated. With all these variables and numbers and tables, it's easy to get confused. We don't need a calculator or a table to figure out that Prob(Z ≤ 0) = 0.5 because the chances of a random data point being less than the mean (Z ≤ 0) or greater than the mean (Z ≥ 0) are each 50%. If they understand what these variables and numbers and tables actually do, students are less likely to make silly errors and perform unnecessary calculations.


#### Drills

1. You purchased 10 baskets of strawberries at the local farmer's market and counted the number of strawberries in each basket. Based on your purchases, do you think the number of strawberries in a basket is normally distributed?

Yes

All of the above? How is that even possible? This data is normally distributed. The first way to determine this is to calculate the mean and the median, which are both equal to 20. A matching mean and median point to a symmetric distribution, which is a big clue that the data is normally distributed. But, just to be sure, we should check that we are dealing with a normal (Liberty Bell) curve. In this case, we can see that the mound will occur at 20.

2. Andy Lee is the punter for the San Francisco 49ers. He had a stellar 2011 season with an average punt length of 50.9 yards with a standard deviation of 3.5. His punt distance follows a normal distribution. Determine the range of punt distances that covers 68% of the distances.

47.4 to 54.4 yards

Because this data follows a normal distribution, we can use the empirical rule. We know that 68% of the data will be within one standard deviation of the mean in both the positive and negative directions. The mean is an amazing 50.9 yards with a decent standard deviation of 3.5 yards. So one standard deviation below is 50.9 – 3.5 = 47.4 and one standard deviation above is 50.9 + 3.5 = 54.4.

3. Andy Lee is the punter for the San Francisco 49ers. He had a stellar 2011 season with an average punt length of 50.9 yards with a standard deviation of 3.5. His punt distance follows a normal distribution. Determine the interval that contains 95% of the data.

43.9 to 57.9 yards

Because this data follows a normal distribution, we can use the empirical rule. We know that 95% of the data will be within two standard deviations of the mean in both the positive and negative directions. The mean is an amazing 50.9 yards with a decent standard deviation of 3.5 yards. So two standard deviations below is 50.9 – 2(3.5) = 43.9 and two standard deviations above is 50.9 + 2(3.5) = 57.9.
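The interval arithmetic in the two punting drills above can be sketched as a small helper. The function name `empirical_interval` is ours, and results are rounded to one decimal place to match the drill answers.

```python
def empirical_interval(mu, sigma, k):
    """Return (low, high) for the interval mu ± k*sigma, rounded to one decimal."""
    low = round(mu - k * sigma, 1)
    high = round(mu + k * sigma, 1)
    return low, high


mu, sigma = 50.9, 3.5  # punt mean and standard deviation from the drills

print("About 68% of punts:", empirical_interval(mu, sigma, 1))
print("About 95% of punts:", empirical_interval(mu, sigma, 2))
```

Passing k = 3 to the same helper would give the interval that covers about 99.7% of the punts.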

4. Andy Lee is the punter for the San Francisco 49ers. He had a stellar 2011 season with an average punt length of 50.9 yards with a standard deviation of 3.5. His punt distance follows a normal distribution. In the very last game of the post-season, Andy Lee made his last punt for 39 yards. What is the Z-score for this punt?

-3.4

Remember that the Z-score, which also normalizes the value, is $Z = \frac{x - \mu}{\sigma}$. In this case, $Z = \frac{39 - 50.9}{3.5} = -3.4$. The sign is important! The negative Z-score tells us the value is less than the mean (rather than greater than the mean).

5. Andy Lee is the punter for the San Francisco 49ers. He had a stellar 2011 season with an average punt length of 50.9 yards with a standard deviation of 3.5. His punt distance follows a normal distribution. Andy Lee's first punt of the season was 66 yards. What is the Z-score for this punt?

4.3

Remember that the Z-score, which also normalizes the value, is $Z = \frac{x - \mu}{\sigma}$. In this case, $Z = \frac{66 - 50.9}{3.5} \approx 4.3$. The sign is important! The positive Z-score tells us the value is greater than the mean, which we know is true because 66 > 50.9.

6. The length of a phone conversation, measured in minutes, follows a normal distribution with a mean of 7 minutes and a standard deviation of 2.2 minutes. What is the probability that the phone conversation lasts less than 8.5 minutes?

0.75

Logically, we know that some conversations last more than 8.5 minutes, so (D) can't be right. We also know that the majority of phone conversations last 8.5 minutes or below, so (C) can't be right, either. To determine the probability that a conversation is less than or equal to 8.5 minutes, first we need to determine the Z-score of 8.5 minutes, which is $Z = \frac{8.5 - 7.0}{2.2} \approx 0.68$. Using a standard normal table reveals that this is associated with a probability of 0.75, so the correct answer is (A).

7. The length of a phone conversation, measured in minutes, follows a normal distribution with a mean of 7 minutes and a standard deviation of 2.2 minutes. What is the probability that the phone conversation lasts more than 15 minutes?

0.00014

To determine the probability that a conversation is greater than 15 minutes, we need to determine the probability that the phone conversation is less than or equal to 15 and then subtract that value from 1.0. The Z-score of 15 minutes is $Z = \frac{15 - 7.0}{2.2} \approx 3.64$. The Z-score table reveals that the probability that a conversation is less than or equal to 15 minutes is 0.99986, so the probability that the conversation is greater than 15 minutes is 1.0 – 0.99986 = 0.00014, or (B). This makes sense, since the probability of a long phone conversation is very small, meaning definitely not (A) or (D).

8. The length of a phone conversation, measured in minutes, follows a normal distribution with a mean of 7 minutes and a standard deviation of 2.2 minutes. What is the probability that the phone conversation lasts between 8.5 and 15 minutes?

0.2498

To determine the probability that a conversation lies between the two values, we determine the probability that the conversation is less than or equal to 15 minutes and subtract the probability that it is less than or equal to 8.5 minutes. This is approximately 0.9998 – 0.75 = 0.2498. We could also consider this logically, knowing that (A) can't be right because the probability of a conversation shorter than 8.5 minutes is already 0.75 (meaning the probability of conversations over 8.5 minutes can't exceed 0.25), (B) can't be right for the same reason, and (D) makes no sense since some conversations are under 8.5 minutes, and some are over 15.
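The three phone conversation probabilities can be sketched with the error function standing in for a printed Z-score table. The helper name `normal_cdf` is an assumption of this sketch. Because no rounding of Z or of the table entries is involved, the results land close to, but not exactly on, the table-based answers.

```python
import math


def normal_cdf(x, mu, sigma):
    """Probability that a normal(mu, sigma) value is less than or equal to x."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


mu, sigma = 7.0, 2.2  # mean and standard deviation of call length, in minutes

p_under_8_5 = normal_cdf(8.5, mu, sigma)      # less than 8.5 minutes
p_over_15 = 1.0 - normal_cdf(15, mu, sigma)   # more than 15 minutes
p_between = normal_cdf(15, mu, sigma) - p_under_8_5  # between 8.5 and 15 minutes

print(f"P(length < 8.5)      = {p_under_8_5:.4f}")
print(f"P(length > 15)       = {p_over_15:.5f}")
print(f"P(8.5 < length < 15) = {p_between:.4f}")
```

The three regions cover the whole curve, so the three probabilities add up to 1. That is a handy sanity check on any table lookup.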

9. The class mean for a recent chemistry exam was 80.5 with a standard deviation of 4.2. What is the Z-score of a student who receives an 87 on the exam?

1.54

We can calculate the Z-score using the formula $Z = \frac{x - \mu}{\sigma} = \frac{87 - 80.5}{4.2} \approx 1.547$, or 1.54 when truncated to two decimal places. In other words, the student's score of 87 is about 1.54 standard deviations away from the average score in the class. Since the Z-score is positive, that means the value is greater than the average.

10. The class mean for a recent chemistry exam was 85.5 with a standard deviation of 2.2. What is the Z-score of a student who receives an 87 on this exam?