High School: Statistics and Probability
Interpreting Categorical and Quantitative Data HSS-ID.A.4
4. Use the mean and standard deviation of a data set to fit it to a normal distribution and to estimate population percentages. Recognize that there are data sets for which such a procedure is not appropriate. Use calculators, spreadsheets, and tables to estimate areas under the normal curve.
Students should already know that the distribution of data can take many forms. It can be symmetric, skewed, distributed uniformly, or follow a normal distribution, also known as a bell curve (think Liberty Bell, not jingle bell) or a Gaussian distribution. They don't have to know why a normal distribution has so many different names, although it couldn't hurt.
Students should know that we can describe normal distributions as relative frequency distributions by expressing the counts as percents instead of raw values. For example, a cookie factory can produce and package 20,000 boxes of cookies in a month. Each box of cookies is supposed to weigh 22 ounces, but no cookie is perfect (although we'll argue that every cookie is perfect). The following histogram with 10 bins shows the actual weight of the cookie boxes in a month's worth of production. The data has a mean of 22 ounces and a standard deviation of 1.0 ounce.
When given this data in the form of a table, students should be able to find the percentages or probabilities for each value. This results in a relative frequency distribution, where the y-axis of the histogram is between 0 and 1 (or 0 and 100%) and the sum of all the percentages is equal to 1 (or 100%).
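As a rough sketch of that conversion, the snippet below turns bin counts into relative frequencies. The counts are hypothetical (they are not read off the original histogram); they were chosen only so that they total 20,000 boxes and look roughly bell-shaped across ten one-ounce bins.

```python
# Hypothetical bin counts for 20,000 boxes (NOT the original histogram's data).
bin_edges = list(range(17, 28))  # 17, 18, ..., 27 ounces -> 10 bins
counts = [1, 26, 428, 2718, 6827, 6827, 2718, 428, 26, 1]

total = sum(counts)

# Relative frequency = count in the bin divided by the total count.
relative_freq = [c / total for c in counts]

for low, high, rf in zip(bin_edges, bin_edges[1:], relative_freq):
    print(f"{low}-{high} oz: {rf:.4%}")

# The relative frequencies should add up to 1 (up to floating-point rounding).
print("sum of relative frequencies:", round(sum(relative_freq), 10))
```

Dividing by the total is all it takes; the shape of the histogram stays the same, and only the scale of the y-axis changes.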
We can fit a bell curve to this distribution.
When a normal curve is represented as a continuous line, like the one in the figure above, it is called a continuous distribution. The area under the curve of the continuous distribution is always equal to 1.0 (just like if we add up all the percentages in the table above, the sum is 100%).
Students should know when it makes sense to talk about many values in terms of a continuous distribution. The weight of the cookie box, for instance, could be anywhere from 17 to 27 ounces (or less, or more), and the weight does not need to fall on an integer value. It could be 21.87 or 22.9 or 0 or 82,729. Now that's a lot of cookie.
Assuming a normal distribution, students should be able to approximate the shape of the continuous distribution given the average and the standard deviation. As the standard deviation increases, the bell shape begins to flatten out because a greater standard deviation suggests the data is spread out more from the mean.
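One way to see the flattening is to compare the height of the curve at the mean for a few standard deviations. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the three σ values are assumptions picked for illustration, with only σ = 1.0 coming from the cookie example.

```python
from statistics import NormalDist

mean = 22.0  # ounces, as in the cookie example

# Height of the bell curve at its peak (x = mean) for several spreads.
# Only sigma = 1.0 comes from the example; 0.5 and 2.0 are illustrative.
peaks = {}
for sigma in (0.5, 1.0, 2.0):
    peaks[sigma] = NormalDist(mean, sigma).pdf(mean)
    print(f"sigma = {sigma}: peak height = {peaks[sigma]:.4f}")
```

Because the total area must stay at 1, a wider curve has to be shorter: doubling σ halves the peak height.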
Students should also know that about 68% of the data will fall in between the points of inflection (which are exactly one standard deviation, ±σ, away from the mean). If we increase the distance to 2 standard deviations from the mean (±2σ), we will capture about 95% of the data, and moving three standard deviations away captures about 99.7% of the data. This is called the empirical rule.
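The empirical rule's percentages can be checked with a few lines of code. The sketch below uses the standard normal distribution from Python's `statistics` module; the helper name `within` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {within(k):.4f}")
# prints 0.6827, 0.9545 and 0.9973
```

The rounded figures of 68%, 95%, and 99.7% are approximations of these values, which is why they are easy to remember but should not be treated as exact.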
The Z-score is the number of standard deviations a data point is away from the mean. It's a useful way to normalize all normal distributions. (And you thought they couldn't get more normal.) Students should be able to calculate a Z-score using the following formula.
Z = (x - μ) / σ
Here, μ is the true mean, σ is the standard deviation, and x is the data point in question. If we pick a cookie box that weighs 25 ounces and we know that μ = 22 and σ = 1.0, we can determine the Z-score as:
Z = (25 - 22) / 1.0 = 3
The weight of this box is 3 standard deviations from the mean. If we know that 99.7% of the data lie within three standard deviations of the mean, what does that suggest about this box of cookies? It probably means we scored cookie box gold!
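The Z-score calculation, Z = (x - μ) / σ, can be sketched as a small function. The function name `z_score` and the second example weight (21.5 ounces) are assumptions added here for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# The 25-ounce box from the example: mu = 22, sigma = 1.0.
print(z_score(25, 22, 1.0))    # 3.0

# A lighter, hypothetical box: a negative Z-score means below the mean.
print(z_score(21.5, 22, 1.0))  # -0.5
```

The sign of the result tells students which side of the mean the data point is on, and the size tells them how unusual it is.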
Students should be able to find the area under a portion of the curve (say, our chances of finding a cookie box that weighs 25 ounces or more), using the Z-score and a table to do it. More than that, they should understand that this area represents the probability that a random data point will fall within the described region.
Remind students that σ and Z are different (σ is a spread measured in the data's units, while Z counts how many σ a data point lies from the mean), and to be careful about which table they're using to find the area under the curve (some tables are cumulative starting from -∞, and others start at the mean). They could also be reminded that the area under the entire curve is always 1.
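To make the two table conventions concrete, here is a minimal sketch that computes both kinds of area for Z = 3 (the 25-ounce box), along with the upper-tail probability of a box weighing 25 ounces or more. It uses `statistics.NormalDist` in place of a printed table; the variable names are made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
z = 3.0                    # Z-score of the 25-ounce box

cumulative = std_normal.cdf(z)  # area from -infinity up to Z (cumulative table)
from_mean = cumulative - 0.5    # area from the mean up to Z (table starting at the mean)
upper_tail = 1.0 - cumulative   # area beyond Z, i.e. P(Z >= 3)

print(f"cumulative from -infinity: {cumulative:.5f}")
print(f"from the mean:             {from_mean:.5f}")
print(f"upper tail, P(Z >= 3):     {upper_tail:.5f}")

# Expected number of boxes at 25 ounces or more in a month of 20,000 boxes.
print(f"expected boxes out of 20,000: {20000 * upper_tail:.0f}")
```

The two table values differ by exactly 0.5, the area to the left of the mean, so reading the wrong kind of table shifts an answer by 50 percentage points. Under the normal model, only about 27 of the 20,000 boxes would weigh 25 ounces or more.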
Common sense is never overrated. With all these variables and numbers and tables, it's easy to get confused. We don't need a calculator or a table to figure out that Prob(Z ≤ 0) = 0.5 because the chances of a random data point being less than the mean (Z ≤ 0) or greater than the mean (Z ≥ 0) are each 50%. If they understand what these variables and numbers and tables actually do, students are less likely to make silly errors and perform unnecessary calculations.
Here are video resources teachers can use to explain the normal distribution curve.
- Mean, Median, and Mode
- Normal Distribution Curve