High School: Statistics and Probability
Interpreting Categorical and Quantitative Data HSS-ID.B.5
5. Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.
Whatever way we spin it, statistics is about numbers. So obviously, it makes sense that statisticians use a lot of numerical data (height, weight, age, etc.), but even that gets too easy after a while. Data that isn't represented numerically is known as categorical data (eye color, hair color, sex, etc.).
Although it may seem like there isn't much we can do with categorical data (after all, how can we analyze a person's brown eye color?), statisticians would beg to differ. Well, there's a 92% chance they'd beg to differ, anyway.
Students should know what to do with categorical data and how to analyze it. Students should be able to analyze data from two different categories. For instance, data collected from both men and women about their favorite DC comic book superheroes (Wonder Woman, Batman, or Superman) can be summarized in one table.
Such a table is a two-way frequency table because it breaks the data down by two categorical variables: sex (100 of the people surveyed are male and 100 are female) and favorite superhero (87 prefer Wonder Woman, 63 prefer Batman, and 50 prefer Superman).
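To make the layout of a two-way frequency table concrete, here is a minimal Python sketch that prints one with its row and column totals. The six cell counts are an assumption: they are reconstructed from the totals above and the relative frequencies quoted later in the text (for example, 0.310 of 200 people is 62), not copied from an original table.

```python
# Cell counts are an ASSUMPTION reconstructed from figures quoted in the text:
# 0.310 * 200 = 62, 0.115 * 200 = 23, 0.075 * 200 = 15, 0.175 * 200 = 35,
# and the remaining male cells follow from the column totals 87 and 63.
counts = {
    "Male":   {"Wonder Woman": 25, "Batman": 40, "Superman": 35},
    "Female": {"Wonder Woman": 62, "Batman": 23, "Superman": 15},
}
heroes = ["Wonder Woman", "Batman", "Superman"]

# Marginal totals: one per row (sex) and one per column (superhero).
row_totals = {sex: sum(row.values()) for sex, row in counts.items()}
col_totals = {h: sum(counts[sex][h] for sex in counts) for h in heroes}
grand_total = sum(row_totals.values())

# Print the table with the totals in the margins.
print(f"{'':8}" + "".join(f"{h:>14}" for h in heroes) + f"{'Total':>8}")
for sex, row in counts.items():
    cells = "".join(f"{row[h]:>14}" for h in heroes)
    print(f"{sex:8}" + cells + f"{row_totals[sex]:>8}")
print(f"{'Total':8}" + "".join(f"{col_totals[h]:>14}" for h in heroes) + f"{grand_total:>8}")
```

The row and column totals printed in the margins should match the totals stated in the text (100 and 100; 87, 63, and 50).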
Students should also be able to convert this data into a two-way relative frequency table by dividing each count by the total number of people surveyed (here, 200).
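One way to carry out that conversion is to divide every count by the grand total. The sketch below does this in Python; as an assumption, the cell counts are reconstructed from the totals and relative frequencies quoted in the text rather than taken from an original table.

```python
# ASSUMPTION: counts reconstructed from the figures quoted in the text.
counts = {
    "Male":   {"Wonder Woman": 25, "Batman": 40, "Superman": 35},
    "Female": {"Wonder Woman": 62, "Batman": 23, "Superman": 15},
}

# Grand total of people surveyed.
total = sum(sum(row.values()) for row in counts.values())

# Each relative frequency is a count divided by the grand total.
relative = {
    sex: {hero: n / total for hero, n in row.items()}
    for sex, row in counts.items()
}

for sex, row in relative.items():
    print(sex, {hero: f"{p:.3f}" for hero, p in row.items()})
```

All six relative frequencies should add up to 1, which is a quick check students can run on any table they build.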
Students should know what these numbers mean. The numbers in the middle of the table are called joint relative frequencies, and we can read them as joint probabilities because each one describes two categories occurring at the same time. In this case, we want to know both whether a person is male or female and which superhero they prefer. Written in math language, each entry in the table represents P(Sex & Superhero).
The marginal probabilities represent the probability of only one category, P(Sex) or P(Superhero). Each one is the sum of the joint probabilities in its row or column. They're called marginal because they're on the margins of the table. Duh.
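The link between joint and marginal probabilities can be sketched in a few lines of Python: summing the joint entries across one variable gives the marginal for the other. The female entries and P(Male & Superman) are quoted in the text; the other two male entries are assumptions derived from the column totals.

```python
# Joint relative frequencies P(Sex & Superhero).
# ASSUMPTION: 0.125 and 0.200 are derived from the column totals
# (87 and 63 out of 200); the other four values are quoted in the text.
joint = {
    ("Male", "Wonder Woman"): 0.125,
    ("Male", "Batman"): 0.200,
    ("Male", "Superman"): 0.175,
    ("Female", "Wonder Woman"): 0.310,
    ("Female", "Batman"): 0.115,
    ("Female", "Superman"): 0.075,
}

# Accumulate marginals by summing over the other variable.
p_sex = {}
p_hero = {}
for (sex, hero), p in joint.items():
    p_sex[sex] = p_sex.get(sex, 0.0) + p
    p_hero[hero] = p_hero.get(hero, 0.0) + p

for name, p in {**p_sex, **p_hero}.items():
    print(f"P({name}) = {p:.3f}")
```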
If we know the data for one category and not the other (say, we know the person is male, but not which superhero they prefer), we can calculate the probability that his favorite superhero is Superman. This is called a conditional probability because it is conditional on knowing part of the data. We write this in math language as P(SM|Male). (The | symbol means "given.")
We can calculate P(SM|Male) from the relative frequency table because we know 0.50 of the people surveyed were men, and 0.175 of the people surveyed were both male and preferred Superman. Dividing the joint probability by the marginal probability gives:
P(SM|Male) = P(Male & SM) / P(Male) = 0.175 / 0.50 = 0.35
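The same calculation can be checked numerically. The sketch below uses a small helper, `conditional`, which is a hypothetical name introduced here for illustration; it divides a joint probability by the marginal probability of the known category, using the two values quoted above.

```python
def conditional(p_joint, p_given):
    """Return P(A | B) = P(A & B) / P(B); requires P(B) > 0."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given

p_male = 0.50          # marginal: P(Male), quoted in the text
p_male_and_sm = 0.175  # joint: P(Male & SM), quoted in the text

p_sm_given_male = conditional(p_male_and_sm, p_male)
print(f"P(SM|Male) = {p_sm_given_male:.2f}")
```

Note that the result is larger than the joint probability: once we know the person is male, we are only looking at half of the people surveyed.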
Students should feel comfortable creating, understanding, and using these tables to calculate probabilities for more than two categories. They should also be able to determine the probability of combinations. For instance, P(F) = P(F & WW) + P(F & BM) + P(F & SM) = 0.310 + 0.115 + 0.075 = 0.500. The same goes for negations, such as P(F & WW'), the probability that a surveyed person is female and does not prefer Wonder Woman: P(F & WW') = P(F & BM) + P(F & SM) = 0.115 + 0.075 = 0.190.
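As a quick check on the arithmetic above, the following Python sketch adds the joint probabilities for the female row and computes the negation two ways. The variable names are illustrative; the three values are the ones quoted in the text.

```python
# Joint probabilities for the female row, as quoted in the text.
p_f_ww = 0.310  # P(F & WW)
p_f_bm = 0.115  # P(F & BM)
p_f_sm = 0.075  # P(F & SM)

# Combination: P(F) is the sum of every joint probability involving F.
p_f = p_f_ww + p_f_bm + p_f_sm

# Negation: P(F & WW') adds the female entries that are not Wonder Woman.
p_f_not_ww = p_f_bm + p_f_sm

# Equivalent route: subtract the excluded entry from the marginal.
p_f_not_ww_alt = p_f - p_f_ww

print(f"P(F) = {p_f:.3f}")
print(f"P(F & WW') = {p_f_not_ww:.3f}")
print(f"P(F) - P(F & WW) = {p_f_not_ww_alt:.3f}")
```

Both routes to the negation should agree, which gives students a way to verify their own work.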
Remind your students that these probabilities express the probability that a random person surveyed satisfies whatever categories are described in the parentheses.
Although many of these topics seem obvious or implicit, some students will have difficulty understanding the difference between numerical and categorical data or analyzing the frequency table. It's best to explain these topics with multiple examples, stressing the similarities and differences so that students understand what's important and what isn't. For example, while the number of categories isn't going to be the same in every example, the data will always be categorical.