High School: Statistics and Probability

Interpreting Categorical and Quantitative Data HSS-ID.B.5

5. Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.

Whatever way we spin it, statistics is about numbers. So obviously, it makes sense that statisticians use a lot of numerical data (height, weight, age, etc.), but even that gets too easy after a while. Data that isn't represented numerically is known as categorical data (eye color, hair color, sex, etc.).

Although it may seem like there isn't much we can do with categorical data (after all, how can we analyze a person's brown eye color?), statisticians would beg to differ. Well, there's a 92% chance they'd beg to differ, anyway.

Students should know what to do with categorical data and how to analyze it. Students should be able to analyze data from two different categories. For instance, data collected from both men and women about their favorite DC comic book superheroes (Wonder Woman, Batman, or Superman) can be summarized in one table.

          Wonder Woman   Batman   Superman   Total
Male            25         40        35       100
Female          62         23        15       100
Total           87         63        50       200

This table is a two-way frequency table because we can break the data down in 2 ways: by sex (100 of the people surveyed are male and 100 are female) or by favorite superhero (87 prefer Wonder Woman, 63 prefer Batman, and 50 prefer Superman).
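The totals quoted above can be checked with a short script. This is a minimal Python sketch, not part of the original lesson. The six cell counts are an assumption, inferred from the totals and relative frequencies quoted in this section, and the names `counts`, `row_totals`, and `col_totals` are only illustrative.

```python
# Cell counts inferred from the figures quoted in the text (an assumption).
counts = {
    "Male":   {"Wonder Woman": 25, "Batman": 40, "Superman": 35},
    "Female": {"Wonder Woman": 62, "Batman": 23, "Superman": 15},
}
heroes = ["Wonder Woman", "Batman", "Superman"]

# Row totals: how many men and how many women were surveyed.
row_totals = {sex: sum(row.values()) for sex, row in counts.items()}

# Column totals: how many people prefer each superhero.
col_totals = {hero: sum(counts[sex][hero] for sex in counts) for hero in heroes}

# Grand total: everyone surveyed.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

Adding across a row or down a column gives the same grand total, which is a quick way for students to check that a table has been filled in correctly.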

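Students should also be able to convert this data into a two-way relative frequency table by dividing every count by the total number of people surveyed (200):

          Wonder Woman   Batman   Superman   Total
Male          0.125       0.200     0.175     0.500
Female        0.310       0.115     0.075     0.500
Total         0.435       0.315     0.250     1.000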

Students should know what these numbers mean. The numbers in the middle are called joint probabilities because they depend on more than one category or event occurring at the same time. In this case, we want to know if a person is male or female and which superhero they prefer. So written in math language, each entry in the table represents P(Sex & Superhero).

The marginal probabilities represent the probability of only one category, P(Sex) or P(Superhero). They're called marginal because they're on the margins of the table. Duh.
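The joint and marginal values described above can be computed directly from counts. The sketch below is one possible way to do it in Python. The cell counts are an assumption inferred from the figures quoted in this section, and the abbreviations `WW`, `BM`, and `SM` follow the text.

```python
# Cell counts inferred from the figures quoted in the text (an assumption).
counts = {
    "Male":   {"WW": 25, "BM": 40, "SM": 35},
    "Female": {"WW": 62, "BM": 23, "SM": 15},
}
total = sum(sum(row.values()) for row in counts.values())  # everyone surveyed

# Joint relative frequencies: each cell divided by the grand total.
joint = {
    (sex, hero): n / total
    for sex, row in counts.items()
    for hero, n in row.items()
}

# Marginal relative frequencies: add the joint values along a row or a column.
p_sex = {sex: sum(p for (s, _), p in joint.items() if s == sex) for sex in counts}
p_hero = {
    hero: sum(p for (_, h), p in joint.items() if h == hero)
    for hero in ("WW", "BM", "SM")
}

print(round(joint[("Female", "WW")], 3))  # P(F & WW)
print(round(p_sex["Male"], 3))            # P(Male)
print(round(p_hero["WW"], 3))             # P(WW)
```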

If we know the data for one category and not the other (say, we know the person is male, but not which superhero they prefer), we can calculate the probability that his favorite superhero is Superman. This is called a conditional probability because it is conditional on knowing part of the data. We write this in math language as P(SM|Male). (The | symbol means "given.")

We can calculate P(SM|Male) from the relative frequency table because we know 0.50 of the people surveyed were men, and 0.175 of the people surveyed were both male and preferred Superman. We can use both of these values to determine:

P(SM|Male) = P(SM & Male) / P(Male) = 0.175 / 0.50 = 0.35
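As a quick check of this calculation, here is a small Python sketch. The helper name `conditional` is hypothetical, and the two inputs are the values quoted above.

```python
def conditional(p_joint, p_given):
    """Return P(A | B) = P(A & B) / P(B)."""
    if p_given == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_joint / p_given


p_male = 0.50          # marginal: P(Male)
p_sm_and_male = 0.175  # joint: P(SM & Male)

p_sm_given_male = conditional(p_sm_and_male, p_male)
print(round(p_sm_given_male, 3))
```

In words: of the men surveyed, 35% chose Superman.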

Students should feel comfortable creating, understanding, and using these tables to calculate probabilities for more than two categories. They should also be able to determine the probability of combinations (for instance, P(F) = P(F & WW) + P(F & BM) + P(F & SM) = 0.310 + 0.115 + 0.075 = 0.5) and negations (such as P(F & WW'), meaning the probability that a surveyed person is female and does not prefer Wonder Woman, as P(F & WW') = P(F & BM) + P(F & SM) = 0.115 + 0.075 = 0.190).
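The combination and negation above are just sums of joint values, which a few lines of Python can confirm. This is only an illustrative sketch using the three values quoted for women; the variable names are made up for the example.

```python
# Joint relative frequencies for women, as quoted in the text.
p_f_ww = 0.310  # P(F & WW)
p_f_bm = 0.115  # P(F & BM)
p_f_sm = 0.075  # P(F & SM)

# Combination: a woman prefers exactly one of the three heroes, so add them.
p_f = p_f_ww + p_f_bm + p_f_sm

# Negation: "female and not Wonder Woman" leaves Batman or Superman.
p_f_not_ww = p_f_bm + p_f_sm

# The same negation by subtraction: P(F & WW') = P(F) - P(F & WW).
p_f_not_ww_alt = p_f - p_f_ww

print(round(p_f, 3))
print(round(p_f_not_ww, 3))
print(round(p_f_not_ww_alt, 3))
```

Both routes to the negation should agree, which makes a useful self-check for students.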

Remind your students that these probabilities express the probability that a random person surveyed satisfies whatever categories are described in the parentheses.

Although many of these topics seem obvious or implicit, some students will have difficulty understanding the difference between numerical and categorical data or analyzing the frequency table. It's best to explain these topics with multiple examples, stressing the similarities and differences so that students understand what's important and what isn't. For example, while the number of categories isn't going to be the same in every example, the data will always be categorical.

Drills

1. Which of the following is categorical data?

Hair color

Height? That's numerical. Weight? Numerical. Hair color? While graphic designers might be able to assign a numerical value to a color, it is, in fact, categorical data. Shoe size is a number too, so we will call it numerical. The only categorical data is hair color.

2. The following table summarizes the hair color of a baseball team. What is the probability that a player has brown hair?

0.48

That poor old guy with gray hair. Or maybe he grayed early, like Anderson Cooper or George Clooney? Assuming no brown-haired players use hair dye, there are 15 players with brown hair and 31 players total. The probability that a player will have brown hair is P(Brown) = brown/total = 15/31 ≈ 0.48.

3. The following table summarizes the hair color of a baseball team. What is the probability that a player does not have gray hair?

0.97

Why are we still picking on the gray-haired player? He is obviously in the minority. To determine the probability that a player does not have gray hair, P(Gray'), we take 1 – P(Gray) because we know the total probability for gray and non-gray hair is equal to 1. Players either have gray hair or they don't. In this case, the correct answer is 1 – 1/31 ≈ 1 – 0.03 = 0.97.
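The two hair-color drills can be checked with exact fractions. The sketch below uses only the three numbers given in the solutions (31 players, 15 with brown hair, 1 with gray hair); the other hair colors are not needed, and the variable names are illustrative.

```python
from fractions import Fraction

total_players = 31
brown = 15
gray = 1

# Simple probability: favorable count over total count.
p_brown = Fraction(brown, total_players)

# Complement rule: P(Gray') = 1 - P(Gray).
p_gray = Fraction(gray, total_players)
p_not_gray = 1 - p_gray

print(p_brown, round(float(p_brown), 2))
print(p_not_gray, round(float(p_not_gray), 2))
```

Working in fractions first and rounding only at the end avoids the small rounding error that creeps in when 1/31 is rounded to 0.03 before subtracting.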

4. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. What is the probability that a student gets an A on the first exam and a B on the second exam?

          A1   B1   C1   Total
A2         2    7    1     10
B2         2   10    3     15
C2         1    3    1      5
Total      5   20    5     30

0.067

In this case we are looking for P(A1 & B2), the probability that someone gets an A on the first exam and a B on the second exam. Guess the student got too confident, stopped Shmooping, let her skills slip. Either way, there were 2 students out of 30 who got this combination of scores, so the correct answer is 2/30 ≈ 0.067.

5. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. What is the probability that a student gets a C on the first exam and an A on the second exam?

3.33%

A C on the first exam and an A on the second? They must be Shmoopers. We are looking for P(C1 & A2). Looking at the table reveals that only 1 student who originally got a C earned an A on the second exam. So the answer is 1/30 ≈ 0.0333, or 3.33%.
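The two joint probabilities in drills 4 and 5 can be read straight from a table of counts. In this Python sketch the nine cell counts are an assumption, inferred from the worked solutions to these drills, and the helper `joint` is a hypothetical name.

```python
from fractions import Fraction

# counts[second_exam][first_exam]; values inferred from the worked
# solutions in these drills (an assumption).
counts = {
    "A2": {"A1": 2, "B1": 7,  "C1": 1},
    "B2": {"A1": 2, "B1": 10, "C1": 3},
    "C2": {"A1": 1, "B1": 3,  "C1": 1},
}
total = sum(sum(row.values()) for row in counts.values())  # 30 students


def joint(first, second):
    """P(first & second): one cell divided by the grand total."""
    return Fraction(counts[second][first], total)


p_a1_b2 = joint("A1", "B2")
p_c1_a2 = joint("C1", "A2")

print(p_a1_b2, round(float(p_a1_b2), 3))
print(p_c1_a2, round(float(p_c1_a2), 4))
```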

6. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. What is the probability that a student gets a B on the second exam?

0.5

In this case we are looking for a marginal probability because it only depends on one category: the second exam. We want to isolate just the results from the second exam and look at how many of the 30 students earned a B on only that exam. If we ignore what the students earned on the first exam, there were 15 students who earned a B on the second exam (because 2 + 10 + 3 = 15), and we know that 15 is half of 30.

7. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. What is the probability that a student gets a C on the first exam?

0.1667

Another marginal probability problem! This time we will ignore the scores earned on the second exam and only look at scores from the first exam. In this case, 5 out of 30 students, or 16.67%, earned a C on the first exam.
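Marginal probabilities like the ones in drills 6 and 7 come from adding across a row or down a column. A minimal Python sketch follows. As before, the cell counts are an assumption inferred from the worked solutions.

```python
from fractions import Fraction

# counts[second_exam][first_exam]; values inferred from the worked
# solutions in these drills (an assumption).
counts = {
    "A2": {"A1": 2, "B1": 7,  "C1": 1},
    "B2": {"A1": 2, "B1": 10, "C1": 3},
    "C2": {"A1": 1, "B1": 3,  "C1": 1},
}
total = sum(sum(row.values()) for row in counts.values())  # 30 students

# Marginal for the second exam: add across the B2 row.
p_b2 = Fraction(sum(counts["B2"].values()), total)

# Marginal for the first exam: add down the C1 column.
p_c1 = Fraction(sum(row["C1"] for row in counts.values()), total)

print(p_b2, float(p_b2))
print(p_c1, round(float(p_c1), 4))
```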

8. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. Given that a student earns an A on the first exam, what is the probability he or she earns a C on the second exam?

20%

Here, we have a case of conditional probability. Given one condition for certain (the student earns an A on the first exam), what is the probability that the student earns a C on the second exam? We will confine our population to the students who earn an A on the first exam, which is 2 + 2 + 1 = 5 students. Of these 5, just 1 of them must have fallen asleep or missed a few classes and earned a C on the second exam. So the answer is 1 out of 5, or 0.20. We can also use the mathematical equation for conditional probabilities: P(C2|A1) = P(C2 & A1) / P(A1) = (1/30) / (5/30) = 1/5 = 0.20.

9. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. Given that a student earns a B on the first exam, what is the probability he earns a B on the second exam as well?

1 out of 2

This is another conditional probability. The first given condition is that the student earns a B on the first exam. There are 7 + 10 + 3 = 20 students who do so. Of these 20, we want to determine who gets a B on the second exam. Since 10 out of 20 do, half of the students who earn a B on the first exam will earn a B on the second exam. In mathspeak, P(B2|B1) = P(B2 & B1) / P(B1) = (10/30) / (20/30) = 10/20, which is 1 out of 2.
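The "confine the population" idea used in drills 8 and 9 translates directly into code: divide one cell by its column total instead of by the grand total. This is a hedged Python sketch; the cell counts are an assumption inferred from the worked solutions, and `conditional` is a hypothetical helper name.

```python
from fractions import Fraction

# counts[second_exam][first_exam]; values inferred from the worked
# solutions in these drills (an assumption).
counts = {
    "A2": {"A1": 2, "B1": 7,  "C1": 1},
    "B2": {"A1": 2, "B1": 10, "C1": 3},
    "C2": {"A1": 1, "B1": 3,  "C1": 1},
}


def conditional(second, first):
    """P(second | first): restrict the population to the `first` column."""
    column_total = sum(row[first] for row in counts.values())
    return Fraction(counts[second][first], column_total)


p_c2_given_a1 = conditional("C2", "A1")  # drill 8
p_b2_given_b1 = conditional("B2", "B1")  # drill 9

print(p_c2_given_a1, float(p_c2_given_a1))
print(p_b2_given_b1, float(p_b2_given_b1))
```

Dividing by the column total gives the same result as P(second & first) / P(first), because the grand total of 30 cancels out.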

10. The following table summarizes the number of students in a class that received different letter grades on 2 recent exams. The first exam is shown across the top and is summarized as A1, B1, and C1, and the second exam is in the first column, A2, B2, and C2. Given that a student earns a B on the first exam, what is the probability that she does not earn a C on the second exam?

0.85

Tricky, tricky! This is another conditional probability problem. The given condition is that the student earns a B on the first exam. We are trying to determine the probability that a student who earns a B on the first exam earns an A or a B on the second (since "not a C" means either an A or a B). So, in math language we are trying to determine P(C2' | B1). Recall that:

P(C2' | B1) = P(C2' & B1) / P(B1)

We need to figure out P(C2' & B1). Intuition may tell you that the count we need is 17, the number of people who got either an A or a B on the second exam and a B on the first exam. Mathematically this is:

P(C2' & B1) = P(A2 & B1) + P(B2 & B1) = 7/30 + 10/30 = 17/30.
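The whole calculation for this drill, ending at the stated answer of 0.85, can be sketched in Python. The B1 column counts (7, 10, and 3) come from the solution to drill 9, the class size of 30 from the earlier drills, and the variable names are illustrative only.

```python
from fractions import Fraction

total = 30  # students in the class

# Second-exam grades among students who earned a B on the first exam,
# taken from the sum 7 + 10 + 3 = 20 in the drill 9 solution.
b1_column = {"A2": 7, "B2": 10, "C2": 3}

# Marginal: P(B1).
p_b1 = Fraction(sum(b1_column.values()), total)

# Joint with the negation: P(C2' & B1) = P(A2 & B1) + P(B2 & B1).
p_not_c2_and_b1 = Fraction(b1_column["A2"] + b1_column["B2"], total)

# Conditional: P(C2' | B1) = P(C2' & B1) / P(B1).
p_not_c2_given_b1 = p_not_c2_and_b1 / p_b1

# Cross-check with the complement rule: 1 - P(C2 | B1).
p_c2_given_b1 = Fraction(b1_column["C2"], sum(b1_column.values()))

print(p_not_c2_and_b1)
print(p_not_c2_given_b1, float(p_not_c2_given_b1))
print(1 - p_c2_given_b1)
```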