# Basic Statistics & Probability

### Topics

**Box-and-whisker plots** are a handy way to display data broken into four groups by its quartiles, each group holding roughly the same number of data values. The box-and-whisker plot doesn't show frequency, and it doesn't display each individual data value, but it clearly shows where the middle of the data lies. It's a nice plot to use when analyzing how your data is skewed.

There are a few important vocabulary terms to know in order to graph a box-and-whisker plot. Here they are:

- **Q1 – quartile 1**, the median of the lower half of the data set
- **Q2 – quartile 2**, the median of the entire data set
- **Q3 – quartile 3**, the median of the upper half of the data set
- **IQR – interquartile range**, the difference from Q3 to Q1
- **Extreme values**, the smallest and largest values in a data set

Let's start by making a box-and-whisker plot (also known as a "box plot") of the geometry test scores we saw earlier:

90, 94, 53, 68, 79, 84, 87, 72, 70, 69, 65, 89, 85, 83, 72

**Step 1: Order the data from least to greatest.**

53, 65, 68, 69, 70, 72, 72, 79, 83, 84, 85, 87, 89, 90, 94

**Step 2: Find the median of the data.**

This is also called quartile 2 (Q2). With 15 scores, the median is the 8th value in the ordered list, so Q2 = 79.

**Step 3: Find the median of the data less than Q2.**

This is the lower quartile (Q1). The seven scores below Q2 are 53, 65, 68, 69, 70, 72, 72, so Q1 = 69.

**Step 4: Find the median of the data greater than Q2.**

This is the upper quartile (Q3). The seven scores above Q2 are 83, 84, 85, 87, 89, 90, 94, so Q3 = 87.

**Step 5: Find the extreme values.** These are the smallest and largest data values.

Extreme values = 53 and 94.
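Steps 1 through 5 can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` is made up for this example, and it assumes the "halves" are split by position in the ordered list, leaving the median itself out when the number of values is odd.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, Q1, Q2, Q3, largest) using the median-of-halves method."""
    data = sorted(values)                 # Step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]                   # values below the median position
    # Skip the median itself when n is odd; otherwise take the top half as is.
    upper = data[half + 1:] if n % 2 else data[half:]
    q2 = median(data)                     # Step 2
    q1 = median(lower)                    # Step 3
    q3 = median(upper)                    # Step 4
    return data[0], q1, q2, q3, data[-1]  # Step 5 adds the extreme values

scores = [90, 94, 53, 68, 79, 84, 87, 72, 70, 69, 65, 89, 85, 83, 72]
print(five_number_summary(scores))  # (53, 69, 79, 87, 94)
```

Other textbooks and software packages define quartiles slightly differently, so a calculator or spreadsheet may give values that differ a little from these.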

**Step 6: Create a number line that will contain all of the data values.**

It should stretch a little beyond each extreme value.

**Step 7: Draw a box from Q1 to Q3 with a line dividing the box at Q2.** Then extend "whiskers" from each end of the box to the extreme values.
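To get a feel for Steps 6 and 7 without graph paper, here is a rough text-only sketch. The helper `text_box_plot` is hypothetical, and it assumes every value is a whole number so that each character can stand for one point on the number line. The geometry-test values used are the extremes 53 and 94, the box from 69 to 87, and the median 79.

```python
def text_box_plot(minimum, q1, q2, q3, maximum, start, stop):
    """Draw a one-line box plot, one character per whole point from start to stop."""
    cells = []
    for x in range(start, stop + 1):
        if x == q2:
            cells.append("|")   # median line dividing the box
        elif x == q1 or x == q3:
            cells.append("#")   # left and right edges of the box
        elif q1 < x < q3:
            cells.append("=")   # inside of the box
        elif minimum <= x <= maximum:
            cells.append("-")   # whiskers out to the extreme values
        else:
            cells.append(" ")   # empty part of the number line
    return "".join(cells)

# The number line runs from 50 to 97, a little beyond each extreme value.
plot = text_box_plot(53, 69, 79, 87, 94, start=50, stop=97)
print(plot)
```

A plotting library would draw a nicer picture, but the idea is the same: a box from Q1 to Q3, a line at Q2, and whiskers out to the extremes.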

This plot is broken into four different groups: the lower whisker, the lower half of the box, the upper half of the box, and the upper whisker. Since each group holds about the same number of data values, each of those sections represents roughly 25% of the data.

Using this plot we can see that about 50% of the students scored between 69 and 87 points, about 75% of the students scored 87 points or lower, and about 50% scored above 79. If your score was in the upper whisker, you could feel pretty proud that you scored better than roughly 75% of your classmates. If you scored somewhere in the lower whisker, you may want to find a little more time to study.

## Outliers

**Outliers** are values that are *much* bigger or smaller than the rest of the data. These are represented by a dot at either end of the plot. Our geometry test example did not have any outliers: even though the score of 53 seemed much smaller than the rest, it wasn't small *enough*.

In order to be an outlier, the data value must be:

- larger than Q3 by at least 1.5 times the interquartile range (IQR), or
- smaller than Q1 by at least 1.5 times the IQR.
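As a quick check of this rule, the sketch below applies it to the geometry test scores, whose box runs from Q1 = 69 to Q3 = 87. The function name `outlier_fences` is invented for this illustration; "fence" is just a label for the two cutoff values.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs: 1.5 times the IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

scores = [90, 94, 53, 68, 79, 84, 87, 72, 70, 69, 65, 89, 85, 83, 72]

low, high = outlier_fences(69, 87)
print(low, high)  # 42.0 114.0

# A score is an outlier if it sits at or beyond either cutoff.
outliers = [s for s in scores if s <= low or s >= high]
print(outliers)  # []
```

The lowest score, 53, is well above the lower cutoff of 42, which is why it does not count as an outlier.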

Below are the individual final results for the men's large hill ski jumping event at the Winter Olympics.

283.6 | 269.4 | 262.2 | 261.2 | 246.7 |

245.5 | 239.2 | 233.7 | 230.3 | 227.9 |

226.4 | 225.5 | 224.1 | 223.6 | 222.3 |

221.4 | 217.8 | 217.2 | 216.9 | 211.6 |

211.4 | 208.5 | 204.9 | 202.7 | 202.4 |

200.5 | 198.5 | 182.4 | 111 |

The quartiles are as follows: Q1 is 206.7, Q2 is 222.3, and Q3 is 236.45. (With 29 scores, Q2 is the 15th value. Q1 is the median of the 14 scores below it, which is the average of 204.9 and 208.5, and Q3 is the median of the 14 scores above it, which is the average of 233.7 and 239.2.) The lowest score (111) seems like it might be an outlier, since it is so much smaller than the rest of the data. However, we can't be sure until we check.

First, we must calculate the IQR, which is Q3 – Q1. Then we multiply that by 1.5 to get the number needed for our analysis of a possible outlier.

IQR = 236.45 - 206.70 = 29.75

1.5(IQR) = 1.5(29.75) ≈ 44.63

206.7 - 44.63 = 162.07

So, for the number in question (111) to qualify as an outlier in this example, it would have to be *less than* 162.07, which is the difference between Q1 (which is 206.7) and 44.63.

It is! Since 111 is less than 162.07, 111 is officially an outlier. When we make a box-and-whisker plot of this data, we represent 111 with a dot and only extend the lower whisker to the next smallest data value (182.4).

We should also check for outliers in the upper half of the data:

236.45 + 44.63 = 281.08

There is one value above 281.08 (the top score, 283.6), so it is an outlier as well. It also gets a dot, and the upper whisker stops at the next largest value (269.4).
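The whole check can also be run directly on the ski jumping scores. The sketch below is one possible way to do it, not an official method: the helper `find_outliers` is hypothetical, and it computes the quartiles as medians of the lower and upper halves of the ordered data, leaving out the middle value when the count is odd.

```python
from statistics import median

jumps = [
    283.6, 269.4, 262.2, 261.2, 246.7,
    245.5, 239.2, 233.7, 230.3, 227.9,
    226.4, 225.5, 224.1, 223.6, 222.3,
    221.4, 217.8, 217.2, 216.9, 211.6,
    211.4, 208.5, 204.9, 202.7, 202.4,
    200.5, 198.5, 182.4, 111,
]

def find_outliers(values):
    """Return (outliers, lower whisker end, upper whisker end)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])       # median of the lower half
    q3 = median(data[-half:])      # median of the upper half
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x <= low or x >= high]
    inside = [x for x in data if low < x < high]
    # Whiskers stop at the most extreme values that are not outliers.
    return outliers, inside[0], inside[-1]

outliers, whisker_low, whisker_high = find_outliers(jumps)
print(outliers)                   # [111, 283.6]
print(whisker_low, whisker_high)  # 182.4 269.4
```

Both the lowest and the highest score are flagged, so each would be drawn as a dot, with the whiskers ending at 182.4 and 269.4.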