Overfitting


Our cousin Jaime has put on a little weight recently, and as we sit across from him at a family dinner, we can’t help but notice that his favorite shirt is straining at the seams trying to accommodate his expanding girth. We snicker and tell him it looks like his body is overfitting his clothes. He snickers back and tells us we used the word “overfitting” incorrectly.

He’s right, we did. “Overfitting” is actually what happens when we build a model that fits its training data so closely that it can’t be generalized to the real world. It can’t predict; it can only echo the historical data upon which we built it. And if that data—our training data—isn’t representative of reality, then our model won’t be, either. This usually happens for one of two reasons: either our model has way too many parameters, or we have so few data points that the model treats anomalies as part of the pattern it’s supposed to recognize and predict.

As an example, let’s get back to poor Jaime. Let’s say we’re trying to predict how many cans of Pepsi he’ll drink in a given day, so we keep a log for seven days. On Monday, Tuesday, and Thursday, he drank four cans. On Wednesday and Saturday, it was five. Friday was three, and Sunday was zero. Based solely on this data, our model might predict that Jaime won’t drink Pepsi on Sundays. In reality, though, he was sick that day and only drank green tea with honey and milk. That anomaly should not be a part of our Pepsi predictions, because it was just that: an anomaly. Our model overfits the data.
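To see what that looks like in practice, here’s a minimal sketch in Python (assuming NumPy; the numbers are just the tallies from Jaime’s week above). A degree-6 polynomial has seven parameters for seven data points, so it memorizes every observation, sick-day anomaly included, while a plain weekly average shrugs the anomaly off:

```python
import numpy as np

days = np.arange(7)                      # Mon=0 ... Sun=6
cans = np.array([4, 4, 5, 4, 3, 5, 0])   # Sunday's 0 is the sick-day anomaly

# Overfit model: a degree-6 polynomial has as many parameters as we have
# data points, so it passes exactly through every observation.
overfit = np.polyfit(days, cans, deg=6)

# Simple model: just the weekly average.
average = cans.mean()

print(np.polyval(overfit, 6))  # ~0.0 -- "predicts" Jaime never drinks Pepsi on Sundays
print(average)                 # ~3.57 -- a more honest guess for any given day
```

The polynomial scores perfectly on the week it has already seen, which is exactly the trap: next Sunday, when a healthy Jaime cracks open his usual can or four, the fancy model is caught flat-footed, while the boring average was never fooled.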
