Wednesday, July 18, 2012

Matching the data to the predictions

The results are in: one prediction held up and the other fell flat.

First the falling flat:

As you can see from the top two plots, the highest and second-highest scores tended to come in the first inning, not later in the game. In fact, there's a strong drop at the end, suggesting that relief pitchers do help keep the score down.

I conclude that the high score doesn't come when the pitcher gets tired. I was wrong. In fact, it looks more like the pitcher isn't quite in the game yet, or the team hasn't gotten synchronized for play yet.

Now let's look at the second-highest score distributions, to see what we can see.

The distributions of highest and second-highest are roughly similar, except that the second-highest inning's score is, on average, less than half that of the highest inning's score. The reduced score distribution for 7 innings (extra-inning games are a small effect here, and I ignore the correction) looks much more Poissonian, dropping by about a factor of 2 each time. That's on the lower right: compare with the 8-inning reduced score on the lower left.

I used the 7-inning reduced score distribution to predict the highest and second-highest innings' scores; those predictions are in the bottom row of the figure below. Recall that when you are looking at the highest score of a list, a 0 is much less likely than it is for any single entry, because every entry in the list would have to be 0. The predicted distributions have the same general shape as the real distributions above, but with much smaller averages. The average real highest inning's score is 2.3 but the estimate's is 0.7, and the average real second-highest inning's score is 1.1 but the estimate's is 0.5.
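For anyone who wants to try this kind of prediction, here is a minimal sketch in Python. It is not the code behind the figures. It assumes the innings are independent draws from one per-inning distribution, assumes 9 innings per game, and uses a made-up placeholder distribution (each score half as likely as the one before, cut off at 7 runs) rather than the real reduced distribution. Under those assumptions, the chance that the highest inning is at most k is F(k)^n, and the chance that the second-highest is at most k is F(k)^n + n F(k)^(n-1) (1 - F(k)), where F is the cumulative per-inning distribution.

```python
def top_two_distributions(inning_pmf, n=9):
    """Return (pmf of highest, pmf of second-highest) inning score.

    inning_pmf[k] is the probability that one inning scores k runs.
    Assumes n independent innings drawn from the same distribution.
    """
    # Cumulative distribution F(k) = P(inning score <= k)
    cdf = []
    running = 0.0
    for p in inning_pmf:
        running += p
        cdf.append(running)

    # P(highest <= k): all n innings are <= k
    high_cdf = [F ** n for F in cdf]
    # P(second-highest <= k): all n are <= k, or exactly one exceeds k
    second_cdf = [F ** n + n * F ** (n - 1) * (1 - F) for F in cdf]

    def to_pmf(c):
        # Difference a cumulative distribution back into a pmf
        return [c[0]] + [c[k] - c[k - 1] for k in range(1, len(c))]

    return to_pmf(high_cdf), to_pmf(second_cdf)


def mean(pmf):
    """Average score for a pmf indexed by score."""
    return sum(k * p for k, p in enumerate(pmf))


# Placeholder per-inning distribution (NOT the real data):
# each score is half as likely as the previous one, scores 0 through 7.
weights = [0.5 ** k for k in range(8)]
total = sum(weights)
inning_pmf = [w / total for w in weights]

high_pmf, second_pmf = top_two_distributions(inning_pmf)
print("P(single inning scores 0):", round(inning_pmf[0], 3))
print("P(highest inning scores 0):", round(high_pmf[0], 5))
print("mean highest:", round(mean(high_pmf), 2))
print("mean second-highest:", round(mean(second_pmf), 2))
```

Swapping the measured reduced distribution in for `inning_pmf` would give predicted averages to compare against the real 2.3 and 1.1. The printed zero probabilities also show why 0 is so unlikely as the highest score of a list.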

What can I conclude from this?

The highest inning's score is in substantial disagreement with the estimate. So is the second-highest inning's score, though the real distribution could be a combination of the background estimate and something else.

  1. The first inning is the problematic one: either the team isn't together or the pitcher isn't.
  2. The distribution of scores in the highest-scoring inning is not attributable to chance. Something is different during that inning.
  3. The second-highest scoring inning is partly like the highest-scoring inning and partly like chance.
  4. The distributions don't look much different between home team and visiting team.


Grandma Bee said...

And this is one of the things I love about baseball--it defies formula. You can compute the arc and the distance of a fly ball but not whether the outfielder will catch it or lose it in the sun. In baseball, all constants are variables.

Assistant Village Idiot said...

Very cool.

The statement is usually that the pitcher "doesn't have it" that day, the implication being he wouldn't improve if he stayed in. Your suggestion that some acclimatisation to the game may be harder than it appears at first seems more sound.

However, the first inning is also the one in which the highest OBP hitters are guaranteed to come up. That doesn't look like it would explain even half your difference, but it might be 25%, from a seat-of-the-pants estimate looking at your stuff.

james said...


The second-highest scoring inning is more than twice as likely to come immediately after the first-highest as any other time. Maybe that's the fruition of the offensive lineup strategy, or maybe the team/pitcher goes into a temporary slump.