Wednesday, December 09, 2009

I don't see how this could work

We're instructed that although North America has seen lower temperatures than normal for the last decade, the southern hemisphere has been warmer.

For a look at some of the raw data, and a reasonable analysis thereof, look at this fellow's check. It illustrates the problems associated with weather monitoring: variations, incomplete coverage, unexplained changes in apparent baseline... Take that last point. One station moved its monitor's position one year (which can change the average temperature!). And lo and behold, there was an apparent change in the baseline temperature--but not at exactly that year. What do you do with that? Something happened--maybe somebody put a birdbath under the sensor, or maybe somebody got rid of the barbecue pit. Unfortunately there were no other temperature monitors around that year, so you can't cross check.

Name your choice: don't correct the data; take averages before and after the break and lower the earlier data by the difference; do the same but raise the later data instead; or, if you can show that something actually went wrong, leave that stretch out completely.
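For concreteness, here is a minimal sketch of those choices in Python. The station values and the breakpoint index are made up for illustration; nothing here comes from the actual station record.

```python
import numpy as np

# Hypothetical annual means (deg C) for one station, with an apparent
# downward step in the baseline starting at index `break_idx`.
# All numbers are invented for illustration.
temps = np.array([12.1, 12.3, 12.0, 12.2, 11.4, 11.5, 11.3, 11.6])
break_idx = 4  # assumed location of the suspect break

# Size of the apparent step: average before minus average after.
offset = temps[:break_idx].mean() - temps[break_idx:].mean()

# Choice 1: don't correct the data.
as_recorded = temps.copy()

# Choice 2: lower the earlier data by the before/after difference.
lower_earlier = temps.copy()
lower_earlier[:break_idx] -= offset

# Choice 3: raise the later data by the same difference.
raise_later = temps.copy()
raise_later[break_idx:] += offset

# Choice 4: if you can show something bad happened, drop the suspect span.
keep_only_trusted = temps[break_idx:]
```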

As a hint to the reader, more stations came online later, and their data agree very tightly with each other and with the old station--so the more recent numbers are more likely to be correct. Therefore you should either leave the data alone or lower the old values. Without any smoking gun to explain the difference I'd leave it alone.

That isn't what was done. In fact, what happened next is a little hard to credit. Correction terms were added to all the recent data. When you have five stations that corroborate each other you'd think that was fairly solid, but apparently it didn't match what some model said, so the data were shifted to agree with some other (unreferenced) stations.

That is not the right way to do comparisons. Aside from the obvious bias it builds into the data you feed your model, it means you no longer have any real-world measurement to compare against. The correct approach is to model how the readings should vary by location, run your model, and then predict what the local temperature averages ought to be and compare those predictions to the real-world measurements. Keep your model and your corrections separate, and never confuse raw data with corrected data. If you need to weight data from different locations differently, that's fine, but never show the weighted plots and announce that they can stand in for the real-world record.
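A rough sketch of that separation, under assumed names and invented numbers (the stations, readings, and model values are all hypothetical):

```python
# The raw readings stay untouched; any site-specific adjustment lives
# inside the model, and the comparison is model-versus-measurement.

raw_readings = {            # deg C, straight off the instruments (invented)
    "station_A": 14.2,
    "station_B": 13.1,
    "station_C": 15.0,
}

def model_prediction(station):
    """Stand-in for whatever the climate model predicts at each site."""
    predicted = {"station_A": 14.0, "station_B": 13.5, "station_C": 14.8}
    return predicted[station]

# Compare the model to the raw data, never the other way around.
for station, reading in raw_readings.items():
    residual = reading - model_prediction(station)
    print(f"{station}: raw={reading:.1f}  model={model_prediction(station):.1f}  "
          f"residual={residual:+.1f}")
```

The point of the sketch is only the direction of the arrow: corrections and weights belong to the model's prediction step, so the raw measurements remain available as an independent check.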

Let me give an example. Suppose I have thermometers in the yard: one on the driveway, one under the tree, and one next to the garden. The driveway one will be consistently hotter than the rest, and the one under the tree cooler: say 10 degrees hotter and 5 degrees cooler. If I want to get some estimate for what the variation is from month to month I can take readings, subtract 10 from the driveway reading and add 5 to the tree reading and average them--and look at how that varies. That works fine, and nobody gets confused so long as I show what I'm doing. What they did with their plots was like adding 5 degrees to the tree thermometer plots and showing that as though this was what was actually read. That's confusing and obviously the wrong thing to show people. What they actually did with the data seems to be worse, since there's no way to figure out where the correction came from.
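The backyard example in numbers, with the readings invented and only the offsets (+10 for the driveway, -5 for the tree) taken from the description above:

```python
# Three thermometers with known site offsets relative to the yard as a whole.
readings = {"driveway": 82.0, "tree": 67.0, "garden": 72.0}   # deg F, invented
site_offset = {"driveway": +10.0, "tree": -5.0, "garden": 0.0}

# Legitimate use: remove the known offsets, average, and say that's what you did.
adjusted = {name: t - site_offset[name] for name, t in readings.items()}
yard_estimate = sum(adjusted.values()) / len(adjusted)
print(f"Yard estimate (offsets removed, as documented): {yard_estimate:.1f} F")

# The objectionable move: presenting an adjusted series as if it were the raw
# reading, e.g. reporting the tree thermometer as 72 F when it actually read 67 F.
silently_adjusted_tree = readings["tree"] - site_offset["tree"]
print(f"Tree as read: {readings['tree']:.1f} F; "
      f"tree after silent adjustment: {silently_adjusted_tree:.1f} F")
```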

I do not like to think about what kind of reception that sort of analysis would get here. No explanation of the correction terms, and showing corrected data instead of real data? April Fools, right?
