Monday, November 23, 2009

What's in the climate models?

Everyone has been talking about the emails in the Hadley/CRU data leak. I mentioned over the weekend that people would be looking at the code and the data, and I said it would be a while before we saw much. It didn't take long: we're already seeing all sorts of interesting revelations from the code.

The Climate Model is coded to intentionally suppress non-warming results

Michael Mann (popularizer of the Hockey Stick) personally cooked the code:

Busted- Phil Jones Doesn’t Recall Divergence

I’ve just completed Mike’s Nature trick of adding in the real temps to each series for the last 20 years (ie from 1981 onwards) amd [sic] from 1961 for Keith’s to hide the decline.

Check out this quote from the code – kudos to Steve Neil for digging it out.

function mkp2correlation,indts,depts,remts,t,filter=filter,refperiod=refperiod,$
datathresh=datathresh
;
; THIS WORKS WITH REMTS BEING A 2D ARRAY (nseries,ntime) OF MULTIPLE TIMESERIES
; WHOSE INFLUENCE IS TO BE REMOVED. UNFORTUNATELY THE IDL5.4 p_correlate
; FAILS WITH >1 SERIES TO HOLD CONSTANT, SO I HAVE TO REMOVE THEIR INFLUENCE
; FROM BOTH INDTS AND DEPTS USING MULTIPLE LINEAR REGRESSION AND THEN USE THE
; USUAL correlate FUNCTION ON THE RESIDUALS.
;

pro maps12,yrstart,doinfill=doinfill
;
; Plots 24 yearly maps of calibrated (PCR-infilled or not) MXD reconstructions
; of growing season temperatures. Uses “corrected” MXD – but shouldn’t usually
; plot past 1960 because these will be artificially adjusted to look closer to
; the real temperatures.

;
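For readers who don't speak IDL, the header comment on mkp2correlation describes a standard way to get a partial correlation: regress both series on the series you want to hold constant, then correlate what is left over. Here is a minimal Python sketch of that idea. It is not the CRU routine: the function names are mine, and it ignores the filter, refperiod and datathresh arguments and any missing-value handling the original may have.

```python
from statistics import mean


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residuals(y, controls):
    """Residuals of y after least-squares regression on the controls plus an intercept."""
    cols = [[1.0] * len(y)] + [list(c) for c in controls]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    return [yi - sum(b * c[i] for b, c in zip(beta, cols)) for i, yi in enumerate(y)]


def correlate(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_correlation(indts, depts, remts):
    """Correlation of indts and depts with the influence of every series in remts removed."""
    return correlate(residuals(indts, remts), residuals(depts, remts))
```

Here remts is a list of series, mirroring the (nseries,ntime) array in the comment. With a single control series the result should agree with the textbook partial correlation formula; the sketch will fail with a division by zero if the control series are exactly collinear.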

"Divergence" is a big problem for the climate change alarmists. For the last 50 years, the temperatures reconstructed from the proxy data sets they use (tree rings, etc.) have been lower than the readings we get from thermometers. How to address this problem? Code it out of the models.

Poor programming implies no Quality Control

It looks like at least some of the models were written by people who don't know the basics of how to code:

The bit that made me laugh was this bit. Anyone into programming will burst out laughing before the table of numbers

Quote:
17. Inserted debug statements into anomdtb.f90, discovered that
a sum-of-squared variable is becoming very, very negative!

...

For those unfamiliar with this problem, computers use a single “bit” to indicate sign. If that is set to a “1” you get one sign (often negative, but machine and language dependent to some extent) and if it is “0” you get another (typically positive).

OK, take a zero, and start adding ones onto it. We will use a very short number (only 4 digits long, each can be a zero or a one. The first digit is the “sign bit”). I’ll translate each binary number into the decimal equivalent next to it.

0000 zero
0001 one
0010 two
0011 three
0100 four
0101 five
0110 six
0111 seven
1000 negative (may be defined as = zero, but oftentimes
     defined as being as large a negative number as you can
     have via something called a 'complement'). So in this
     case NEGATIVE seven
1001 NEGATIVE six
1010 NEGATIVE five (notice the 'bit pattern' is exactly the
     opposite of the "five" pattern... it is 'the complement').
1011 NEGATIVE four
1100 NEGATIVE three
1101 NEGATIVE two
1110 NEGATIVE one
1111 NEGATIVE zero (useful to let you have zero without
     needing to have a 'sign change' operation done)
0000 zero

Sometimes the 1111 pattern will be “special” in some way. And there are other ways of doing the math down at the hardware level, but this is a useful example.
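The table in the quote follows what is usually called ones' complement, where a negative number is the bit-flipped version of the positive one. Most current hardware uses two's complement instead, where 1000 is negative eight and 1111 is negative one. A small Python sketch that decodes the same 4-bit patterns both ways (the function names are mine, purely for illustration):

```python
def ones_complement(bits):
    """Decode a bit string as ones' complement, the scheme used in the quoted table."""
    value = int(bits, 2)
    if bits[0] == "0":
        return value
    # Negative: flip every bit, then negate. "1111" flips to 0, i.e. "negative zero",
    # which Python simply reports as 0.
    mask = (1 << len(bits)) - 1
    return -(value ^ mask)


def twos_complement(bits):
    """Decode a bit string as two's complement, the scheme most current hardware uses."""
    value = int(bits, 2)
    if bits[0] == "0":
        return value
    return value - (1 << len(bits))


# Print every 4-bit pattern with both interpretations side by side.
for n in range(16):
    pattern = format(n, "04b")
    print(pattern, ones_complement(pattern), twos_complement(pattern))
```

Either way the lesson is the same: counting up past 0111 lands you in negative territory.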

You can see how adding a digit repeatedly grows to a large value (the limit) then “overflows” into a negative value. This is a common error in computer math and something I was taught in the first couple of weeks of my very first programming class ever. Yes, in FORTRAN.
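To see how a sum of squares can go negative, here is a minimal Python sketch. Python integers never overflow, so the sketch wraps them by hand, assuming a 32-bit two's-complement accumulator. That width is my assumption; the quoted log entry doesn't say what type the variable in anomdtb.f90 actually was, and Fortran itself doesn't promise wraparound, it is just what common hardware does.

```python
def wrap_int32(value):
    """Reduce a Python int to what a 32-bit two's-complement integer would hold."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def sum_of_squares_int32(values):
    """Accumulate squares in a simulated 32-bit signed integer (the failure mode)."""
    total = 0
    for v in values:
        total = wrap_int32(total + v * v)
    return total


def sum_of_squares_float(values):
    """Accumulate squares in floating point, which has far more headroom."""
    return sum(float(v) * float(v) for v in values)


print(sum_of_squares_int32([30000, 30000]))         # 1800000000, still fits
print(sum_of_squares_int32([30000, 30000, 30000]))  # -1594967296, wrapped past 2**31 - 1
print(sum_of_squares_float([30000, 30000, 30000]))  # 2700000000.0
```

Three modest values are enough to push the total past the 32-bit limit, which is why the usual fix is to accumulate in a wider or floating-point type.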

This is, quite frankly, a complete n00b error. Anybody working in industry who made this mistake would find himself in the "bottom 5%" group come annual review time, and would very likely get a suggestion to look for work elsewhere.

OK, so the University of East Anglia has some bad programmers. So what? Well, this means that large parts of the climate models have never had a design review or code review. This means that the model is essentially unaudited for correctness. This means that there's no assurance that it produces output that's sane - even discounting for Dr. Jones's code to "fix" divergence.

If I could only ask one question at a Senate Hearing, mine would be "What Quality Control processes do you have for climate model software development?" 'Cause it looks like there aren't any.

The programmers don't understand what the data is, and what it is for

Maintenance coding (maintaining a program someone else wrote) isn't any fun, not least because the person who wrote it may not have documented what the parts are and what they do. It looks like things are no different at CRU:

7. Removed 4-line header from a couple of .glo files and loaded them into
Matlab. Reshaped to 360r x 720c and plotted; looks OK for global temp
(anomalies) data. Deduce that .glo files, after the header, contain data
taken row-by-row starting with the Northernmost, and presented as '8E12.4'.
The grid is from -180 to +180 rather than 0 to 360.
This should allow us to deduce the meaning of the co-ordinate pairs used to
describe each cell in a .grim file (we know the first number is the lon or
column, the second the lat or row - but which way up are the latitudes? And
where do the longitudes break?

There is another problem: the values are anomalies, wheras the 'public'
.grim files are actual values. So Tim's explanations (in _READ_ME.txt) are
incorrect..

8. Had a hunt and found an identically-named temperature database file which
did include normals lines at the start of every station. How handy - naming
two different files with exactly the same name and relying on their location
to differentiate!
Aaarrgghh!! Re-ran anomdtb:
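
The layout Harry deduces in item 7 is simple enough to write down, which is what makes the lack of documentation so galling. A minimal Python sketch of a reader, going only on what the log says: a 4-line header, values in Fortran '8E12.4' records (eight fields of twelve characters each), 360 rows by 720 columns, northernmost row first, longitudes running from -180 to +180. The function names are mine, and treating each value as sitting at the centre of a half-degree cell is my assumption, not something the log states.

```python
def parse_glo(lines, header_lines=4, nrows=360, ncols=720, width=12):
    """Read fixed-width E12.4 fields after the header and reshape them row by row."""
    values = []
    for line in lines[header_lines:]:
        line = line.rstrip("\n")
        for start in range(0, len(line), width):
            field = line[start:start + width].strip()
            if field:
                values.append(float(field))
    if len(values) != nrows * ncols:
        raise ValueError("expected %d values, found %d" % (nrows * ncols, len(values)))
    # Row 0 is the northernmost band, per the log's deduction.
    return [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]


def cell_centre(row, col, nrows=360, ncols=720):
    """Latitude and longitude of a cell centre (assumes values sit at cell centres)."""
    lat = 90.0 - (row + 0.5) * 180.0 / nrows
    lon = -180.0 + (col + 0.5) * 360.0 / ncols
    return lat, lon
```

Something like `grid = parse_glo(open(path).readlines())` would then give a 360 x 720 list of lists, with `cell_centre(0, 0)` at 89.75N, 179.75W under the assumptions above.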


Uhm... So they don't even KNOW WHAT THE ****ING DATA MEANS?!?!?!?!

What dumbass names **** that way?!

Talk about cluster****. This whole file is a HUGE ASS example of it. If they deal with data this way, there's no ****ing wonder they've lost **** along the way. This is just unbelievable.

And it's not just one instance of not knowing what the hell is going on either:

Quote:
The deduction so far is that the DTR-derived CLD is waaay off. The DTR looks OK, well
OK in the sense that it doesn;t have prominent bands! So it's either the factors and
offsets from the regression, or the way they've been applied in dtr2cld.

Well, dtr2cld is not the world's most complicated program. Wheras cloudreg is, and I
immediately found a mistake! Scanning forward to 1951 was done with a loop that, for
completely unfathomable reasons, didn't include months! So we read 50 grids instead
of 600!!! That may have had something to do with it. I also noticed, as I was correcting
THAT, that I reopened the DTR and CLD data files when I should have been opening the
bloody station files!! I can only assume that I was being interrupted continually when
I was writing this thing. Running with those bits fixed improved matters somewhat,
though now there's a problem in that one 5-degree band (10S to 5S) has no stations! This
will be due to low station counts in that region, plus removal of duplicate values.
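
The "50 grids instead of 600" mistake in that quote is easy to picture. A minimal Python sketch, assuming the file holds one grid per month and starts in January 1901 (the start year is my inference from the 50 and 600 figures, not something the log states); the function names are hypothetical:

```python
MONTHS_PER_YEAR = 12


def monthly_grids(start_year, end_year):
    """Labels (year, month) for a file holding one grid per month."""
    return [(y, m) for y in range(start_year, end_year + 1)
            for m in range(1, MONTHS_PER_YEAR + 1)]


def skip_buggy(start_year, target_year):
    """Advance one grid per year: the loop that forgot about months."""
    position = 0
    for _year in range(start_year, target_year):
        position += 1
    return position


def skip_fixed(start_year, target_year):
    """Advance one grid per month of every year before the target year."""
    position = 0
    for _year in range(start_year, target_year):
        for _month in range(MONTHS_PER_YEAR):
            position += 1
    return position


grids = monthly_grids(1901, 1960)
print(skip_buggy(1901, 1951), grids[skip_buggy(1901, 1951)])  # 50 (1905, 3)
print(skip_fixed(1901, 1951), grids[skip_fixed(1901, 1951)])  # 600 (1951, 1)
```

Under those assumptions the buggy loop leaves the reader sitting in March 1905 while the rest of the program believes it is at January 1951.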


I've only actually read about 1000 lines of this, but started skipping through it to see if it was all like that when I found that second quote above somewhere way down in the file....

CLUSTER.... ****. This isn't science, it's gradeschool for people with big data sets.

What does this mean, in non-technical terms?

It explains why CRU would not release their code and data, even under Freedom of Information Act requests. They knew that the quality was terribly shoddy, and took the chance that they could successfully stonewall, rather than have their climate models be exposed as junk.

And the stonewalling worked, until someone on the inside leaked their data.

5 comments:

wolfwalker said...

Regarding that sign-bit error, you wrote: "Anybody working in industry who made this mistake would find himself in the 'bottom 5%' group come annual review time, and would very likely get a suggestion to look for work elsewhere."

I think you need an extra phrase here: anybody who made this error and didn't correct it.

Something I can't tell, either from your link or from his source-link: when was the error made, and was the errant code ever used in production? Even an expert programmer could make such a mistake if hurried, or exhausted, or working in an unfamiliar language. But most would catch it themselves in alpha-testing before anyone else ever saw the code.

Borepatch said...

Wolfwalker, that's an excellent point. People make coding errors all the time - this is why you have design and code reviews, and QA testing.

It doesn't look like that was done here.

wolfwalker said...

No, it doesn't. With some help from Dogpile I finally tracked down the original file that these guys are talking about. You can find it here, among other places: http://www.anenglishmanscastle.com/HARRY_READ_ME.txt

Reading it is .... I have no words. Let's say "more horrifying than Stephen King on his worst days." For a programmer, at least. I felt like an English teacher reading the entries to this year's Bulwer-Lytton Fiction Writing Contest.

The file is a memo, more or less. It's obviously a QA catalog: "Harry" was going through a bunch of programs methodically and documenting all the problems he found, then trying to fix them. Some of the errors are ... well, mind-numbing. Duplicate filenames. Version control problems. Undocumented switches and parameters. Data overflow errors. Variables that weren't properly initialized.

If "Harry" was looking at production code, and it appears that he was, then the implications are staggering. These are not just firing offenses. These are criminal offenses. Or if they aren't, they should be.

Borepatch said...

Wolfwalker, there are other files in the archive than just that one.

There hasn't been any QA done on these models, probably ever. It's astonishing.

wolfwalker said...

Oh, of course there are other files. I was talking specifically about the file that included the description of the sign-bit error.

I don't know about no QA at all (see, I'm trying really really hard to be fair). What it looks like to me is a classic case of white-coats who didn't have a real programmer available so they tried to do it themselves, and rapidly got in way over their heads. Documentation is always the first thing to suffer in such an environment, and QA is usually the second. Non-programmers just don't understand why those things are necessary, so they don't do them. Or they don't do enough of them.

In any case, it's really bad. Nothing coming out of East Anglia CRU is trustworthy. It all needs to be disassembled down to the ground and completely rebuilt.