Companies Aren't Immune From IT Holes
Lest we need any reminder that the government is not the only organization capable of big, messy information-technology failures, Felix Salmon has laid out some Technicolor meltdowns in the private sector (and kindly references my upcoming book about failure and recovery, which Felix is reading in galleys). He references James Reason’s Swiss cheese model, which I write about in the book. Broadly, the idea is that you may have security layers with a bunch of holes in them, but if you layer enough of them together, you’re still pretty well protected, because the odds of the holes lining up on seven or eight layers are still pretty slim.
There’s a corollary to that, however: The more layers of Swiss cheese you have, the harder it can be to tell if the holes are lining up. Maybe the holes are lined up on all but one layer -- and then that layer shifts, and you suddenly realize that you’re disastrously unprotected. One way to think about this is modern health care: We’ve got all sorts of backup mechanisms if the body fails, even catastrophically. If you get an autoimmune disease, we’ll suppress your immune system; if your kidney fails, we’ll transplant a new one into you. Hips crippled with arthritis? We’ll give you artificial ones.
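The arithmetic behind both claims can be sketched in a few lines. This is a back-of-the-envelope simulation, not anything from Reason's work -- all the numbers are illustrative. With independent layers, a threat almost never finds a hole through all seven; if the holes are correlated (effectively one shared weak spot), the protection evaporates:

```python
import random

random.seed(0)

def breach_probability(num_layers, hole_fraction, correlated, trials=100_000):
    """Estimate the chance a threat finds an aligned hole through every layer."""
    breaches = 0
    for _ in range(trials):
        if correlated:
            # Correlated layers: one shared weak spot. If the threat hits it,
            # every layer has a hole in the same place.
            breaches += random.random() < hole_fraction
        else:
            # Independent layers: the threat must find a separate hole in each.
            breaches += all(random.random() < hole_fraction
                            for _ in range(num_layers))
    return breaches / trials

# Seven layers, each 30% holes:
independent = breach_probability(7, 0.3, correlated=False)  # roughly 0.3**7, about 0.02%
correlated = breach_probability(7, 0.3, correlated=True)    # roughly 30% -- one layer's worth
```

Seven independent layers that are each 30 percent holes still block the threat 99.98 percent of the time; seven perfectly correlated layers protect you no better than one.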
But there’s a single layer standing between all this stuff and disaster: antibiotics. There’s a reason that almost all of the surgery in early-20th-century fiction is “lifesaving”; there wasn’t much enthusiasm for surgical treatment of conditions that weren’t fatal or disfiguring, because the surgery itself was so risky. Without antibiotics, the immunosuppression required to do a transplant or halt the progress of rheumatoid arthritis would be far too dangerous. So would much other surgery, especially on the elderly. You’d have to be a fool or a movie star to get cosmetic surgery, as subsequent infection could kill you. Abortions were also extremely risky; all those horrifying statistics about the number of women who used to die from back-alley abortions come from the era before antibiotics. Even before legalization, the risk of dying from a botched abortion had declined to a tiny fraction of what it was in the 1920s. Even in clean and careful conditions, putting a surgical instrument deep inside your body carries a high risk of infection. If the antibiotics went away, we’d suddenly realize that a lot of our other safety layers were absolutely full of holes.
Worse, we’d realize that we’d gotten lazy about watching the other layers, because the antibiotics were always there to bail us out. Human beings have a tendency to think that if they’ve been doing something for a while and nothing bad has happened, then it must be safe -- and so they get lazy about other security procedures. You saw this in the housing bubble: The longer it went on, the more people decided that they’d better get in on the sure thing now. Regulators and bankers had oodles of data suggesting that sustained nationwide declines in housing prices were as rare as hen's teeth. Asset markets tend to be mean-reverting, so the longer all this went on, the more likely a big correction became. But psychologically, the longer it went on, the less likely a big correction seemed to the great mass of ordinary people who did not have a morbid habit of reading financial histories of the 1930s.
Felix notes an example of this at Knight Capital Group, which last year managed to lose a very large boatload of money on a very simple programming error. Presumably to save time when modifying their system, they repurposed an old flag in the code that had been used to activate a trading routine. That trading routine was no longer used, so this seemed safe. However, when they rolled the new stuff out, someone failed to copy the new code to just one of the eight servers they were using. That server interpreted the flag the old way and activated the old trading routine, effectively bankrupting the company. Oops.
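The failure mode is easy to sketch in code. This is a hypothetical illustration of the mechanism described above, not Knight's actual system -- all the names and the flag semantics are invented. The hazard is one flag with two meanings: the servers that got the new deployment map it to the new routine, while the one stale server still maps it to the legacy one:

```python
# Invented names: a flag that once triggered a legacy test routine has been
# repurposed to trigger new routing code.
LEGACY_BEHAVIOR = "run_old_test_routine"
NEW_BEHAVIOR = "run_new_router"

class Server:
    def __init__(self, name, deployed_new_code):
        self.name = name
        self.deployed_new_code = deployed_new_code

    def handle(self, flag_set):
        if not flag_set:
            return "idle"
        # Same flag, two meanings: a server running the old binary still
        # interprets it as "activate the legacy routine."
        return NEW_BEHAVIOR if self.deployed_new_code else LEGACY_BEHAVIOR

# Eight servers; the deployment silently misses one of them.
servers = [Server(f"srv{i}", deployed_new_code=(i != 7)) for i in range(8)]
behaviors = {s.name: s.handle(flag_set=True) for s in servers}
# Seven servers run the new router; srv7 quietly resurrects the old routine.
```

No single check failed here in a dramatic way -- the flag was valid, the deployment "succeeded," and the system ran. Only the alignment of the two holes (reused flag, incomplete rollout) produced the disaster.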
What’s interesting to me is that there were multiple corners cut: They repurposed an old flag in the code, didn’t run either human or automatic checks to make sure that everything was where it was supposed to be, and ignored e-mails the system sent to tell them that it was trying to run the old code. Then, the fix they applied -- reportedly rolling the new code back off the servers that were working, which doesn’t seem to have been thought through very well -- made things worse instead of better.
I’ve been in the departments where these decisions get made. A malfunctioning trading or market data system is really, really hard to shut down: Traders operate in seconds or minutes, and freezing them in place, without access to data or the ability to place electronic trades, can mean losing a lot of money. On the other hand, so can bad data and bad trades. The shops that choose to apply fixes on the fly usually get away with it; the shops that choose to shut down the server for an extended time, amid the anguished screams and threats of the traders, usually get no benefit except a lot of angry recriminations from the head of trading. On the other hand, their servers never drive the company into bankruptcy, which sometimes happens with the other kind. The problem is, the more times you skate by the bankruptcy scenario, the more you figure you must be smart rather than lucky.