Software Hell

Glitches cost billions of dollars and jeopardize human lives. How can we kill the bugs?

June 10 of this year is a day the software industry would love to forget. Auction site eBay suffered a 22-hour system crash--the longest, but not the last, in a series of crippling software-related outages. The magnitude of the crash signaled more than just the temporary interruption of Beanie Baby trading. It stoked concerns across the computer industry that software, in its current form, may not be a match for the voracious demands of the information economy.

EBay's crash was traced to a glitch in software from Sun Microsystems, resulting in the corruption of information in an Oracle Corp. database. Sun chairman and CEO Scott McNealy acknowledges that there was a known bug--as well as a standard fix that eBay failed to make. But as the auction site's managers struggled to keep up with phenomenal demand for its services, the patch had fallen through the cracks. Given eBay's explosive growth, "all of the suppliers would agree that we're all a little winded," McNealy says.

EBay's software struggles resonate loudly across the whole convoluted and turbulent world of computer software. Regulatory authorities are investigating thousands of consumer complaints about glitches plaguing every conceivable type of computer service. But no agency on earth can quell the havoc being wreaked around the planet by ubiquitous software that is too slapdash, buggy, complex, or ill-conceived to accomplish the myriad tasks under its control. The Year 2000 bug has focused attention on the topic. But Y2K tells only a tiny fragment of this story.

With the low points of just the past 24 months, you could build a Software Hall of Infamy. In a fast-flowing river of woe, software bugs--along with viruses and security loopholes--have plagued most new releases of Microsoft Windows and Office products, Netscape Navigator, Intuit's Quicken, and countless other personal-computer applications. Glitches have crippled online auctions and trading sites and delayed product shipments at Hershey, Whirlpool, and Handspring, maker of Visor palm computers. All told, bad software cost U.S. businesses $85 billion in lost productivity last year, according to Jim Johnson, president of market researcher The Standish Group in Dennis, Mass.

At U.S. government agencies, bugs are a public disgrace--when they are not a menace. The Federal Aviation Administration, the Social Security Administration, the Immigration & Naturalization Service, and the Internal Revenue Service are all struggling with balky software--at appalling costs to taxpayers. And more than dollars are at stake. Over the past two decades, bad software has been implicated in plane crashes, road and rail accidents, and malfunctions of medical gear that resulted in death--lending ghoulish new meaning to the term "killer app." Recent glitches have knocked out AT&T's high-speed phone and data networks and interrupted emergency service in New York. "Software easily rates among the most poorly constructed, unreliable, and least maintainable technological artifacts invented by man," says Paul Strassmann, a former chief information officer for Xerox Corp. and for the Defense Dept. who now heads a private consulting company. Most software executives share at least some of this dismay.

To be fair, software also shares credit for the most spellbinding advances of the 20th century. In today's world, banks, hospitals, and space missions would be inconceivable without it. The challenge of the next century will be to exterminate the most pernicious bugs and to bring software quality to the same level we expect from cars, televisions, and other relatively dependable hunks of hardware.

MIXED BLESSING. In this, there are glimmers of hope. A movement called "open source" draws programmers together around the globe to continuously debug major programs. The Internet provides a platform for such collaboration and an instant feedback channel when things go wrong. Governments have also joined hands with industry to impose greater rigor on software development, hoping to transform it from sorcery to science.

Unfortunately, none of these developments guarantees quick deliverance for sufferers in Software Hell. The Net itself is a mixed blessing: Its business culture values speed over quality in software development. What's more, on the Net, the world is interconnected. Vulnerabilities to both bugs and viruses are thus multiplying geometrically, as are the damages and costs. Peter G. Neumann, a computer scientist at SRI International in Menlo Park, Calif., says the world is now trapped in a downward spiral when it comes to software quality and complexity. "Improvements in some quarters are followed by increased risks in others," he says. "And new systems introduce more problems than the systems they replace."

For most computer users, the horrors of software begin at home. The PC may be the most versatile home gadget ever developed--but it is also the buggiest and the most complex. The bugs, in fact, are built into the business model: Software companies make money on upgrades, so there is little incentive to achieve a perfect first release.

Complexity is also built into the system. You see it each time you pop in a new CD-ROM and the PC asks permission to rewrite helper programs such as DirectX and QuickTime that are stored on your hard drive. What's the right answer? It depends. "Yes" may prevent you from loading your old CD-ROMs, while "No" keeps you from running the new one.

There are no simple remedies for such absurdities, says Kirk Kirschenbauer, a director of software development at The Learning Co., an educational software division of Mattel Inc. "Two months after a purchase, the number of ways that different device drivers and pieces of software could be interacting is infinite," he says. The Learning Co. rigorously tests its programs on PCs configured in different ways. But there is no way it can predict the unique configuration in each home.

Now, the PC software industry is reaching into new domains, such as cell-phone networks and automobile electronics. Historically, such products have depended on high-quality, embedded software that the user never touched or tinkered with. All that could change, however, when cell phones become portable Web browsers. Soon, phones may be running some of the same sorts of slapdash applications as a PC--and crashing like one, too. Cars could be next. Right now, auto electronic systems don't flash the worst system error--the Blue Screen of Death. But that could change when off-the-rack applications--from mapping software to speech-activated Net browsing--are humming on the dashboard, controlled by PC-derived operating systems.

BILLIONS AND BILLIONS. PC software is just one tier of Software Hell. Step into the world of business--especially the world of Web-centric computing--and the software dilemma becomes even more troublesome. Compared with ordinary consumers, large Web sites such as eBay, E*trade, and Charles Schwab require a higher grade of hardware and software to manage billions of transactions. For this, they turn to the likes of IBM, Sun Microsystems, and Hewlett-Packard for systems that won't die at a critical juncture.

The hardware generally performs as billed. But the software is often a disaster. For example, about 7.5 million investors in the U.S. have flocked to online brokerages, according to the North American Securities Administrators Association Inc. (NASAA) in Washington. This year, the U.S. Securities & Exchange Commission has fielded more than 20,000 investor complaints of outages, errors, fund-transfer delays, and other mishaps plaguing computer systems at Ameritrade, E*trade, Charles Schwab, and other online brokers. That's up from about 1,000 last year. Most of these glitches are linked, in one form or another, to software issues. The programs involved were designed, developed, and debugged with "mission-critical" enterprise computing in mind. But that hasn't saved them.

Eileen Allen, 56, learned this the hard way. Earlier this year, the self-described stay-at-home mom in Maple Valley, Wash., purchased 13 shares of Amazon.com through online broker Ameritrade at 186, its all-time high at that time. As Allen tells it, a glitch in Ameritrade's computer system then converted her order into 186 shares at 190. She quickly tried to unwind the trade but couldn't get through to Ameritrade by e-mail or phone. Over the next 10 days, Amazon nosedived to 104. Ameritrade says the problem was a user error, not a glitch. Either way, Allen and her husband had to cash in an IRA, liquidate other savings, and sell the Amazon stock--for a net loss of more than $15,000.

Whichever side is right, online crashes are no anomaly. The most fail-safe computer systems turn unreliable when they are thrown into untested combinations. Even when individual office-software products aren't buggy, corporate computing environments are so complex that they are inherently unreliable. Typically, these systems are unholy agglomerations of mainframes and minicomputers, PCs, Macs, and workstations, in thousands of unique configurations, running dozens of different operating systems that were neither designed to cooperate with one another nor tested in combination. Atop these juggernauts run thousands of different software applications, coexisting in countless versions and combinations.

To stitch it all together, most big companies depend on layer upon layer of hand-built, poorly documented computer code, which may conceal a variety of ticking time bombs--Y2K being just the most famous. According to the U.S. Defense Dept. and the Software Engineering Institute at Carnegie Mellon University, there are typically 5 to 15 flaws in every 1,000 lines of code--the cryptic software instructions that make sense only to computers and programmers. Just tracking down each bug eats up about 75 minutes, according to a five-year Pentagon study. And fixing them takes two to nine hours each. On the outside, that's 150 hours, or roughly $30,000, to cleanse every 1,000 lines.
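The arithmetic behind that estimate can be checked with a few lines of Python. This is a rough sketch, not part of the Pentagon study: the flaw counts and times are the figures quoted above, while the hourly rate is an assumption inferred from "150 hours, or roughly $30,000" and appears nowhere in the source.

```python
# Figures quoted in the article. The hourly rate is NOT stated in the
# source; it is inferred from "150 hours, or roughly $30,000".
FLAWS_PER_KLOC = (5, 15)       # flaws per 1,000 lines of code (low, high)
FIND_HOURS = 75 / 60           # about 75 minutes to track down each bug
FIX_HOURS = (2, 9)             # two to nine hours to fix each bug (low, high)
IMPLIED_RATE = 30_000 / 150    # dollars per hour (assumption)


def cleanup_hours(flaws: int, fix_hours: float) -> float:
    """Hours needed to find and fix every flaw in 1,000 lines of code."""
    return flaws * (FIND_HOURS + fix_hours)


best_case = cleanup_hours(FLAWS_PER_KLOC[0], FIX_HOURS[0])
worst_case = cleanup_hours(FLAWS_PER_KLOC[1], FIX_HOURS[1])

print(f"best case:  {best_case:.2f} hours, ${best_case * IMPLIED_RATE:,.0f}")
print(f"worst case: {worst_case:.2f} hours, ${worst_case * IMPLIED_RATE:,.0f}")
```

Under these assumptions the worst case works out to a little over 150 hours per 1,000 lines, in line with the "on the outside" figure, while the best case is roughly a tenth of that.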

Integrated enterprise software from SAP, BAAN, PeopleSoft, and other suppliers may remedy some aspects of the problem--if only because they tie operations together in one standard and heavily tested suite. But such programs don't guarantee a smooth sail, as recent mishaps in the enterprise computing arena demonstrate. A year ago in August, defunct drug distributor FoxMeyer Corp. sued SAP for $500 million, claiming the company's R/3 software could not cope with FoxMeyer's high volume of orders. No ruling has been made in the case. This fall, Whirlpool held SAP partly to blame for delays in appliance shipments to distributors and retailers. And SAP software glitches impaired Hershey Foods Corp.'s ability to meet Halloween commitments with distributors and retailers.

CEO Kevin McKay, who heads SAP's American operations, blames complexity. Each customer, he says, has a different culture and a different approach to installing software. "You can't possibly replicate the myriad of ways in which companies will start to use the software," he says. "No testing is comprehensive enough to get at this."

FAST AND DIRTY. The situation hasn't always been so dire. There was a time, in the 1970s, when companies could rely on nearly bulletproof software, supplied by mainframe computer companies such as IBM and Sperry. Not only was the code dependable, but the hardware environment was relatively homogeneous. Less complexity meant fewer glitches. Then came computer "downsizing" and the so-called client-server revolution, in which thousands of businesses shifted operations from mainframes to distributed networks of workstations and PCs. This model spelled the demise of insulated, bug-free software and left corporations exposed to the fast-and-dirty culture of PC software.

Despite the patches that Microsoft diligently posts on its Web site, quality issues in the PC world are so disturbing that some managers seek to isolate their key systems from Windows code. A peek into the computer room at the New York Clearing House (NYCH) in midtown Manhattan explains why. About $1.2 trillion in electronic interbank payments are cleared each day by two Unisys Corp. mainframes. If one of these systems is down for a day, banks consider it a major international incident. The software was developed for operations that must not fail. The code is virtually bug-free and fully insulated from the messy world outside. For the past seven years, NYCH has clocked just 0.01% downtime.

NYCH also has some PC servers running Microsoft's Windows NT, which it uses mostly for simple communications programs. These systems are another story. They crash on a regular basis, says George F. Thomas, director of information systems. That's why Thomas was alarmed, 18 months ago, when Unisys began building some NT functions into its latest mainframe operating system. Eventually that system will touch NYCH's critical electronic transactions--exposing them to the threat of a crash. "We have extreme concerns," Thomas says. "We want to know where each of these NT components is going."

In its noncritical systems running NT, NYCH is particularly dismayed by the paucity of tools for diagnosing and fixing problems. When an NT box goes down, "it's hard to tell what went wrong," says NYCH technical services vice-president Albert G. Wood. Microsoft's service culture makes it worse, he says. Unless you are a very big Microsoft account or pay for a pricey maintenance contract, don't expect a quick response, even in a catastrophic crash, he warns. "You have to prove you're using the software properly, then go through all these gyrations just to get them to look at it," says Wood.

Microsoft doesn't deny that it has relied on customers to help debug its PC products. Chief Operating Officer Robert Herbold lays the blame on a broader market phenomenon. Intense competition forces vendors to rush products out quickly and is also responsible for the increasing size and intricacy of almost all software. "Our challenge is to make [software] simpler," he says.

COLOSSUS. That could help with home applications. But analysts think future bugs lurk in Microsoft's pipeline. The next iteration of Windows NT, due next year, is a colossus called Windows 2000 with an unprecedented 30 million lines of code. The bigger the program, the greater the likelihood of bugs. But Jim Allchin, head of Windows development and marketing, says the program's core is no larger than a Unix operating system. Microsoft has spent 500 people-years to make the code reliable. "We're still not perfect," says Allchin. "But I hope customers will appreciate it."

Microsoft, in any case, is not the source of all bugs. Most giant infrastructure projects continuously battle complexity, and they sometimes lose. AT&T was bloodied in April, 1998, when a Cisco Systems Inc. switch on the phone giant's high-speed fiber network suddenly flooded other nodes with error messages. These propagated across dozens of other switches, knocking out thousands of bank ATMs and credit-card readers in stores. Cisco accepted the blame, and the problem was fixed. But nobody can promise there won't be similar events. "You can do all the analysis you want," says Daniel Sheinbein, vice-president for network architecture and development at AT&T. "But there's always a case of something arising that you didn't anticipate."

Bad software jeopardizes staggering sums of money. A drawn-out overhaul of the U.S. Internal Revenue Service's computer systems cost Americans about $4 billion. Bad software has hobbled several of NASA's most expensive missions. The U.S. auto industry loses about $1 billion a year because of software incompatibilities, according to the National Institute of Standards & Technology.

The most alarming facet of software in the 21st century, however, is the interconnectedness of complex systems. Over the Net, software now links computers that were once insulated from one another, stacking up additional layers of complexity that breed software bugs. Telecommunications and bank trading networks are both global and cross-connected, as are the myriad traffic and commerce subsystems that rely on them. Even the computer backbones of utility grids, military bases, and weapons laboratories--once isolated islands of fail-safe mainframes--are exposed to the global Internet at countless interconnects. And everywhere across this electronic infrastructure there are weak links in the form of personal computers running software that was rushed to market without adequate testing or debugging.

The recent rash of virus attacks further highlights the dangers of an interconnected world: When the Internet holds sway, vulnerability to hackers is tantamount to a bug. A door was opened to some of the worst viruses when Microsoft began integrating Internet capability directly into its Office suite of products. The goal was to make its Windows and Office programs more convenient for users. But integration also made the hacker's job easier. Last Easter, the Melissa virus exploited the doors that integration left open. All the virus accomplished, in the end, was to generate chains of phony e-mail, clogging servers and pulling down some office networks. But in the hands of a different criminal, those e-mails could have been fake financial reports, generated on your computer, signed by you, and shipped from your e-mail address.

Melissa, a new virus called Bubble Boy, and numerous recent Web-site hacks and break-ins show that hackers are exploiting weaknesses in intertwined software products. "It's really a crisis, and as a community, we have no idea how to solve it," says Bruce Schneier, chief technical officer of Counterpane Internet Security Inc. in San Jose, Calif., and author of Applied Cryptography.

Because buggy software is a global headache, engineers around the world are mounting coordinated efforts to find remedies. And some of the results show promise. Many programmers are encouraged by the successes of the "open-source" movement--a sprawling, global confederation of software developers whose crowning achievement is the popular Linux operating system. With thousands of programmers pooling their skills to build and test such programs, bugs are discovered early "and the fixes are obvious to somebody," says open-source visionary Eric Raymond. "Given enough eyeballs," he contends, "all bugs are shallow."

In an ideal world, however, bugs would be rare and the energy now spent on fixing software could be channeled in more productive directions. The problem, according to many experts, is that writing software is mainly a shoot-from-the-hip affair. "It's like trying to build a bridge without an architectural plan," says Stephen E. Cross, director of Carnegie Mellon's Software Engineering Institute.

Today, engineers can design and test something as complex as a Boeing 777 in cyberspace. But paradoxically, that's not possible with big software programs. The physical laws governing how metal behaves when shaped into a plane and hurled through the air are well known. For software, there is no such body of basic science.

The National Science Foundation wants to take what is now a dark art and turn it into a structured discipline. In terms of both the challenges and the potential impact, this is the software equivalent of deciphering the human genome. The goal is to provide software engineers with the "genetic" information to create accurate models of various organs, or modules, that could be used over and over to assemble all kinds of systems. Sprinkle on some artificial intelligence in the form of software agents, and complex software might even write itself.

Imagine a library of debugged modules, each with its own smart agent. To produce a program, an engineer would simply specify the software's job, then the agents would coordinate among themselves to figure out how to patch together the desired result. "I'm optimistic that agents can make a significant difference," says T.W. Cook, director of software research at Microelectronics & Computer Technology Corp., an R&D consortium in Austin.

MILITARY VICTORIES. The U.S. Defense Dept. is also eager to codify software's basic laws. That's because its weapons require frequent software upgrades in order to stay in service for decades. To trim these costs, the government wants to capture the essence of its weapons in software models so simulations can determine what changes are needed and how best to implement them. The Defense Advanced Research Projects Agency is funding work at some 50 research labs under a four-year-old project called Evolutionary Design of Complex Software--and it is starting to rack up victories.

For example, Xerox Corp.'s Palo Alto Research Center recently produced a mathematical model for "constraint-based scheduling." This deals with regulating the sequence of operations inside a copier or a jet plane. "Historically, this has always been an extremely complex part of the code--and extremely hard to get right," says Gregor J. Kiczales, a senior scientist at PARC. "Now we can generate this code automatically."

Perhaps the toughest challenge in fixing software will be reducing vulnerability to viruses and other malicious attacks. One promising approach is to treat complex computer systems as rapidly evolving organisms and arm them with immune systems that can fight intrusions. At the University of New Mexico and the Santa Fe Institute, computer-science professor Stephanie Forrest is developing a model for one such system. She's crafting computer code that spots unusual activity in a software system, identifies it as an attack, and slows the suspect activity down to the point where it is benign. "Trying to respond to break-ins manually will no longer work," she says. "Each computer needs to take care of itself."
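One way to picture the spot-and-slow idea is the toy Python sketch below. It is an illustration only, not Forrest's code: the class name `SequenceMonitor`, the three-event window, the event names, and the doubling delay are all assumptions made for the example. The monitor learns which short sequences of events occur during normal operation, then stalls activity for longer and longer as unfamiliar sequences pile up.

```python
from collections import deque


class SequenceMonitor:
    """Hypothetical monitor: learns normal event sequences, slows unseen ones."""

    def __init__(self, window: int = 3, memory: int = 10):
        self.window = window                  # length of each event sequence
        self.normal = set()                   # sequences seen during training
        self.recent = deque(maxlen=memory)    # 1 = anomalous, 0 = normal

    def _windows(self, trace):
        # Every run of `window` consecutive events in the trace.
        return [tuple(trace[i:i + self.window])
                for i in range(len(trace) - self.window + 1)]

    def train(self, trace):
        """Record the sequences that occur in a known-good trace."""
        self.normal.update(self._windows(trace))

    def delay_for(self, sequence) -> float:
        """Seconds to stall the activity; doubles with each recent anomaly."""
        self.recent.append(0 if tuple(sequence) in self.normal else 1)
        anomalies = sum(self.recent)
        return 0.0 if anomalies == 0 else 0.01 * 2 ** anomalies


monitor = SequenceMonitor()
monitor.train(["open", "read", "write", "close", "open", "read", "close"])

# A familiar sequence is not delayed at all.
normal_delay = monitor.delay_for(["open", "read", "write"])

# A sequence never seen in training is delayed more on each repetition.
delays = [monitor.delay_for(["exec", "socket", "write"]) for _ in range(5)]
print(normal_delay, delays)
```

The design choice worth noting is that nothing is blocked outright: a false alarm costs a brief pause, while a sustained burst of unfamiliar behavior is slowed until it can do little harm.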

It may be a long time, however, before these and other research approaches trickle into the commercial software market. Meanwhile, software companies have shown little inclination to grapple with the factors that drag quality down. Indeed, the drift may be in the opposite direction. For months, software publishers have quietly been lobbying for legislation known as the Uniform Computer Information Transactions Act, or UCITA. Its impact would be to strip from consumers the means to take legal action when software failed to meet reasonable expectations for quality. "In the service of protecting the worst of the publishers, UCITA will change the economics of defective products for the field as a whole," says Cem Kaner, a Silicon Valley-based attorney specializing in software quality.

Certainly, the pressures that lead to poor software quality are likely to persist. And users bear part of the responsibility. "The customer wants new features," says Intuit's Scott Cook. Bugs, he says, "are the dark side of rapid innovation and entrepreneurship." The last thing the software industry needs, however, is a blame game. It must find the fixes that will bring software back into the light.