It's hardly a fresh thought that business is a jungle, so understanding the laws of the jungle may help you in business. Likewise, the culture and learning of almost any sphere can apply to business, from what you learned in kindergarten to the habits of highly effective people. I do think, however, that it helps to be in a receptive state of mind when you encounter those ideas, because making the connections that can help you meet your goals is far from a given.
Over the past few weeks I had two reading experiences in which I felt that important things were being revealed: one a direct, concrete example and one a little more abstract, but still mighty interesting.
The direct one was the report of the CAIB, the Columbia Accident Investigation Board. I glanced at the report when it came out to understand why the orbiter broke up, but when I reread it carefully, I realized that it could apply to any mission-critical work (such as our data centers) and to any organization that has to deal with complex decision making. In fact, I now think this report is a treasure trove: a blueprint of the best thinking available on these subjects.
The story of the accident and its reconstruction is as gripping as any episode of CSI you ever saw: it starts with the corpus and ends with the culprits, using tiny bits of information arrayed into an inescapable pattern to reveal the answer.
What happened is told here in my words, but WHY it happened will be told in the words of the CAIB.
So what happened? A chunk of foam insulation shed from the external fuel tank shortly after launch and holed the leading edge of the left wing at panel 8. NASA had previously thought that the foam could not hit that hard. In fact, it could hit very hard, because the shuttle hit the foam and not the other way around; the wing can't tell the difference. A minute into flight, the shuttle is moving so fast that it literally ran into the foam at around 775 feet per second. For comparison, that's the speed of a .22 pistol bullet, except that this "bullet" weighs not 2 grams but over 500 grams (more than a pound). Yes, foam insulation (like the kind in your beverage cooler) can tear things up when it is slammed into something fragile at the speed of a bullet.
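To put the bullet comparison in numbers, here is a small back-of-the-envelope sketch. It assumes only the figures quoted above (a 775 ft/s closing speed, a 2 g bullet, a 500 g chunk of foam) and the classical kinetic-energy formula KE = ½mv². At the same speed, energy scales linearly with mass, so the foam carries about 250 times the energy of the bullet.

```python
# Back-of-the-envelope kinetic energy comparison, assuming the figures
# quoted in the text: 775 ft/s closing speed, 2 g bullet, 500 g foam chunk.

FT_TO_M = 0.3048  # exact definition of the international foot, in meters


def kinetic_energy_joules(mass_kg: float, speed_m_s: float) -> float:
    """Classical kinetic energy, KE = 1/2 * m * v^2."""
    return 0.5 * mass_kg * speed_m_s ** 2


speed = 775 * FT_TO_M            # closing speed in m/s (~236 m/s)
speed_mph = 775 * 3600 / 5280    # same speed in miles per hour (~528 mph)

bullet_j = kinetic_energy_joules(0.002, speed)  # 2 g bullet
foam_j = kinetic_energy_joules(0.500, speed)    # 500 g of foam

print(f"Closing speed: {speed:.0f} m/s ({speed_mph:.0f} mph)")
print(f"Bullet: {bullet_j:.0f} J, foam: {foam_j:,.0f} J")
print(f"Foam carries {foam_j / bullet_j:.0f}x the energy of the bullet")
```

It should report roughly 56 J for the bullet against about 13,950 J for the foam. Kinetic energy alone says nothing about how a soft, light material delivers that energy on impact, so treat this as an order-of-magnitude illustration rather than an impact analysis.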
Upon re-entry, a superheated plasma jet blew through the wing, melted its supports, and cut it off like a torch. Around 900 seconds after entry, the wing went by the board, and the shuttle tumbled and broke up at 200,000 feet and Mach 19. The crew cabin may have been intact all the way down, as Challenger's was. In fact, the Columbia and Challenger accidents were so similar (although they occurred at different flight stages) that the CAIB treats them virtually as one and the same.
The reconstruction of the accident is complete. The board pinpointed the exact sequence and location of the failure rigorously, through multiple independent means: mapping the debris west to east across Texas, complex fault trees, chemical studies of deposited burn products, telemetry and timeline reconstruction, radar and signal studies, and many others. These people know what they are doing.
The value of this work for mission-critical businesses is that the board used the same structured method to examine the organization behind the effort, and the failure modes in communication and decision making that led to the loss (in NASA-speak, LOCV/VBD: Loss of Crew and Vehicle, a Very Bad Day).
Why the Accident Occurred, from the report:
“Attempting to manage high-risk technologies while minimizing failures is an extraordinary challenge. By their nature, these complex technologies are intricate, with many interrelated parts. Standing alone, the components may be well understood and have failure modes that can be anticipated.
Yet when these components are integrated into a larger system, unanticipated interactions can occur that lead to catastrophic outcomes. The risk of these complex systems is increased when they are produced and operated by complex organizations that also break down in unanticipated ways.”
“In our view, the NASA organizational culture had as much to do with this accident as the foam. Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of an institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work. It is a powerful force that can persist through reorganizations and the change of key personnel. It can be a positive or a negative force.”
Organizational causes of the accident:
“organizational barriers which prevented effective communication of critical safety information and stifled professional differences of opinion; lack of integrated management across program elements; and the evolution of an informal chain of command and decision-making processes that operated outside the organization’s rules.” The fix: establish an independent Technical Engineering Authority charged “to build a disciplined, systematic approach to identifying, analyzing, and controlling hazards.”
These failures applied to both Columbia (foam) and Challenger (O-rings).
“The history of engineering decisions on foam and O-ring incidents had identical trajectories that “normalized” these anomalies, so that flying with these flaws became routine and acceptable.”
“From the beginning, NASA’s belief about both these problems was affected by the fact that engineers were evaluating them in a work environment where technical problems were normal. Although management treated the Shuttle as operational, it was in reality an experimental vehicle. Many anomalies were expected on each mission. Against this backdrop, an anomaly was not in itself a warning sign of impending catastrophe. Foam debris and eroding O-rings were defined as nagging issues of seemingly little consequence.”
“A perennially weakened safety system, unable to critically analyze and intervene, had no choice but to ratify the existing risk assessments on these two problems. The following comparison shows that these system effects persisted through time, and affected engineering decisions in the years leading up to both accidents. NASA was transformed from a research and development agency to more of a business, with schedules, production pressures, deadlines, and cost efficiency goals elevated to the level of technical innovation and safety goals. When pressed for cost reduction, NASA attacked its own safety system.”
NASA’s culture of bureaucratic accountability emphasized chain of command, procedure, following the rules, and going by the book. While rules and procedures were essential for coordination, they had an unintended but negative effect. Allegiance to hierarchy and procedure had replaced deference to NASA engineers’ technical expertise. The organizational structure and hierarchy blocked effective communication of technical problems. Signals were overlooked, people were silenced, and useful information and dissenting views on technical issues did not surface at higher levels. What was communicated to parts of the organization was that O-ring erosion and foam debris were not problems.
What do people in your organization take for granted that could actually bite you?
As what the Board calls an “informal chain of command” began to shape the outcome, location in the structure empowered some to speak and silenced others. For example, a Thermal Protection System tile expert, who was a member of the Debris Assessment Team but had an office in the more prestigious Shuttle Program, used his personal network to shape the Mission Management Team view and snuff out dissent.
Which know-it-all in your organization is downplaying serious problems and working that view into the consensus understanding?
“Strategies must increase the clarity, strength, and presence of signals that challenge assumptions about risk. Twice in NASA history, the agency embarked on a slippery slope that resulted in catastrophe. In both pre-accident periods, events unfolded over a long time and in small increments rather than in sudden and dramatic occurrences. NASA’s challenge is to design systems that maximize the clarity of signals, amplify weak signals so they can be tracked, and account for missing signals.”
Trade ‘weak signals’ for ‘what the little people think’ in your organization. All you have to do is ask.
The board makes that connection explicitly:
“It is obvious but worth acknowledging that people who are marginal and powerless in organizations may have useful information or opinions that they don’t express. Even when these people are encouraged to speak, they find it intimidating to contradict a leader’s strategy or a group consensus. Extra effort must be made to contribute all relevant information to discussions of risk. Adopt and maintain a schedule that is consistent with available resources. Although schedule deadlines are an important management tool, those deadlines must be regularly evaluated to ensure that any additional risk incurred to meet the schedule is recognized, understood, and acceptable.”
I think that's some pretty solid management advice all the way around.
The other item I found interesting was a statistical model developed for baseball that attempts to measure a player’s “grittiness.” We all know what a gritty player is: someone who hustles all the time. Charlie Hustle himself, Pete Rose, is the exemplar of the breed, but it turns out that the stats say otherwise. The effort provides some interesting thinking on the value of determination vs. talent, and of getting it done ugly vs. getting it done efficiently.
After the success of the Red Sox, I think Moneyball is here to stay for a while.
Finally, regardless of management style, a poor core idea at the start may echo in different, but equally nasty, ways throughout the life of a project. For example, it's always a bad idea to put a crew vehicle anywhere but at the very top of a launch stack, and I doubt we will ever again see a human-carrying vehicle on the side of a stack. Where do you stand if your stack blows up or starts shedding parts?