Can a safety-critical system be over-engineered?

Too much of a good thing?
It's a rhetorical question, of course. But hear me out.

As you can imagine, many safety-critical systems must be designed to handle scenarios well outside their intended scope. For instance, in many jurisdictions, passenger elevators must be capable of handling 11 times more weight than their recommended maximum — you just never know what people will haul into an elevator car. So, if the stated limit for a passenger elevator is 2000 pounds, the actual limit is closer to 22,000 pounds. (Do me a favor and avoid the temptation to test this for yourself.)

Nonetheless, over-engineering can sometimes be too much of a good thing. This is especially true when an over-engineered component imposes an unanticipated stress on the larger system. In fact, focusing on a specific safety issue without considering overall system dependability can sometimes yield little or no benefit — or even introduce new problems. The engineer must always keep the big picture in mind.

Case in point: the SS Eastland. In 1915, this passenger ship rolled over, killing more than 840 passengers and crew. The Eastland Memorial Society explains what happened:

    "...the Eastland's top-heaviness was largely due to the amount and weight of the lifeboats required on her... after the sinking of the Titanic in 1912, a general panic led to the irrational demand for more lifesaving lifeboat capacity for passengers of ships.
    Lawmakers unfamiliar with naval engineering did not realize that lifeboats cannot always save all lives, if they can save any at all. In conformance to new safety provisions of the 1915 Seaman’s Act, the lifeboats had been added to a ship already known to list easily... lifeboats made the Eastland less not more safe..."

There you have it. A well-intentioned safety feature that achieved the very opposite of its intended purpose.

Fast forward to the 21st century. Recently, my colleague Chris Hobbs wrote a whitepaper on how a narrow design approach can subtly work its way into engineering decisions. Here's the scenario he uses for discussion:

    "The system is a very simple, hypothetical in-cab controller (for an equally hypothetical) ATO system running a driverless Light Rapid Transit (LRT) system...
    Our hypothetical controller has already proven itself in Rome and several other locations. Now a new customer is considering it for an LRT ATO in the La Paz-El Alto metropolitan area in Bolivia. La Paz-El Alto has almost 2.5 million inhabitants living at an elevation that rises above 4,100 meters (13,600 ft.—higher than Mount Erebus). This is a significant change in context, because the threat of soft and hard memory errors caused by cosmic rays increases with elevation. The customer asks for proof that our system can still meet its safety requirements when the risk of soft memory errors caused by radiation is included in our dependability estimates..."

So where should the engineer go from here? How can he or she ensure that the right concerns are being addressed? That is what Chris endeavours to answer. (Spoiler alert: The paper determines that, in this hypothetical case, software detection of soft memory errors isn't a particularly useful solution.)
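To make the trade-off a little more concrete, here is a minimal sketch (in C) of one common form that software detection of soft memory errors can take: keep a CRC alongside a block of critical data, refresh it on every legitimate write, and verify it on a periodic scrub. This is purely illustrative and not taken from Chris's paper; the controller fields, the names such as state_seal and state_check, and the choice of CRC-32 are assumptions made for the example.

    /*
     * Illustrative sketch only -- not the approach from Chris Hobbs's whitepaper.
     * A CRC is stored alongside critical data, refreshed on every legitimate
     * write, and verified on a periodic scrub. A mismatch suggests the data
     * was corrupted, for example by a radiation-induced bit flip.
     */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical critical state for the in-cab controller. */
    typedef struct {
        uint32_t target_speed_cm_s;   /* commanded speed               */
        uint32_t brake_pressure_kpa;  /* commanded brake pressure      */
        uint32_t crc;                 /* CRC-32 over the fields above  */
    } critical_state_t;

    /* Straightforward bitwise CRC-32 (IEEE polynomial, reflected). */
    static uint32_t crc32_ieee(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++) {
                if (crc & 1u)
                    crc = (crc >> 1) ^ 0xEDB88320u;
                else
                    crc >>= 1;
            }
        }
        return ~crc;
    }

    /* Recompute the CRC after every legitimate update of the state. */
    static void state_seal(critical_state_t *s)
    {
        s->crc = crc32_ieee(s, offsetof(critical_state_t, crc));
    }

    /* Periodic scrub: returns false if the stored CRC no longer matches. */
    static bool state_check(const critical_state_t *s)
    {
        return s->crc == crc32_ieee(s, offsetof(critical_state_t, crc));
    }

    int main(void)
    {
        critical_state_t state = { .target_speed_cm_s = 1500,
                                   .brake_pressure_kpa = 0 };
        state_seal(&state);

        /* Simulate a single-bit upset in a critical field. */
        state.target_speed_cm_s ^= (1u << 12);

        if (!state_check(&state))
            printf("soft error detected: fall back to a safe state\n");
        return 0;
    }

Even in this toy form, the costs are visible: extra CPU and memory on every update and scrub, detection only at the moment of checking, and no protection for the code that performs the check itself. That hints at why a blanket software fix may add less dependability than it appears to — which is exactly the kind of big-picture question the paper works through.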

Highly recommended.