- "We all strive to write bug-free code. But in the real world, bugs can and do occur. Rather than pretend this isn't so, we should adopt a mission-critical mindset and create software architectures that can contain errors and recover from them intelligently."
The "he" in question is my late (and great) colleague Dan Hildebrand. I'm sure that Dan's original sentences were more nuanced and to the point. But the important thing is that he grokked the importance of "culture" when it comes to designing software for safety-critical systems. A culture in which the right attitudes and the right questions, not just the right techniques, are embraced and encouraged.
Which brings me to a paper written by my inimitable colleagues Chris Hobbs and Yi Zheng. It's titled "Ten truths about building safe embedded software systems" and, sure enough, the first truth is about culture. I quote:
- "A safety culture is not only a culture in which engineers are permitted to raise questions related to safety, but a culture in which they are encouraged to think of each decision in that light..."
I was particularly delighted to read truth #5, which echoes Dan's advice almost word for word:
- "Failures will occur: build a system that will recover or move to its design safe state..."
I also remember Dan writing about the importance of software architectures that allow you to diagnose and repair issues in a field-deployed system. Which brings us to truth #10:
- "Our responsibility for a safe system does not end when the product is released. It continues until the last device and the last system are retired."
Dan argued for the importance of these truths in 1993. If anything, they are even more important today, when so much more depends on software. If you care about safe software design, you owe it to yourself to read the paper.