No-fault go-arounds

In aviation, if you think there’s going to be a problem with a landing or a take-off, you stop, pull away, and try again.  There are two somewhat conflicting quotes that I love:

If you can walk away from a landing, it’s a good landing. If you use the airplane the next day, it’s an outstanding landing.
– Chuck Yeager

A superior pilot uses his superior judgment to avoid situations which require the use of his superior skill.
– Frank Borman

So the Yeager quote is humorous, and I think it’s great for test pilots (in technology, the analogy would be your development and test environments): do what you can and try it out, because the data and equipment aren’t sacred.  But if you’re flying under Part 121, the rules for airline operations (by analogy, your production environment), there is no room for tomfoolery. And for aviation, the proof is in the pudding: we’ve had exactly two Part 121 deaths since 2009.

I work in DevOps, and I recently had a significant disagreement over tactics with a product owner who really wanted to just get the thing deployed, because $BusinessReason. She was taking the Yeager approach in production, and didn’t want to hear what I had to say. It was a swap-type deployment, and my methodology was, “We flipped the switch and it doesn’t work out of the box, so it’s time to go around. Flip the switch back and try again once we have a better understanding of the system.” But the product owner was so fearful of a rollback that she termed it the “R-word”. My response was that the “S-word” (severe incident) is worse, and that we were taking unplanned and undocumented risk by playing with the system.
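To make the go-around idea concrete, here is a minimal sketch of a swap-type deployment that flips back as soon as a post-swap check fails. It is an illustration only, not the tooling we actually used: the names swap_with_go_around, swap, swap_back, and is_healthy are hypothetical, and in a real system the three callables would wrap whatever your platform provides for switching traffic and probing health.

```python
from typing import Callable


def swap_with_go_around(
    swap: Callable[[], None],
    swap_back: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> bool:
    """Flip the switch; if any post-swap check fails, flip it back.

    Returns True if the new version stays live, False if we went around.
    All three callables are hypothetical stand-ins for real platform calls.
    """
    swap()
    for _ in range(checks):
        if not is_healthy():
            # No debugging in production: restore the known-good state first.
            swap_back()
            return False
    return True


if __name__ == "__main__":
    # Toy stand-in for a blue/green style switch.
    live = {"slot": "blue"}
    stayed = swap_with_go_around(
        swap=lambda: live.update(slot="green"),
        swap_back=lambda: live.update(slot="blue"),
        is_healthy=lambda: False,  # simulate "doesn't work out of the box"
    )
    print(stayed, live["slot"])  # prints: False blue
```

The design choice worth noticing is that the failure path contains no investigation at all. The function restores the previous state and reports that it did so; understanding why the check failed happens afterwards, outside production.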

The point is that a “rollback” should be just like a go-around: if you’re not sure, pause and try again. My suggestion is that in most organizations, you’d be well-advised to follow Borman’s ideas for production systems.