So they built a V1 product just to get something out, and it was thrown together quickly. They thought maybe 10,000 people would use the app. It turned out the device was really good and people really liked the app, and a million people wanted to use it. The problem was not just that the app was never designed for that number of users; neither was the data storage or any of the infrastructure. The engineering systems themselves were never built to support that many customers.
Next I asked Aaron to weigh in on whether it is true that devops is what happens when the engineering culture doesn’t take care of or consider the infrastructure.
To him, devops looks like this: you’ve got dev and you’ve got ops. In between the two is a middle ground called deployment, where you are done building something but it’s not yet running on a production cloud or similar environment. More generally, he considers devops a family of technologies and design approaches that together work like one big factory.
He tells the story of the Airbus A400M that fell from the sky in 2015 during a test flight. For that flight, new software had been installed on the tiny computer inside each engine that controls the fuel flow. The software must have passed its tests, because it was used in the build.
So they took it for a test run one morning with the new software added to the engines. The plane took off fine, but when the pilot adjusted the throttles downward, all the engines went into idle except the one without the new software. Three engines were stuck at idle, and no matter what the pilot did with the controls he couldn’t get any more fuel into them. As a result, the plane couldn’t maintain airspeed and it fell from the sky, killing four of the six crew members aboard.
It turns out the software was fine; the problem was in the deployment. The person who installed the software update on the engines did so using a non-automated process, and he missed a step. One file never got copied onto the engines: the data file that said how much fuel to put into them. As a result, the engines couldn’t operate, and there wasn’t a thorough enough system test to discover the error beforehand.
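To make the idea of an automated deployment check concrete, here is a minimal Python sketch. It is not how Airbus or FP Complete does it; it only illustrates the general technique of building a manifest of every file in a release and then verifying that each one actually arrived on the target. The function names (`build_manifest`, `verify_deployment`) and the directory layout are hypothetical, chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(source_dir: Path) -> dict[str, str]:
    """Map every file in the release (code AND data) to its checksum."""
    return {
        p.relative_to(source_dir).as_posix(): sha256_of(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }


def verify_deployment(manifest: dict[str, str], target_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the deploy is complete."""
    problems = []
    for rel_path, expected in manifest.items():
        deployed = target_dir / rel_path
        if not deployed.is_file():
            # The file was never copied to the target.
            problems.append(f"missing: {rel_path}")
        elif sha256_of(deployed) != expected:
            # The file exists but its contents differ from the release.
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

A deployment script could call `verify_deployment` as its last step and refuse to mark the release as live unless the returned list is empty. The point is that a missed copy step becomes a loud failure instead of something a person has to notice.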
It’s a tragic story highlighting the importance of every aspect of development, deployment and operations.
We continue with that discussion before segueing into functional programming, why his company provides commercial Haskell tooling and services, and what it’s been like being CEO of FP Complete. You’ll hear from Aaron and me on those topics and more on today’s CTO Studio.