Flawed flight-control software on the Boeing 737 MAX contributed to two crashes that killed 346 people and destroyed two aircraft. Tesla recalled about 135,000 cars to fix a memory issue. New Jersey’s vaccine scheduling system created 70% more duplicate appointments, affecting many senior citizens. A software bug in the Therac-25 radiation therapy machine delivered massive radiation overdoses to six cancer patients, several of whom died.

Each of these problems arose from improper changes made to a system. System failures caused by improper changes have led to heavy losses: business losses, loss of life, eroded brand value, and lost customer confidence and trust. Until people saw failures like these, they assumed that software glitches, unlike hardware changes, were easy and cheap to fix, since a patch, quick fix or hotfix could resolve the problem. But a bug in production software can do enormous damage, and businesses simply cannot afford it.

This has put a great deal of focus on the people who develop software. Extensive research by teams like DORA (DevOps Research and Assessment) found that elite, high-performance teams that follow the right practices, such as DevOps, and use productivity tools build highly reliable software. The metric Change Failure Rate (CFR) is the percentage of changes made to production systems that result in failure. Elite teams are highly successful at keeping this score very low.

Fig: A software glitch sent the Boeing 737 MAX into a nose dive

CFR

Change failure rate (CFR) is the percentage of changes made to the production system that result in failure, that is, a degradation of the system from its previous stable state. Of all the changes (deployments) introduced into the system, it counts those that failed and had to be rolled back or remedied through hotfixes, patches or configuration changes.

CFR = Number of failed deployments (changes) / Total number of deployments (changes introduced into the production system), expressed as a percentage.

The changes introduced into the system could be configuration changes, new features, modifications to existing product features and functions, or even quick fixes and patches.

For example, if 2 out of 10 changes introduced into the system fail, the CFR is 2 / 10 = 20%.
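The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula; the function name change_failure_rate and its parameters are assumptions made for this example, not part of any DORA or Kaiburr tooling.

```python
def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Return the CFR as a percentage of all changes made to production."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= failed_changes <= total_changes:
        raise ValueError("failed_changes must be between 0 and total_changes")
    return 100.0 * failed_changes / total_changes


# The example from the text: 2 failed changes out of 10.
print(change_failure_rate(2, 10))  # 20.0
```

The function rejects a total of zero because the rate is undefined when no changes have been deployed.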

DORA’s observations across various teams on this metric are as follows.

The lower the CFR, the more successful the team: its deployments succeed and keep the system, by and large, stable and reliable. Customers prefer highly reliable systems, and elite teams always try to keep the system stable. Teams with a high CFR have a lot of room for improvement. They can upskill, learn the system better and collaborate more closely. Understanding the production system is important because even a flawless change can fail on under-capacity infrastructure. Changes should be well tested on pre-production or test systems before they are introduced. Teams that rush to launch changes and reach the market quickly may skimp on testing, and poorly tested changes can increase the CFR.

Causes of Change Failures

When a production system fails after changes are introduced (deployments made), the failure (degradation) may show up as slow throughput, resource hogging or malfunctions caused by semantic errors, or the changes may have impaired the system’s stable functioning in some other way. To bring the system back to normal (stable functioning), teams either roll back the changes or deploy hotfixes, quick fixes or patches.
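One way to count failed changes in practice is to treat any deployment that needed remediation (a rollback, hotfix, patch or configuration change) as a failure. The sketch below assumes a hypothetical record format; the names Deployment, Remediation and cfr_from_deployments are invented for illustration and do not describe any particular tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Remediation(Enum):
    """How a deployment was remedied, if at all (hypothetical categories)."""
    NONE = "none"
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"
    PATCH = "patch"
    CONFIG_CHANGE = "config_change"


@dataclass(frozen=True)
class Deployment:
    change_id: str
    remediation: Remediation = Remediation.NONE

    @property
    def failed(self) -> bool:
        # Assumption: a change counts as failed if it needed any remediation.
        return self.remediation is not Remediation.NONE


def cfr_from_deployments(deployments: Sequence[Deployment]) -> float:
    """Return the CFR as a percentage over the given deployment records."""
    if not deployments:
        raise ValueError("no deployments to measure")
    failed = sum(1 for d in deployments if d.failed)
    return 100.0 * failed / len(deployments)


# Ten deployments, of which one was rolled back and one needed a hotfix.
history = [Deployment(f"chg-{i}") for i in range(8)]
history.append(Deployment("chg-8", Remediation.ROLLBACK))
history.append(Deployment("chg-9", Remediation.HOTFIX))
print(cfr_from_deployments(history))  # 20.0
```

What counts as a failure is a team decision; the rule used here simply mirrors the remedies listed in the text.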

Most change (deployment) failures can be attributed to processes, the team’s capability and communication.

1. Processes need to be followed properly; otherwise even a flawless change can lead to failure. Suppose someone on the team does not update the release notes properly, and a change that introduces an integration event to pull data from another system through middleware fails as a result. The process needs to ensure that the middleware is functioning before the integration event runs.

2. The team’s capability is essential. Highly skilled, experienced people working in the right environment under good leadership can understand the system and all of its dependencies and introduce well-tested changes (deployments).

3. Communication is also very important, so that all stakeholders understand the changes and glitches are avoided during deployment.
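The process point above (confirm that the middleware is functioning before running the integration event) can be expressed as a simple pre-deployment gate. This is a minimal sketch under stated assumptions: the check functions are hypothetical stand-ins that return fixed values, and a real team would replace them with actual health checks.

```python
from typing import Callable, Dict, List


def failed_preconditions(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that did not pass."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated as a failed precondition.
            ok = False
        if not ok:
            failures.append(name)
    return failures


# Hypothetical stand-ins for real checks; the return values are hard-coded.
def middleware_is_up() -> bool:
    return False  # pretend the middleware is currently down


def release_notes_updated() -> bool:
    return True


checks = {
    "middleware is functioning": middleware_is_up,
    "release notes updated": release_notes_updated,
}

blocked_by = failed_preconditions(checks)
if blocked_by:
    print("Deployment blocked:", ", ".join(blocked_by))
else:
    print("All preconditions passed; safe to deploy")
```

Running every check, rather than stopping at the first failure, gives the team the full list of blockers in one pass.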

How to Improve CFR and Make Every System Change Successful?

Teams that practice DevOps and continuous improvement are highly successful in making changes. Tools that automate processes, along with AI/ML programs, help teams be very productive and make almost all of their changes (deployments) successful. Successful changes (deployments) bring down the CFR. Kaiburr products and practices can help bring down the CFR and help teams become elite, high-performance teams.

The Kaiburr AI engine captures change failures in real time and presents the ‘Change failure rate’ metric clearly, helping leaders make well-informed decisions.

Change Failure Rate

Kaiburr helps software teams measure and benchmark themselves on 350+ KPIs and 600+ best practices so they can continuously improve every day.

Reach us at contact@kaiburr.com to get started with metrics-driven continuous improvement in your organization.