Let me tell you a story of an agency data center director.
The CIO has asked her to achieve the following goals to support availability and an IT Infrastructure Library (ITIL) process improvement initiative:
- Set up a configuration management database (CMDB).
- Control it for drift.
- Protect key IT infrastructure.
- Ensure availability across the service stack.
- Ensure any outages can be quickly resolved.
Furthermore, she must show value within 90 days, because the agency CIO has had enough of never-ending process improvement initiatives with nothing to show for the effort. This sounds like a tough problem, doesn’t it?
Coming up with repeatable solutions to these challenges is what Kevin Behr and I set out to do in 2000 when we created the IT Process Institute (www.itpi.org). The goal was to advance the quantitative science in IT operations to help organizations become more effective, efficient and secure. Having studied high-performing IT organizations for eight years now, we can say with certainty that while all high-performing organizations face challenges similar to this data center director's, the way each dealt with the challenge was unique and set it apart from its peers.
When we started looking at these high-performing IT organizations, none had adopted ITIL and none had a CMDB. But they did use the concepts of a CMDB and used some of the processes that ITIL defined. In sum, we found that these organizations went through a similar journey and changed how they did work in the IT organization.
There are five steps this agency director can implement to reach her objectives. Each step is self-fueling in that it creates immediate value that will propel the agency to the next step.
First, high-performing IT organizations focus their attention on fragile artifacts and services that cause the most unplanned work or have the greatest impact on the business.
These fragile artifacts and services are the most likely to impair the business. You can identify them by a few key qualities. Specifically, they are the costliest to manage because they have the lowest change success rates and the highest mean time to repair (MTTR). The Pareto Principle tells us that 20 percent of the infrastructure causes 80 percent of the firefighting. High-performing organizations recognize this and actively address the 20 percent that is causing the most work.
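To make the identification concrete, here is a minimal sketch in Python of ranking assets by the two qualities named above: change success rate and MTTR. The asset names, the figures and the helper names (AssetRecord, rank_by_fragility, most_fragile) are all hypothetical, and the 20 percent cutoff is only the Pareto rule of thumb, not a measured threshold.

```python
from dataclasses import dataclass


@dataclass
class AssetRecord:
    """Change and repair history for one asset (all figures hypothetical)."""
    name: str
    changes_attempted: int
    changes_succeeded: int
    repair_hours: list  # hours to restore service, one entry per incident

    @property
    def change_success_rate(self) -> float:
        # An asset that was never changed has no failed changes on record.
        if self.changes_attempted == 0:
            return 1.0
        return self.changes_succeeded / self.changes_attempted

    @property
    def mttr_hours(self) -> float:
        # Mean time to repair: average of the recorded repair durations.
        if not self.repair_hours:
            return 0.0
        return sum(self.repair_hours) / len(self.repair_hours)


def rank_by_fragility(assets):
    """Most fragile first: lowest change success rate, then highest MTTR."""
    return sorted(assets, key=lambda a: (a.change_success_rate, -a.mttr_hours))


def most_fragile(assets, fraction=0.2):
    """Return the top `fraction` of assets by fragility (at least one)."""
    ranked = rank_by_fragility(assets)
    count = max(1, round(len(ranked) * fraction))
    return ranked[:count]


# Hypothetical inventory; real numbers would come from change and incident records.
ASSETS = [
    AssetRecord("payroll-db", 13, 3, [40, 60]),
    AssetRecord("web-frontend", 50, 48, [1, 2]),
    AssetRecord("mail-relay", 20, 19, [0.5]),
    AssetRecord("file-server", 10, 9, [3]),
    AssetRecord("auth-service", 8, 4, [12, 8]),
]

for asset in most_fragile(ASSETS):
    print(asset.name, round(asset.change_success_rate, 2), asset.mttr_hours)
```

A spreadsheet sorted on the same two columns would do the same job. The point is to let the change and repair history, rather than opinion, pick the short list.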
After identifying these fragile assets and sensitive services, high-performing IT organizations proactively identify all IT assets and infrastructure items that make up the service and establish a policy as to how these assets will be treated. The policy specifically indicates that there will be no tolerance for unauthorized changes to these assets.
Second, strong performers take all necessary steps to ensure the fragile artifacts aren’t changed, because they already know that changing them is prone to cause catastrophic episodes of unplanned work. The evidence of this is everywhere: placing yellow sticky notes on the physical assets saying, “Don’t Touch,” deploying Configuration Audit and Control software to ensure that all employees follow the policy, and flagging these artifacts during the change authorization process. After all, what is the value of not making a change to a fragile IT item that we know has an average change success rate of 23 percent and an average repair time of 12 weeks? What is the value of averting that tremendous amount of unplanned work?
It is far more important to understand the dependencies between configuration items (CIs) and the service than where you put the information. The appropriate strategy for generating quick wins within 90 days is to focus on a few fragile artifacts, understand their dependencies and use that information to make better change decisions. When organizations realize this, they often think: “We don’t have anything better than a spreadsheet right now, but we will use it to get breakthrough results in 90 days, even though it is manual and definitely not pretty. After we generate some quick wins, we will have a far better idea of what the real requirements for a CMDB tool should be.”
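As a rough illustration of how little tooling this takes, the sketch below keeps the service-to-CI dependencies in a plain Python dictionary, much as a spreadsheet would, and uses it to answer the question that matters at change authorization time: which services are at risk if this CI is touched? The service names, CI names and function names are hypothetical.

```python
# Hypothetical service-to-CI dependency map, the kind of thing a spreadsheet can hold.
SERVICE_DEPENDENCIES = {
    "payroll": {"payroll-db", "app-server-2", "core-switch-1"},
    "email": {"mail-relay", "core-switch-1"},
    "intranet": {"web-frontend", "app-server-2"},
}

# CIs covered by the "no unauthorized change" policy (hypothetical).
FRAGILE_CIS = {"payroll-db", "core-switch-1"}


def services_affected_by(ci_name, dependencies=SERVICE_DEPENDENCIES):
    """Return the services that depend on the given CI, in alphabetical order."""
    return sorted(
        service for service, cis in dependencies.items() if ci_name in cis
    )


def review_change_request(ci_name):
    """Summarize what a change reviewer needs to know about a proposed change."""
    return {
        "ci": ci_name,
        "fragile": ci_name in FRAGILE_CIS,
        "services_at_risk": services_affected_by(ci_name),
    }


print(review_change_request("core-switch-1"))
```

Once a handful of fragile artifacts are mapped this way, the team learns which lookups it actually uses, and that is the input a later CMDB tool selection needs.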
Conversely, the approach that doesn’t work is spending 90 days creating flowcharts, developing process design, creating data models, debating which team will populate the CMDB and spending unnecessary time inventorying thousands of CIs that no one will even look at, let alone care about.
Third, high-performing organizations ensure that there are no unauthorized changes in their environment. Whenever a system crashes or fails, these organizations immediately suspect that there has been an unauthorized change and take decisive action, because there are consequences for making unauthorized changes.
This approach quickly sidelines repeat offenders, who are moved into roles where they are not able to make changes to production. The attitude in these organizations is: “Making careless changes may be fine for a hobby, but it is not acceptable for our mission-critical infrastructure.”
Fourth, when an outage or impairment for a service does happen, a high-performing organization focuses first and foremost on change, because it knows that 80 percent of all service impairment is caused by unmanaged change. Said another way, change is ruled out first in the repair cycle.
To do this, high performers create a relevant timeline of all changes on the assets that the service relies upon. This timeline presents a list of authorized and scheduled changes from the change management system, as well as all detected changes on those systems from change and configuration monitoring tools. They can quickly determine whether change was a root cause. Because the vast majority of troubleshooting is spent identifying the cause of an issue, high performers can quickly identify the change that caused the problem, or rule it out as the source, and move forward to investigate other potential causes.
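The timeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record types (AuthorizedChange, DetectedChange), the build_timeline function, the 72-hour lookback, the ticket number and all sample data are hypothetical, and a real change management or monitoring tool will expose its data differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AuthorizedChange:
    """A scheduled change from the change management system (hypothetical shape)."""
    ticket: str
    asset: str
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class DetectedChange:
    """A change observed by a configuration monitoring tool (hypothetical shape)."""
    asset: str
    detected_at: datetime
    detail: str


def build_timeline(service_assets, authorized, detected, outage_start,
                   lookback=timedelta(hours=72)):
    """List detected changes on the service's assets before the outage, oldest first.

    Each entry is (detected_at, asset, detail, ticket); ticket is None when no
    authorized change window on that asset covers the detection time.
    """
    earliest = outage_start - lookback
    timeline = []
    for change in detected:
        if change.asset not in service_assets:
            continue
        if not (earliest <= change.detected_at <= outage_start):
            continue
        ticket = next(
            (a.ticket for a in authorized
             if a.asset == change.asset
             and a.window_start <= change.detected_at <= a.window_end),
            None,
        )
        timeline.append((change.detected_at, change.asset, change.detail, ticket))
    return sorted(timeline, key=lambda entry: entry[0])


# Hypothetical data for a payroll service outage.
outage = datetime(2024, 1, 10, 14, 0)
payroll_assets = {"payroll-db", "app-server-2"}
authorized = [
    AuthorizedChange("CHG-101", "payroll-db",
                     datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 9, 23, 0)),
]
detected = [
    DetectedChange("app-server-2", datetime(2024, 1, 10, 11, 15), "config file edited"),
    DetectedChange("payroll-db", datetime(2024, 1, 9, 22, 30), "schema patch applied"),
    DetectedChange("mail-relay", datetime(2024, 1, 10, 12, 0), "package upgraded"),
    DetectedChange("payroll-db", datetime(2024, 1, 1, 8, 0), "index rebuilt"),
]

timeline = build_timeline(payroll_assets, authorized, detected, outage)
unauthorized = [entry for entry in timeline if entry[3] is None]
for entry in unauthorized:
    print("No matching authorization:", entry[1], entry[2])
```

Detected changes with no matching ticket are the first suspects. If the list comes back empty, change has been ruled out and the team can move on to other causes.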
Last, high performers focus on creating repeatable, stable and secure builds for each fragile IT asset. This step helps an IT organization gain mastery of each configuration, create maximum time limits for MTTR and reduce the skill level needed to do repairs, deal with human error and handle unplanned work.
By replacing fragile IT assets with stable, secure builds, high performers drastically reduce the number of unique configurations, and, by extension, configuration mastery and change success rates climb. All this leads to making sure unplanned work in the organization stays at 5 percent or less. Typically, medium- and low-performing organizations spend between one-third and one-half of their time on unplanned work.
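The unplanned work figure is simple arithmetic, shown here as a small Python sketch. The function names and the sample hours are hypothetical; the 5 percent target is the one cited above.

```python
def unplanned_work_share(planned_hours, unplanned_hours):
    """Fraction of total logged hours that went to unplanned work."""
    total = planned_hours + unplanned_hours
    if total == 0:
        return 0.0
    return unplanned_hours / total


def meets_high_performer_target(planned_hours, unplanned_hours, target=0.05):
    """True when unplanned work is at or below the target share."""
    return unplanned_work_share(planned_hours, unplanned_hours) <= target


# Hypothetical month for a ten-person team: 1,600 logged hours in total.
print(unplanned_work_share(1000, 600))         # share for a team spending 600 hours firefighting
print(meets_high_performer_target(1000, 600))  # compared against the 5 percent target
```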
Returning to our fictional data center director, it becomes clear that although she was tasked with setting up a CMDB, the first step toward ensuring a secure and reliable data center is not the CMDB itself. In fact, analysts at Forrester, Enterprise Management Associates and Gartner agree that control over the change and configuration management processes must come first, or the CMDB will be like the proverbial house built on sand.
By following these steps, you can start down a path to create a culture of change management and causality. Monitoring for change, creating change authorization workflow tools and deploying a CMDB all help ensure organizations halt unauthorized or undocumented changes, escalate incidents appropriately and ensure that employees make the first fix appropriately and in the shortest amount of time. A CMDB alone will not solve these problems. In fact, if our data center director were to dive headlong into a CMDB implementation without first taking the other necessary steps, she would soon become a statistic, joining many other IT teams that tried to address these issues with technology without first establishing a culture of change management.