Steve Smith

On Tech

IT as a Business Differentiator

How can Continuous Delivery power innovation in an organisation?

When an organisation is in a state of Continuous Delivery, its technology strategy can be described as IT as a Business Differentiator. IT staff will work in one or more product departments, which are accounted for as profit centres in which profits are generated from incoming revenues and outgoing costs. A profit centre provides services to customers, and is responsible for its own investment decisions and profitability.

IT as a Business Differentiator promotes IT to be a front office function. There will be a rolling budget, and a rolling plan consisting of dynamic product areas with scope, resources, and deadlines constantly refined by feedback. Long-lived, outcome-oriented delivery teams will implement experiments to find product/market fit for a particular business capability.

This is in direct contrast to Nicholas Carr’s 2003 proclamation that IT Doesn’t Matter, to which history has not been kind. Carr failed to predict the rise of Agile Development, Lean Product Development, and in particular Cloud Computing, which has commoditised many lower-order technology functions. These advancements have contributed to the ongoing software revolution termed “Software Is Eating The World” by Marc Andreessen in 2011, which has caused a profound economic and technological shift in society.

Continuous Delivery as the norm

IT as a Business Differentiator is an Internet-inspired, 21st century technology strategy in which IT contributes to uncovering new revenue streams that increase overall profitability for an organisation. This means executives and managers are incentivised to maximise revenue-generating activities as well as to control cost-generating activities.

Continuous Delivery is table stakes for IT as a Business Differentiator, as IT executives and managers are accountable for delays between ideation and customer launch. There will be an ongoing investment in technology and organisational change, to ensure deployment throughput meets market demand. There will be a focus on optimising flow by eliminating handoffs, reducing work in progress, and removing wasteful activities. The reliability strategy will be to Optimise For Resilience, in order to minimise failure response time and blast radius.

IT as a Business Differentiator and Continuous Delivery were validated by Dr. Nicole Forsgren et al, in the 2018 book Accelerate. Surveys of 23,000 people working at 2,000 organisations worldwide revealed:

  • Continuous Delivery results in high performance IT
  • High performance IT leads to simultaneous improvements in the stability and throughput of IT delivery, without trade-offs
  • High performance IT means an organisation is twice as likely to exceed profitability, market share, and productivity goals
  • Continuous Delivery also results in less rework, an improved organisational culture, reduced team burnout, and increased job satisfaction

Leaving IT As A Cost Centre

If an organisation has institutionalised IT as a Cost Centre as its technology strategy, moving to IT as a Business Differentiator would be difficult. It would require an executive-level decision, in one of the following scenarios:

  • Competition – rival organisations are increasing their market share
  • Cognition – IT is recognised as the engine of future business growth
  • Catastrophe – a serious IT failure has an enormously negative financial impact

If the executive leadership of the organisation agree there is an existential crisis, they should publicly commit to IT as a Business Differentiator. That should include an ambitious vision of success that explains the current crisis, describes a state of future economic prosperity, and injects a sense of urgency into the day-to-day work of personnel.

There is no recipe for moving from IT as a Cost Centre to IT as a Business Differentiator. As a complex, adaptive system, an organisation will have a dispositional state of time-dependent possibilities, rather than linear cause and effect. A continuous improvement method such as the Improvement Kata should be used to experiment with different changes. Experiments could include:

  • co-locating IT delivery teams with their product stakeholders
  • removing cost accounting metrics from IT executive incentives
  • creating a Digital department of product and IT staff, as a profit centre

This leaves the open question of whether an IT department should adopt Continuous Delivery before, during, or after a move from IT as a Cost Centre to IT as a Business Differentiator.

Further Reading

  1. The Principles Of Product Development Flow by Don Reinertsen
  2. Measuring Continuous Delivery by the author
  3. Lean Enterprise by Jez Humble, Joanne Molesky, and Barry O’Reilly
  4. Utility vs. Strategic Dichotomy by Martin Fowler
  5. Products Not Projects by Sriram Narayan

Acknowledgements

Thanks to Thierry de Pauw for his feedback.

IT as a Cost Centre

Why does Continuous Delivery encounter resistance from IT executives and managers in so many organisations, and why is it so difficult to implement? Why does IT as a Cost Centre result in long-term Discontinuous Delivery?

Introduction

When an organisation is in a state of Discontinuous Delivery, its organisational model can usually be described as IT as a Cost Centre. There will be a functionally segregated IT department that is accounted for as a cost centre, in which costs can be incurred but profits cannot be generated. The IT department will provide services to product departments, which are accounted for as profit centres responsible for investment decisions and profitability.

IT as a Cost Centre relegates IT to a back office function. Each year, the IT department will be allocated a fixed budget to deliver a set of projects required by the product departments. Each project will represent a large batch of pre-agreed requirements. Short-lived project teams will try to deliver those requirements with scope, resources, and deadlines all fixed in advance.

This dovetails with the traditional view of IT as a universal commodity, popularised by Nicholas Carr in IT Doesn’t Matter in 2003. Carr argued IT was merely an infrastructural technology, and concluded it was a cost of doing business that could not provide a sustained competitive advantage.

Cost accounting suzerainty

IT as a Cost Centre is a pre-Internet organisational model from the mid-20th century that still persists today. The world is now reliant on, and connected by, technology, yet in a 2014 CIO survey of 700+ organisations, 48% still had IT as a cost centre.

In The Cost Centre Trap, Mary Poppendieck traces the popularity of IT cost centres back to the ubiquity of cost accounting. Poppendieck describes how the performance of an IT cost centre is measured solely in terms of cost management. This means accounting metrics percolate into the performance metrics of IT executives and managers, creating a culture of cost control with little regard for product development or organisational performance. These incentives will be markedly different to those of product executives and managers, and usually contribute to an adversarial relationship between product departments and IT.

In cost accounting, inventory is tracked as an asset, maximum resource utilisation is encouraged, and development work is capitalised until production launch. These factors create a hidden bias towards the institutionalisation of large projects, due to their perceived economies of scale. In reality, the project delivery model is an ineffective, inefficient vehicle that impedes value, quality, and flow.

Discontinuous Delivery as the norm

A Continuous Delivery programme is an unending journey of continuous improvement that requires a substantial investment, in order to achieve a time to market that can improve product/market fit and increase revenues.

This is likely to be a hard sell in an organisation with IT as a Cost Centre. IT executives and managers will be incentivised to reduce costs wherever possible, while delivering projects that are supposedly on time, on scope, and on budget. As a result, there will be resistance to the idea of spending money on an internal programme with an explicit goal of improving organisational performance and no fixed end date.

Delivery teams will be short-lived, which encourages people to prioritise short-term feature work over long-term architectural work, restricting the deployability and testability of different services. The reliability strategy will be to Optimise For Robustness, which increases lead times and failure blast radius. Furthermore, the lack of a mandate beyond development work will make it difficult to work with operations teams to establish consistent toolchains for deployments, logging, monitoring, and alerting.

In short, Dr. Eli Goldratt was right when he said in The Goal “if it comes from cost accounting, it must be wrong”.

Further Reading

  1. The Principles Of Product Development Flow by Don Reinertsen
  2. Measuring Continuous Delivery by the author
  3. Lean Enterprise by Jez Humble, Joanne Molesky, and Barry O’Reilly
  4. Utility vs. Strategic Dichotomy by Martin Fowler
  5. No Projects by the author

Acknowledgements

Thanks to Thierry de Pauw for his feedback.

Deployment pipeline design and the Theory Of Constraints

How should you design a deployment pipeline? Short and wide, long and thin, or something else? Can you use a Theory Of Constraints lens to explain why pipeline flexibility is more important than any particular pipeline design?

TL;DR:

  • Past advice from the Continuous Delivery community to favour short and wide deployment pipelines over long and thin pipelines was flawed
  • Parallelising activities between code commit and production in a short and wide deployment pipeline is unlikely to achieve a target lead time, as the parallel activities remain limited by the constrained activity
  • Flexible pipelines allow for experimentation until a Goldilocks deployment pipeline can be found, which makes Continuous Delivery easier to implement

Introduction

The Deployment Pipeline pattern is at the heart of Continuous Delivery. A deployment pipeline is a pull-based automated toolchain, used from code commit to production. The design of a deployment pipeline should be aligned with Conway’s Law, and a model of the underlying technology value stream. In other words, it should encompass the build, testing, and operational activities required to launch new product ideas to customers. The exact tools used are of little consequence.

Advice on deployment pipeline design has remained largely unchanged since 2010, when Jez Humble recommended “make your pipeline wide, not long… and parallelise each stage as much as you can“. A long and thin deployment pipeline of sequential activities is easy to reason about, but in theory parallelising activities between build and production will shorten lead times, and accelerate feedback loops. The trade-off is an increase in toolchain complexity and coordination costs between different teams participating in the technology value stream.

For example, imagine a technology value stream with sequential activities for automated acceptance tests, exploratory testing, and manual performance testing. This could be modelled as a long and thin deployment pipeline.

If those testing activities could be run in parallel, the long and thin deployment pipeline could be re-designed as a short and wide deployment pipeline.

Since 2010, people in the Continuous Delivery community – including the author – have periodically recommended short and wide deployment pipelines over long and thin pipelines. That advice was flawed.

The Theory Of Constraints, Applied

The Theory Of Constraints is a management paradigm by Dr. Eli Goldratt for improving organisational throughput in a homogeneous workflow. A constraint is any resource with capacity equal to or less than market demand, and its level of utilisation will limit the utilisation of other resources. The aim is to iteratively increase the capacity of a constraint, until the flow of items can be balanced according to demand. The Theory Of Constraints is applicable to Continuous Delivery, as a technology value stream should be a homogeneous workflow that is as deterministic and invariable as possible.

When a delivery team is in a state of Discontinuous Delivery, its technology value stream will contain a constrained activity with a duration less than the current lead time, but too large for the target lead time. Its duration might exceed the target lead time outright, or simply be the largest duration of all the activities. A short and wide deployment pipeline will not be able to meet the target lead time, as the duration of the parallel activities will be limited by the constrained activity.

In the above example, assume the current lead time is 14 days, and manual performance testing takes 12 days as it involves end-to-end performance testing with a third party.

Assume customer demand results in a target lead time of 7 days. This means the delivery team are in a state of Discontinuous Delivery, and a long and thin deployment pipeline would be unable to meet that target.

A short and wide deployment pipeline would also be unable to achieve the target lead time. The parallel testing activities would be limited by the 12 days of manual performance testing, and future release candidates would queue before the constrained activity. An obvious countermeasure would be for some release candidates to skip manual performance testing, but that would increase the risk of production incidents.
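
To see why, a back-of-the-envelope model helps. The sketch below compares the two pipeline designs for the example value stream; the acceptance and exploratory testing durations are assumptions, chosen so the sequential total matches the 14 day current lead time, while the 12 day performance testing figure and the 7 day target come from the example.

```python
# Minimal sketch: the lead time of a deployment pipeline is bounded by its
# constrained activity. Durations are in days; the acceptance and exploratory
# testing figures are illustrative assumptions.
activities = {
    "automated acceptance tests": 1,
    "exploratory testing": 1,
    "manual performance testing": 12,  # the constrained activity
}
target_lead_time = 7

# Long and thin: the activities run sequentially.
long_and_thin = sum(activities.values())   # 14 days

# Short and wide: the activities run in parallel, so the slowest one dominates.
short_and_wide = max(activities.values())  # 12 days

print(f"long and thin:  {long_and_thin} days")
print(f"short and wide: {short_and_wide} days")
print(f"target of {target_lead_time} days met: {short_and_wide <= target_lead_time}")  # False
```

Either design misses the 7 day target, because neither changes the 12 days spent in the constrained activity.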

This means long and thin vs. short and wide deployment pipelines is a false dichotomy.

Pipeline Design and The Theory Of Constraints

In The Goal, Dr. Eli Goldratt describes the Theory Of Constraints as an iterative cycle known as the Five Focussing Steps: identify a constraint, reduce its wasted capacity, regulate its item arrival rate, increase its capacity, and then repeat.

If the activities in a technology value stream can be re-sequenced, re-designing a deployment pipeline is one way to reduce wasted time at a constrained activity, and regulate the arrival of release candidates. Pipeline flexibility is more important than any particular pipeline design, as it enables experimentation until a Goldilocks deployment pipeline can be found.

It is tempting to make the constrained activity the first activity after release candidate creation, as that would reduce subsequent release candidate queues and statistical fluctuations in unconstrained activities. However, constraint time should never be wasted on items with knowable defects, and most activities in a deployment pipeline are testing activities.

One Goldilocks deployment pipeline design is for all unconstrained testing activities to be parallelised before the constrained activity. This should be combined with other experiments to save constraint time, and regulate the flow of release candidates to minimise queues and statistical fluctuations. Such a pipeline design will make it easier for delivery teams to successfully implement Continuous Delivery.

In the above example, assume the short and wide deployment pipeline can be re-designed so manual performance testing occurs after the other parallelised testing activities. This ensures release candidates with knowable defects are rejected prior to performance testing, which saves 1 day in queue time per release candidate. End-to-end performance testing scenarios are gradually replaced with stubbed performance tests and contract tests, which saves 6 days and means the target lead time can be accomplished. 

If there is no constrained activity in a technology value stream, the delivery team is in a state of Continuous Delivery and a constraint will exist either upstream in product development or downstream in customer marketing. Further deployment pipeline improvements such as automated filtering of test scenarios could increase the speed of release candidate feedback, but the priority should be tackling the external constraint if product cycle time from ideation to customer is to be improved.

Acknowledgements

Thanks to Thierry de Pauw for his feedback on this article.

Continuous Delivery and the Theory Of Constraints

How should you actually implement Continuous Delivery?

Adopting Continuous Delivery takes time. You have a long list of technology and organisational changes to consider. You have to work within the unique circumstances of your organisation. You’re constantly surrounded by strange problems, half-baked theories, off-the-shelf solutions that just don’t work, and people telling you Amazon is nothing to worry about.

How do you identify and remove the major impediments in your build, testing, and operational activities? How do you avoid spending weeks, months, or years on far-reaching changes that ultimately have no impact on your time to market?

TL;DR:

  • Continuous Delivery means applying technology and organisational changes to the unique circumstances of an organisation
  • If a Continuous Delivery programme does not focus on the activities with the most rework and/or queue times, there is a high probability of sub-optimal outcomes
  • The Theory Of Constraints is a management paradigm for improving organisational throughput, while simultaneously decreasing both inventory and operating expense.
  • The Theory Of Constraints can be applied to Continuous Delivery, as the build, testing, and operational activities in a technology value stream should be homogeneous
  • The Five Focussing Steps can be used to identify constrained activities, and then introduce the necessary technology and organisational changes to reduce rework and/or queue times

Introduction

Technology can bring benefits if, and only if, it diminishes a limitation – Dr. Eli Goldratt

An organisation will have one or more technology value streams. Each one will represent a sequence of activities that converts business ideas into value-adding software for customers. The first part of a technology value stream will be design and development activities, which are inherently non-deterministic and highly variable. The second part will be build, testing, and operational activities, which should be as deterministic and invariable as possible. When an organisation lacks the necessary stability and throughput in its technology value streams to satisfy market demand, it is in a state of Discontinuous Delivery.

Continuous Delivery is a set of holistic principles and practices to improve the stability and throughput of IT delivery. It needs a substantial investment in technology changes:

  • Version Control: put code, configuration, infrastructure definitions, migration scripts, etc. in version control to preserve change history
  • Environments: automate self-service, on-demand provisioning of short-lived test environments, and incremental release patterns such as Canary Deployments
  • Development: use Test Driven Development with Pair-Programming to build quality into a codebase, and use Trunk Based Development with Continuous Integration to ensure it is always releasable
  • Architecture: establish an Evolutionary Architecture to encourage loosely-coupled, independently deployable services
  • Testing: automate parallelisable acceptance tests with dynamic test data, use Smoke Testing to validate deployments, and use Exploratory Testing to uncover new information
  • Operability: aggregate logs and metrics to create a centralised telemetry platform for visualising normal conditions, and alerting on abnormal conditions

It also requires a significant amount of organisational changes:

  • Batch Size: reduce features per release candidate to decrease variability, feedback delays, risk, overheads, inefficiencies, and lead time via Little’s Law (a worked example follows this list)
  • Management: devolve decision making to the employees closest to the work to empower better choices in design, testing etc.
  • Culture: grow a performance-oriented culture of cooperation and collaboration to increase information flow between teams
  • Responsibilities: change responsibilities for testing and on-call production support to encourage a shared sense of accountability
  • Risk: use continuous code reviews via Pair-Programming and traceability of production changes to replace Risk Management Theatre with adaptive risk assessment
  • Skills: invest in employee training to ease the introduction of new practices and tools e.g. infrastructure automation
  • Structures: convert siloed functional teams into cross-functional teams, and align team and software architecture with Conway’s Law to enable faster delivery
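
The Batch Size point can be made concrete with Little’s Law, which states that average lead time equals average work in progress divided by average throughput. The sketch below uses illustrative numbers rather than data from any real team.

```python
# Minimal sketch of Little's Law: average lead time = work in progress / throughput.
# The numbers are illustrative assumptions, not measurements.

def average_lead_time(work_in_progress: float, throughput_per_week: float) -> float:
    """Average lead time in weeks, per Little's Law."""
    return work_in_progress / throughput_per_week

# A large batch: 40 features in flight, 10 features completed per week.
print(average_lead_time(40, 10))  # 4.0 weeks

# Halving the work in progress halves the average lead time, without anyone working faster.
print(average_lead_time(20, 10))  # 2.0 weeks
```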

These changes are challenging in isolation, but it is their application to the unique circumstances and constraints of an organisation that makes Continuous Delivery so difficult. Adopting Continuous Delivery means implementing widespread changes in the highly uncertain conditions of a complex, adaptive system, in which behaviours are emergent and interactions are unpredictable.

A Continuous Delivery programme should aim to reduce rework and queue times in a technology value stream, as they are the most common sources of waste. Rework is any activity that must be repeated due to a failure, such as a tester being given a defective release candidate. Queue time is time spent queuing for an activity, such as a deployment waiting for a tester to become available. There are many potential causes of rework and queue time, including the following:

  • Snowflake environments – manual environment provisioning
  • Brittle deployments – unreliable code, configuration, or infrastructure changes
  • Environment contention – slow access to test data, services, or environments
  • End-To-End Testing – reams of slow, non-deterministic tests
  • Rigid architecture – application deployments coupled together
  • Excessive toil – manual, repetitive build, testing, and operations tasks

Over the years, the advice on how to implement Continuous Delivery has evolved. The Continuous Delivery book suggested using the Plan-Do-Check-Act Cycle to experiment with changes, and implement a maturity model. Lean Enterprise recommended the Improvement Kata be used to create a Continuous Improvement framework, with Plan-Do-Check-Act used to structure different experiments. This can be very effective, but there needs to be a way to focus on the activities where a reduction in rework and/or queue times will have an immediate, positive impact. Otherwise, there is a high probability of sub-optimal outcomes, stakeholder dissatisfaction, and Discontinuous Delivery remaining the norm.

Given sufficient Continuous Delivery expertise, heuristics can be used to locate those activities in a technology value stream. An alternative method is needed when those heuristics are unavailable, or insufficient to overcome resistance to change.

Example – Network Activation

Think of an 18 month Continuous Delivery programme for a multi-team, multi-application Network Activation platform. The delivery teams already use Pair-Programming, Test-Driven Development, Trunk Based Development, and Continuous Integration. The on-prem technology value stream has 2 pre-production activities – Functional Test and Integration Test – and a production deployment involves 10 semi-automated and manual tasks. Problems include snowflake environments, brittle deployments, and End-To-End Testing. There is no data on deployments, beyond an average lead time to production of 34 days. The product owner wants it to be 14 days.

Over 18 months there is a concerted effort to create a consistent deployment pipeline, with the Cost of Delay used to prioritise pipeline features. Aggregate Artifacts and Artifact Containers are used to automate deployments, infrastructure provisioning errors are eliminated, and pipeline dashboards are built. Verify Branch By Abstraction is used to incrementally decouple applications, by replacing object serialisation with RESTful APIs. API Examples and Consumer Driven Contracts are used to verify application interactions. There are also a series of organisational changes to policies, processes, and teams, to reduce queue time for Functional Test and Integration Test.

These changes produce many benefits, including a 95% build time improvement from 1 hour to 3 minutes, and a 29% improvement in test deployment time from 3.5 hours to 2.5 hours. However, there is no impact whatsoever on the average lead time of 34 days. On that basis, the Continuous Delivery programme is unsuccessful.

The Theory Of Constraints

In The Goal, Dr. Eli Goldratt sets out the Theory Of Constraints. The Theory Of Constraints is a management paradigm for improving organisational throughput, by optimising a small number of constraints in a homogeneous workflow. It states the goal of an organisation is to make money, by increasing throughput while simultaneously decreasing both inventory and operating expense.

A constraint is any resource with capacity equal to or less than market demand, and it will limit the throughput of the organisation. A constraint is internal when market demand is greater than throughput, and external when throughput is greater than market demand. An hour lost at a constraint is an hour lost for the whole organisation.

A non-constraint is any resource with capacity greater than market demand. The level of utilisation of a non-constraint is determined by a constraint elsewhere. An increase in non-constraint capacity is a waste of time and money, as it can only increase queued work before a constraint and cannot reduce work starvation after a constraint. An hour lost at a non-constraint is an illusion.

The central premise of the Theory Of Constraints is to iteratively increase the capacity of a constraint, until the flow of work items can be balanced according to market demand. The Five Focussing Steps are used to achieve this:

  1. Identify the constraint(s): calculate market demand for a product, and which activities have capacity less than or equal to demand
  2. Exploit the constraint(s): increase constraint utilisation by eliminating idle time, and time wasted on lower priority and/or defective work items
  3. Subordinate everything else to the constraint(s): protect a constraint from excessive inventory or work starvation by maintaining a buffer of materials, and regulating the arrival of new work items into the system based on constraint capacity
  4. Elevate the constraint(s): offload constraint work items to a non-constraint by investing in additional equipment, people, and/or allocation of work items to third parties
  5. If a constraint has been broken, go back to step 1, but do not allow inertia to cause a constraint

Continuous Delivery and the Theory Of Constraints

The Theory Of Constraints can be applied to Continuous Delivery, as the build, testing, and operations activities in a technology value stream should be homogeneous. There will be one, or at most a few constrained activities, caused by rework and/or queue times. The Five Focussing Steps can be used to identify and optimise the constraint(s), until the flow of release candidates is balanced from mainline commit to production and Continuous Delivery is achieved. At that point, an external constraint will emerge upstream in product development or downstream in sales. If market demand increases later on, a new internal constraint might surface within the technology value stream.

Identify the constraint

A constraint is identified by establishing the market demand for a product, and then calculating how much time each activity contributes to meeting that demand. Overall market demand should be specified by the product owner as a target Lead Time LT, measured from mainline commit to production launch.

LT = X minutes, hours, or days

The ability of an activity n to meet LT should be expressed as its Activity Time AT(n). It can be measured as the median of the difference between its finish time AFT(n) and the finish time of the prior activity AFT(n-1), for all occurrences of activity n in a time period t. This measurement ensures queue time and process time are both accounted for. The measurement unit should be minutes, hours, or days, based on LT. It might be necessary to measure variability in Activity Time as well, if one or more activities exhibit an undesirable amount of variability.

AT(n) = median( AFT(n) - AFT(n-1) ), for all occurrences of activity n in time period t

If an activity has AT(n) greater than LT, it is a constraint. If a high percentage of release candidates consistently fail to pass the activity on the first attempt, the activity is unstable and rework must be reduced. Alternatively, if release candidates are regularly delayed before starting the activity, it is too slow to begin and queue time must be reduced.

If no activity has AT(n) greater than LT, there is no internal constraint. This means LT is not ambitious enough, or there is an external constraint outside the technology value stream.
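
A minimal sketch of this calculation is below, assuming activity finish times have already been extracted from deployment pipeline data. The activity names, finish times, and 14 day target are illustrative, loosely modelled on the Network Activation example that follows.

```python
from statistics import median

# Minimal sketch: identify constrained activities from activity finish times.
# Each release candidate records the finish time of each activity, in days
# since mainline commit. All values are illustrative assumptions.
release_candidates = [
    {"build": 0.1, "functional test": 3.0, "integration test": 6.0, "production": 22.0},
    {"build": 0.2, "functional test": 4.0, "integration test": 8.0, "production": 24.0},
    {"build": 0.1, "functional test": 2.5, "integration test": 7.0, "production": 21.0},
]
activity_order = ["build", "functional test", "integration test", "production"]
target_lead_time = 14  # days, set by the product owner

for i, activity in enumerate(activity_order):
    # AT(n) = median of (own finish time - prior activity finish time), which
    # accounts for both queue time and process time.
    samples = []
    for rc in release_candidates:
        prior_finish = rc[activity_order[i - 1]] if i > 0 else 0.0
        samples.append(rc[activity] - prior_finish)
    activity_time = median(samples)
    flag = "CONSTRAINT" if activity_time > target_lead_time else "ok"
    print(f"{activity:20s} AT = {activity_time:4.1f} days  {flag}")
```

In this illustration only the production activity exceeds the 14 day target, so it is the constraint.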

Applying the Theory Of Constraints to the Network Activation platform with its target lead time of 14 days reveals a constraint. Visualising each activity as queue time and process time shows Production queue time was often greater than 14 days. This was caused by a well-intentioned release management policy of deploying release candidates into production when ready, and then waiting for a launch date pre-agreed with an upstream billing provider months earlier. Launch dates were delayed when testing finished late, but never brought forward when testing finished early. 18 months of hard work reduced variability, Functional Test queue time, and Integration Test queue time, but those improvements merely led to an increase in Production queue time until launch dates occurred.

Optimise the constraint

A constraint is optimised by experimenting with the full range of technology and organisational changes recommended by Continuous Delivery. There needs to be a concerted effort to reduce the average and variation in Activity Time for the constrained activity, until the target Lead Time can be satisfied.

Exploiting a constrained activity means ensuring it is always working, and always doing valuable work. This aligns with Continuous Delivery emphasising the automation of repetitive tasks to free up people for higher-order problems. Automated infrastructure provisioning, automated unit and acceptance testing, and moving to an Evolutionary Architecture are all examples of technology changes to reduce time spent at a constrained activity. The most effective organisational change is to reduce batch sizes, as a smaller batch will have shorter process times and queue times.

Subordinating unconstrained activities means limiting the flow of mainline commits, builds, and deployments to match the constrained activity. This is best accomplished with Work In Process (WIP) Limits, which encourage people to collaborate on a few work items at any point in time. Stop The Line teaches people to prioritise a releasable codebase over developing more features, and eXtreme Programming practices such as Test-Driven Development and Pair-Programming can also foster a shared team cadence.
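
One way to picture that subordination is a WIP limit that blocks new release candidates when too much work is already queued at the constrained activity. The limit of two and the queue representation in the sketch below are illustrative assumptions, not the behaviour of any particular tool.

```python
from collections import deque

# Minimal sketch: subordinate release candidate creation to the constrained
# activity by enforcing a WIP limit on its queue.
CONSTRAINT_WIP_LIMIT = 2  # assumed limit on candidates awaiting the constraint
constraint_queue = deque()

def try_create_release_candidate(name: str) -> bool:
    """Only admit a new release candidate if the constraint has spare capacity."""
    if len(constraint_queue) >= CONSTRAINT_WIP_LIMIT:
        print(f"{name}: blocked - swarm on the queued work instead of starting more")
        return False
    constraint_queue.append(name)
    print(f"{name}: admitted, constraint queue is now {list(constraint_queue)}")
    return True

for rc in ["RC-101", "RC-102", "RC-103"]:
    try_create_release_candidate(rc)  # RC-103 is blocked by the WIP limit
```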

Elevating a constrained activity means investing in people and tools to increase its capacity. This might be achieved with technology changes, such as a move to short-lived test environments via fully automated infrastructure provisioning. It might also involve organisational changes such as paying third party suppliers for more test data, hiring more engineers, or running training courses to improve Staff Liquidity.

For instance, an assortment of changes should be tried if End-To-End Testing activity is a constraint. Time wasted should be eliminated by ensuring testers are uninterruptible, making more testers available, and running testing slots 24 hours a day. Defective release candidates should be minimised by moving all other testing activities in front of End-To-End Testing, adding automated contract tests, and rejecting release candidates with one or more test failures. Over time the end-to-end tests should be incrementally replaced with automated acceptance tests, monitoring dashboards, and automated anomaly detection.

Example – MediaTech

Consider a 7 month Continuous Delivery programme for a multi-team, multi-application MediaTech platform. The delivery teams use Trunk Based Development with some Pair Programming, but there is little Test-Driven Development. The on-prem technology value stream has 2 pre-production activities – Functional Test and Release Test – and a production deployment involves 60 semi-automated and manual tasks. Problems include snowflake environments, brittle deployments, environment contention, a rigid architecture, End-To-End Testing, and excessive toil. There is no data on deployments, beyond an average interval of 3 weeks. The product owner has set a target lead time of 1 week, at an interval of 1 week.

The range of problems at MediaTech makes Continuous Delivery very challenging. There is a proposal to create a fully automated deployment pipeline, with infrastructure provisioning, deployments, and telemetry. Other suggested options include the promotion of Test-Driven Development, implementing zero downtime deployments, and immutable release candidates. However, there are concerns about the time needed to make such changes.

After 2 months, deployment data is scraped from chat channels, and applying the Five Focussing Steps shows sweeping changes are not immediately required. The constraint on a target lead time of 1 week is Release Test process time, due to perennial environment instability caused by configuration and test data issues. Interestingly, there is no obvious constraint on a target lead time of 2 weeks. As a result, the product owner agrees to a target lead time of 2 weeks at an interval of 2 weeks, as an intermediate milestone.

In the Theory Of Constraints, a lack of skilled people can constitute a constraint. When the above data is shared with the MediaTech operations team, they reveal an unseen constraint on the 14 day target lead time. There is a single release planner, who endures a heavy workload for each production deployment on top of their day to day work. Everyone agrees the Continuous Delivery programme needs to focus on automating the release planning tasks, and improving the existing automated tests.

Over the next few months, the release planning workload is reduced and the automated tests are stabilised. Callout rota emails by the release planner are replaced by a shared rota, co-owned by the delivery teams. A release note process overseen by the release planner involving 3 different teams is replaced by a fully automated release note tool. Furthermore, end-to-end tests in Functional Test and Release Test are prioritised by their non-determinism, and gradually rewritten as build-time acceptance tests. The target lead time of 2 weeks at an interval of 2 weeks is successfully achieved, and after several production deployments the product owner decides the new throughput is sufficient. The Continuous Delivery programme is considered a success, and without a fully automated deployment pipeline.

Further Reading

  1. Beyond The Goal by Dr. Eli Goldratt
  2. Measuring Continuous Delivery by the author
  3. Resilience as a Continuous Delivery Enabler by the author

Acknowledgements

Thanks to Thierry de Pauw for his feedback on this article.

Resilience as a Continuous Delivery enabler

Why does optimising for robustness leave organisations in a state of Discontinuous Delivery, and vulnerable to failure? How does optimising for resilience improve reliability, and how can it encourage the adoption of Continuous Delivery?

The Resilience as a Continuous Delivery Enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

TL;DR:

  • Optimising For Robustness – prioritising MTBF over MTTR – is an antiquated, flawed approach to IT reliability that results in Discontinuous Delivery and an operational brittleness that begets failure
  • If an organisation has previously optimised for robustness, a Continuous Delivery programme focussed on throughput is unlikely to succeed
  • Optimising For Resilience – prioritising MTTR over MTBF – is a superior reliability strategy that enables an organisation to gracefully extend to limit the impact of failures, and position itself for sustained adaptability
  • Resilience As A Continuous Delivery Enabler is a heuristic that advocates resilience as the focus of a Continuous Delivery programme
  • Improving the resilience of services makes it easier to reduce Risk Management Theatre, and gradually adopt Continuous Delivery

The tradition of robustness

As software continues to eat the world, organisations must have reliable IT services at the heart of their business if they are to innovate in rapidly changing markets. Reliability is defined by Patrick O’Connor and Andre Kleyner in Practical Reliability Engineering as “the probability that [a system] will perform a required function without failure under stated conditions for a stated period of time“, or as a function of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR).

The traditional IT reliability strategy is Optimising For Robustness. This means prioritising a higher MTBF over a lower MTTR for IT services, by attempting to maintain a failure-free production environment. It is based on the belief that a production environment is a complicated system, in which services are homogeneous processes with predictable interactions in repeatable conditions. Failures are believed to be caused by isolated, faulty changes and are considered entirely preventable. When an organisation optimises for robustness, it will usually rely upon:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time due to market conditions

These practices are inherently slow, and a form of Risk Management Theatre 1. End-To-End Testing incurs long execution times and significant maintenance time, and defects can still occur. Change Advisory Boards involve slow approval times, and deployments can still fail. Change Freezes cause huge productivity impediments, and failures can still happen. In addition, the long deployment lead times caused by robustness practices ensure a large batch of requirements and technology changes per release, which actually increases the risk of failure 2.

Optimising For Robustness constrains the stability and throughput of IT delivery such that business demand cannot be satisfied. It is the predominant reason why so many organisations are trapped in a state of Discontinuous Delivery.

The constancy of failure

Ironically, Optimising For Robustness leaves an organisation ill-equipped to deal with failure. In Resilience and Precarious Success, Mary Patterson and Robert Wears describe how “fundamental goals (such as safety) tend to be sacrificed with increasing pressure to achieve acute goals (faster, better, and cheaper)“. When an organisation optimises for robustness it will under-invest in its production environment, resulting in unimplemented “non-functional” requirements, inadequate telemetry 3, snowflake infrastructure, and a fragile service architecture. This will be considered acceptable, as failures are expected to be rare.

However, it is naive to think of a production environment of running services as a complicated system. A production environment is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect. Furthermore, as Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws”. A production environment is perpetually in a state of near-failure.

A failure occurs when multiple faults unexpectedly coalesce such that one or more business operations cannot succeed. It will create a revenue cost expressed as a function of cost per unit time and duration, and in an organisation optimised for robustness the impact can be considerable. The sunk cost incurred until failure detection can be high, as unimplemented “non-functional” requirements and inadequate telemetry will restrict situational awareness. The opportunity cost until failure resolution can also be high, as snowflake infrastructure and a fragile architecture will increase failure blast radius. In addition, the loss of customer confidence and increased failure demand will create further opportunity costs.

Consider a Fruits-U-Like website optimised for robustness. Its third party registration service begins to suffer under load, and new customers are rejected on checkout. The failure has a static cost per day of £80k, but with no telemetry the failure is not detected for 3 days. The checkout team then produces a hotfix within a day, and it is deployed the following day. The revenue cost is £400K, with a £240K sunk cost and a £160K opportunity cost.
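
The arithmetic behind those figures is a simple model of failure costs: cost per unit time multiplied by duration, split into a sunk cost up to detection and an opportunity cost from detection until resolution. The sketch below reproduces the Fruits-U-Like numbers; the function itself is an illustrative model, not an accounting standard.

```python
# Minimal sketch of the failure cost model used in the example above.

def failure_costs(cost_per_day: float, detection_days: float, resolution_days: float):
    sunk = cost_per_day * detection_days          # cost incurred before detection
    opportunity = cost_per_day * resolution_days  # cost incurred until resolution
    return sunk, opportunity, sunk + opportunity

# Fruits-U-Like optimised for robustness: £80K per day, detected after 3 days,
# then 1 day to produce a hotfix and 1 day to deploy it.
sunk, opportunity, total = failure_costs(80_000, detection_days=3, resolution_days=2)
print(f"sunk £{sunk:,.0f}, opportunity £{opportunity:,.0f}, revenue cost £{total:,.0f}")
# sunk £240,000, opportunity £160,000, revenue cost £400,000
```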

Optimising For Robustness encourages an attitude Sidney Dekker calls the Bad Apple Theory, in which a system is considered absolutely reliable except for the actions of unreliable employees. When a failure occurs, the combination of the Bad Apple Theory and hindsight bias will produce an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages knowledge sharing and collaboration.

An interesting consequence of Optimising For Robustness is Dual Value Streams. An organisation optimised for robustness will have feature value streams with deployment lead times of weeks or months. When a failure is detected its sunk cost will create urgency, and people will want to immediately minimise the opportunity cost duration. That will lead to robustness practices being sacrificed for speed, in a truncated fix value stream with an MTTR of hours or days 4. The robustness practices omitted from the fix value stream should be considered theatre until proven otherwise.


Continuous Delivery improves the stability and throughput of IT delivery, but it is hard. A Continuous Delivery programme in an organisation optimised for robustness will not succeed if it is focussed solely on throughput. The most significant accelerator of deployment lead time will likely be the removal of robustness risk management theatre, but practices like End-To-End Testing will be woven into the fabric of the organisation 5. If they are forcibly removed, Continuous Delivery will be blamed for the first subsequent production failure. Resisters will lobby for more robustness practices, and a return to the status quo is all but inevitable. Unfortunately, it only takes one inopportune failure for a Continuous Delivery programme to be cancelled.

The value of resilience

A far more effective reliability strategy is Optimising For Resilience. This means prioritising a lower MTTR over a higher MTBF for IT services, by rapidly responding to failures in a production environment. Some classes of failure should never occur, some failures are more costly than others, and some safety-critical systems should never fail, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often“.

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries“. The graceful extensibility of a system is derived from its adaptive capacity, which represents the capacity for adaptation when a failure occurs.

Erik Hollnagel et al break down resilience in Resilience Engineering In Practice using a conceptual model known as the Four Cornerstones of Resilience:

The cornerstones are non-linear, complementary aspects of resilience:

  • Anticipation is imagining the potential for future failures, and countering those scenarios in advance
  • Monitoring is inspecting operating conditions, and alerting when anomalies occur
  • Response is using guidelines, heuristics, improvisation, and situational awareness to mitigate a failure
  • Learning is understanding the circumstances of a near-miss or failure, and sharing the observations

Optimising For Resilience means creating a production environment in which running IT services can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur. When a service has sufficient adaptive capacity the cost per unit time and duration of production failures can potentially be minimised, reducing the direct revenue costs and indirect opportunity costs caused by a failure.

A lower MTTR can be achieved by investing in the operability of IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition“, and is associated with a set of practices:

Each of these will increase the capacity of a service to adapt to unexpected operating conditions, and produce a more effective incident response:

  • Development: an Adaptive Architecture limits the blast radius of a failure, and Feature Toggles allow features to be limited, tested in isolation, or turned off on failure
  • Testing: Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production
  • Infrastructure: Automated Provisioning creates reproducible environments, and Self-Healing automatically restores failed service instances
  • Telemetry: Logging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly detection identifies events that breach normal operating conditions, and Alerting notifies operators of abnormalities to act on (a minimal alert-rule sketch follows this list). User analytics show success rates for user journeys
  • People: Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge. Blameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias 6
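
As an illustration of the Telemetry practices above, the sketch below shows a simple threshold-based anomaly check and alert. The metric names, thresholds, and notification function are assumptions for illustration only; a real platform would plug into its own monitoring and alerting tooling.

```python
# Minimal sketch: alert when a service metric breaches its normal operating
# conditions. Metric names and thresholds are illustrative assumptions.
NORMAL_CONDITIONS = {
    "checkout_error_rate": 0.01,   # alert if more than 1% of requests fail
    "registration_p99_ms": 800.0,  # alert if p99 latency exceeds 800ms
}

def notify_on_call(message: str) -> None:
    # Stand-in for a real alerting integration (pager, chat, and so on).
    print(f"ALERT: {message}")

def check_for_anomalies(observed: dict) -> None:
    for metric, threshold in NORMAL_CONDITIONS.items():
        value = observed.get(metric)
        if value is not None and value > threshold:
            notify_on_call(f"{metric} is {value}, above normal threshold {threshold}")

# Example readings scraped from a telemetry platform.
check_for_anomalies({"checkout_error_rate": 0.12, "registration_p99_ms": 650.0})
```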

If Fruits-U-Like was optimised for resilience its checkout team could receive an alert within 5 minutes of third party registration errors. A Circuit Breaker would allow some registrations to succeed, and a Bulkhead could trigger an anonymous checkout for failed registrations. This could decrease the cost per day to £5K, and a hotfix could be deployed within 3 hours. The revenue cost would be £625, with a £18 sunk cost and a £607 opportunity cost.

Optimising For Resilience sets the foundation for an organisation to act on market disruption and innovate. Once an organisation has the required level of graceful extensibility, it can continue to invest in its people and technology to achieve sustained adaptability. Sustained adaptability has been described by David Woods as “the ability to adapt to future surprises as conditions continue to evolve“, and can be thought of as innovation capability. An organisation that can quickly adapt to unexpected business events will hold a powerful First Mover Advantage over its competitors.

Resilience as a Continuous Delivery enabler

There is no recipe for success with Continuous Delivery, as every organisation is a complex, adaptive system with its own circumstances and constraints. However, if an organisation has previously optimised for robustness and is in a state of Discontinuous Delivery there is a heuristic that can be used:

Resilience as a Continuous Delivery enabler

This can be applied to bootstrapping Continuous Delivery:

This bootstrap sequence can guide the formative steps of a Continuous Delivery programme, and build confidence throughout an organisation. It demonstrates a commitment to stability, transparency, and reliability which will help to win over resisters. Storing all code, configuration, infrastructure definitions, documents, scripts etc. in version control eliminates the predominant source of failure demand. Creating stability and throughput indicators helps people to understand their delivery capabilities, and make better decisions 7.

Improving production reliability minimises the cost of failure, and lays the groundwork for challenging robustness risk management theatre later on. Automated anomaly detection and alerting will speed up the detection time of an anticipated failure, reducing its sunk cost duration to seconds or minutes. An adaptive architecture will limit the blast radius of a failure, decreasing both cost per unit time and duration.

Implementing production telemetry early on also provides insurance for unsafe-to-fail situations. Logging, monitoring, and analytics dashboards can identify the contributing technical faults to a failure, and when they first entered production. If resisters blame Continuous Delivery for a failure, the data will pinpoint which faults were recent and which were lying dormant in production beforehand.

Once the Continuous Delivery programme reaches the experimentation phase, other sources of adaptive capacity can be created with operability practices such as Capacity Planning, Self-Healing, Shared On-Call, and Blameless Post-Mortems. At the same time, the programme should widen its focus to include deployment throughput as well as deployment stability and production resilience.

The end of theatre

The key to removing robustness risk management theatre is to visualise its costs to stakeholders and offer a practical alternative, rather than rely on theoretical arguments about wait times or defect discovery rates. Using the Resilience As A Continuous Delivery Enabler heuristic ensures a Continuous Delivery programme can supply those visualisations, and outline an alternative approach from the outset.

Stakeholders should be made aware of their robustness risk management theatre with a showcase of the delivery awareness and production reliability improvements so far. The stability and throughput indicators will illustrate the historical cost of robustness practices, by visualising the disparity between deployment lead times and MTTR in the Dual Value Streams. Some carefully calibrated Chaos Engineering in a test environment 8 will demonstrate how MTTR has been shrunk to minutes or hours, by showing how failures can be managed with the new production telemetry and adaptive architecture. An MTTR an order of magnitude faster than deployment lead times will show stakeholders what a team can accomplish with minimal robustness practices.

Each robustness practice subsequently agreed to be risk management theatre should be incrementally replaced with the appropriate mix of Continuous Delivery and operability practices. End-To-End Testing should be superseded by a multi-faceted testing portfolio, in order to turn the resident testing strategy from a Test Ice Cream Cone into a Test Pyramid. This will reduce test execution times and maintenance costs, while simultaneously improving defect discovery rates:

Practice | Quantity | Frequency | Duration | Environment
Unit Testing | 100 to 1000+ | Per build | < 30s total | Local and Build
Acceptance Testing | 10 to 100+ | Per build | < 10m total | Local and Build
Exploratory Testing | 10 to 100+ | Per build | Timebox | Local and 3rd Party
Contract Testing | ~20 | Per 3rd party deploy | < 1m | 3rd Party
Smoke Testing | ~5 | Per deploy | < 5m | All
Monitoring | 10 to 100+ | Always | < 10s | All
Anomaly detection | 10 to 100+ | < 1m | < 10s | All
Adaptive architecture | N/A | Always | N/A | All

Change Advisory Boards and Change Freezes should end in favour of incremental deployments and incremental launches. Blue Green Deployments and Canary Deployments gradually direct users to a newly deployed service version, and users can be redirected to the old version on service failure. Dark Launching controls feature rollouts based on user demographics, and services can be operated in a degraded state on feature failure. Lightweight change management conversations should be reserved for unavoidably large releases, or turbulent market conditions.
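
A sketch of the incremental launch idea is below. The canary weight, version labels, and health check are illustrative assumptions; in practice a load balancer, service mesh, or feature flagging system would perform the routing.

```python
import random

# Minimal sketch of a canary deployment: send a small share of traffic to the
# new version, and fall back to the old version if the canary looks unhealthy.
CANARY_WEIGHT = 0.10  # assumed starting share of users on the new version

def choose_version(canary_healthy: bool) -> str:
    if not canary_healthy:
        return "v1"  # redirect everyone to the old version on failure
    return "v2" if random.random() < CANARY_WEIGHT else "v1"

# Simulate routing 10,000 requests while the canary is healthy.
routed = [choose_version(canary_healthy=True) for _ in range(10_000)]
print(f"share on v2: {routed.count('v2') / len(routed):.1%}")  # roughly 10%

# On a failed health check, all traffic returns to the old version.
print(choose_version(canary_healthy=False))  # v1
```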

Summary

Optimising For Robustness is an antiquated, flawed approach to IT reliability that results in long-term Discontinuous Delivery and an operational brittleness that begets failure. As John Allspaw has stated, reliability is “the presence of adaptive capacity, not the absence of failures“. Robustness is of value, but it must be rejected as an outcome if an organisation wants to innovate in changing markets.

Optimising For Resilience is a superior reliability strategy that enables an organisation to gracefully extend to limit the impact of failures, and position itself for sustained adaptability. It is a paradigm shift, in which people need to accept the inherent complexity within their IT services and the hard truth that failures are inevitable. This is neatly summarised by David Woods’ assertion that “graceful extensibility trades off with robust optimality“. An organisation optimised for robustness will reject sources of adaptive capacity such as Circuit Breakers as inefficiencies, but to an organisation optimised for resilience its graceful extensibility is more important than cost efficiencies.

If an organisation has optimised for robustness a Continuous Delivery programme focussed on throughput alone is unlikely to succeed. Resilience As A Continuous Delivery Enabler is a heuristic that advocates resilience as the focus of Continuous Delivery, and using it to bootstrap a Continuous Delivery programme improves production reliability from the outset. Improving the resilience of services by an order of magnitude makes it easier to offer a series of practical alternatives to robustness risk management theatre, and reduce deployment throughput until there is a single value stream that can satisfy business demand 9.

1 Other robustness practices include manual regression testing, segregation of duties, artificial deployment limits, and uptime incentives

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

3 The DevOps Handbook by Patrick Debois et al defines telemetry as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

4 In ITIL these are termed Normal and Emergency Changes

5 The Anxiety Of Learning by Edgar Schein describes how people resist change due to learning and survival anxieties

6 How Complex Systems Fail by Richard Cook explains why hindsight bias is such an obstacle to understanding failures, and why root causes do not exist

7 Measuring Continuous Delivery by the author details how to measure the stability and throughput of IT delivery

8 Chaos Engineering should be restricted to test environments in an unsafe-to-fail culture

9 In ITIL these are termed Standard Changes

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

The value of Optimising For Resilience

What does it mean to optimise for resilience? Why is resilience so valuable to an organisation, and how can operability contribute to the adaptive capacity of IT services?

This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

The value of resilience

When an organisation wants to improve the reliability of its IT services it should Optimise For Resilience. Resilience is the ability to “absorb or avoid damage without suffering complete failure“, and it is immensely valuable in IT. A production environment is a complex system of partial failures in which the potential for catastrophe is ever-present, so an ability to resist failure is vital.

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries“. The graceful extensibility of a system is derived from its adaptive capacity, which represents the capacity for adaptation when a failure occurs.

Erik Hollnagel et al break down resilience in Resilience Engineering In Practice using a conceptual model known as the Four Cornerstones of Resilience.

The cornerstones are non-linear, complementary aspects of resilience:

  • Anticipation is knowing what to expect. This is imagining the potential for future failures, and mitigating for those scenarios in advance
  • Monitoring is knowing what to look for. This is inspecting past and present operating conditions, and alerting when anomalies occur
  • Response is knowing what to do. This is using guidelines, heuristics, improvisation skills, and situational awareness to mitigate a failure
  • Learning is knowing what has happened. This is understanding the circumstances of a near-miss or failure, and sharing the observations

Creating adaptive capacity with Operability

Optimising For Resilience means creating a production environment in which running IT services can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur. When a service has sufficient adaptive capacity the cost per unit time and duration of production failures can potentially be minimised, reducing the direct revenue costs and indirect opportunity costs caused by a failure.

The adaptive capacity of IT services can be increased by explicitly prioritising a lower Mean Time To Repair (MTTR) over a higher Mean Time Between Failures (MTBF). Some classes of failure should never occur, some failures are more costly than others, and safety-critical services should never have failures, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often”.

A lower MTTR can be achieved by investing in the operability of IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition”, and is associated with a set of practices.

Each of these will increase the capacity of a service to adapt to unexpected operating conditions, and produce a more effective incident response:

  • Development: an Adaptive Architecture limits the blast radius of a failure, and Feature Toggles allow features to be limited, tested in isolation, or turned off on failure
  • Testing: Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production
  • Infrastructure: Automated Provisioning creates reproducible environments, and Self-Healing automatically restores failed service instances
  • Telemetry: Logging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly detection identifies events that breach normal operating conditions, and Alerting notifies operators of abnormalities to act on (a minimal alerting sketch follows this list). User analytics show success rates for user journeys
  • People: Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge. Blameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias 1
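
To make the telemetry and alerting bullet concrete, here is a minimal sketch of a sliding-window error rate alert. The class name, the 5% threshold, and the 100-request window are illustrative assumptions rather than recommendations, and a real platform would build this into its monitoring tooling.

```python
from collections import deque

# A minimal sliding-window error rate alert. The 5% threshold and 100-request
# window are illustrative assumptions, not recommended values.
class ErrorRateAlert:
    def __init__(self, threshold=0.05, window_size=100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window_size)  # True = error, False = success

    def record(self, is_error):
        """Record one request outcome and return True if an alert should fire."""
        self.outcomes.append(is_error)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # avoid noisy alerts until the window is full
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold

# Usage: feed request outcomes from the checkout service telemetry
alert = ErrorRateAlert()
for outcome in [False] * 90 + [True] * 10:
    if alert.record(outcome):
        print("Alert: checkout error rate above 5% over the last 100 requests")
        break
```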

For example, incident response at Fruits-U-Like would be much improved if the organisation were optimising for resilience. Assume its third party registration service starts to struggle under load, new customers cannot check out their purchases, and the failure cost per unit time is £80K per day. The checkout team would receive an automated alert for the failure, and their logging and monitoring dashboards would show a correlation between checkout and registration failures. The team would be able to triage a third party registration error within 5 minutes, and self-deploy an improvement to connection handling within a day. The failure would have a 1 day repair cost of £80K, with a detection sunk cost of £278 and a remediation opportunity cost of £79,722.
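
The detection and remediation figures follow directly from the £80K per day cost per unit time, and can be reproduced with a quick calculation.

```python
# Reproducing the incident costs above: £80K per day, a 5 minute triage,
# and a 1 day repair window
cost_per_day = 80_000
cost_per_minute = cost_per_day / (24 * 60)                        # ≈ £55.56

detection_sunk_cost = cost_per_minute * 5                         # ≈ £278
repair_cost = cost_per_day * 1                                    # £80,000
remediation_opportunity_cost = repair_cost - detection_sunk_cost  # ≈ £79,722

print(round(detection_sunk_cost), round(remediation_opportunity_cost))  # 278 79722
```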

If the checkout team implemented an Adaptive Architecture they could combine a Circuit Breaker, a Bulkhead, and a Feature Toggle in anticipation of registration errors. If the registration service struggled under load the Circuit Breaker would regulate registration requests to allow a percentage to succeed, and the Bulkhead would warn the checkout frontend to skip registration for some customers. This approach would reduce the failure cost per unit time to a marketing opportunity cost of £5K a day. The checkout team would not receive an alert, but within minutes their dashboards would highlight registration errors and they could use a Feature Toggle to enable anonymous checkouts for new customers. This would allow them to deploy their connection handling fix within 3 hours with no customer impact. The result would be a 3 hour repair cost of £625, with a sunk cost of £18 and an opportunity cost of £607.
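
As a rough illustration of such an Adaptive Architecture, the sketch below combines a hand-rolled Circuit Breaker with a Feature Toggle guarding an anonymous checkout fallback. The class, thresholds, and function names are assumptions made for illustration, not the Fruits-U-Like implementation or the API of any particular resilience library.

```python
import time

# A minimal Circuit Breaker sketch: after too many consecutive registration
# failures it opens, fails fast, and allows a trial request after a cooldown.
# The thresholds and timings are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("registration circuit open")
            self.opened_at = None  # half-open: allow a trial request
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

registration_breaker = CircuitBreaker()
feature_toggles = {"anonymous_checkout": False}  # flipped on during the incident

def checkout(order, register_customer):
    try:
        customer = registration_breaker.call(register_customer, order)
        return f"checked out {order} for customer {customer}"
    except Exception:
        if feature_toggles["anonymous_checkout"]:
            return f"checked out {order} anonymously"
        raise
```

A Bulkhead would sit alongside this to isolate the resources the checkout frontend dedicates to registration calls, so that a struggling dependency cannot exhaust capacity needed elsewhere.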

Optimising For Resilience sets the foundation for an organisation to act on market disruption and innovate. Once an organisation has the required level of graceful extensibility, it can continue to invest in its people and technology to achieve sustained adaptability. Sustained adaptability has been described by David Woods as “the ability to adapt to future surprises as conditions continue to evolve”, and can be thought of as innovation capability. An organisation that can quickly adapt to unexpected business events will hold a powerful First Mover Advantage over its competitors.

1 In How Complex Systems Fail, Richard Cook warns that “hindsight bias remains the primary obstacle to accident investigation. There is no such thing as a root cause in a complex production system, nor a blameworthy individual”.

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

When Optimising For Robustness fails

Why is it wrong to assume failures are preventable in IT? Why does optimising for robustness leave organisations ill-equipped to deal with failure, and what are the usual outcomes?

This is part of the Resilience as a Continuous Delivery enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

Underinvesting in operability

An organisation that optimises for robustness will attempt to maintain a production environment free from failure. This approach is based on the belief that failures in IT services are caused by isolated, faulty changes that are entirely preventable. A production environment is viewed as a set of homogeneous processes, with predictable interactions occurring in repeatable conditions. This matches the Cynefin definition of a complicated system, in which expert knowledge can be used to predict the cause and effect of events.

Optimising for robustness will inevitably lead to an overinvestment in pre-production risk management, and an underinvestment in production risk management. Symptoms of underinvestment include:

  • Stagnant requirements – “non-functional” requirements are deprioritised for weeks or months at a time
  • Snowflake infrastructure – environments are manually created and maintained in an unreproducible state
  • Inadequate telemetry – logs and metrics are scarce, anomaly detection and alerting are manual, and user analytics lack insights
  • Fragile architecture – services are coupled, service instances are stateful, failures are uncontained, and load vulnerabilities exist
  • Insufficient training – operators are not given the necessary coaching, education, or guidance

This underinvestment creates an inoperable production environment, which makes it difficult for operators to keep IT services in a safe and reliable functioning condition. This will often be deemed acceptable, as production failures are expected to be rare.

The constancy of failure

A production environment of running IT services is not a complicated system. It is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect.

As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws being present”. A production environment always contains partial faults, and is constantly in a state of near-failure.

A failure will occur when unrelated faults unexpectedly coalesce such that one or more functions cannot succeed. Its revenue cost will be a function of cost per unit time and duration, with cost per unit time the economic impact and duration the time between start and end. Its opportunity costs will come from loss of customer confidence, and increased failure demand slowing feature development.

An organisation optimised for robustness will be ill-equipped to deal with a failure when it does occur. The inoperability of the production environment will produce a brittle incident response:

  • Stagnant requirements and insufficient training will make it difficult to anticipate how services might fail
  • Inadequate telemetry will impede the monitoring of normal versus abnormal operating conditions
  • Snowflake infrastructure and a fragile architecture will prevent a rapid response to failure

For example, at Fruits-U-Like a third party registration service begins to suffer under load. The website rejects new customers on checkout, and a failure begins with a static cost per unit time of £80K per day. A lack of telemetry means the operations team cannot triage for 3 days. After triage an incident is assigned to the checkout team, who improve connection handling within a day. The Change Advisory Board agrees the fix can skip End-To-End Testing, and it is deployed the following day. The failure has a 5 day repair cost of £400K, with a detection sunk cost of £240K and a remediation opportunity cost of £160K.
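
The repair cost breaks down according to the cost model above, with cost per unit time multiplied by the detection and remediation durations.

```python
# Failure cost = cost per unit time x duration, split into a detection sunk
# cost and a remediation opportunity cost. Figures are from the scenario above.
def failure_costs(cost_per_day, detection_days, remediation_days):
    return {
        "detection_sunk_cost": cost_per_day * detection_days,
        "remediation_opportunity_cost": cost_per_day * remediation_days,
        "total_repair_cost": cost_per_day * (detection_days + remediation_days),
    }

print(failure_costs(cost_per_day=80_000, detection_days=3, remediation_days=2))
# {'detection_sunk_cost': 240000, 'remediation_opportunity_cost': 160000,
#  'total_repair_cost': 400000}
```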

After a failure, the assumption that failures are caused by individuals will lead to a blame culture. There will be an attitude Sidney Dekker calls the Bad Apple Theory, in which production is considered absolutely reliable bar the actions of a few unreliable employees. The combination of the Bad Apple Theory and hindsight bias will create an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages the sharing of operational knowledge and organisational learnings.

The Dual Value Streams countermeasure

An organisation optimised for robustness will be in a state of Discontinuous Delivery. Attempting to increase the Mean Time Between Failures (MTBF) with practices such as End-To-End Testing will increase feature lead times to the extent that business demand will be unsatisfiable. However, the rules for deploying a production fix will be very different.

When a production fix for a failure is available, people will share a sense of urgency. Regardless of how cost per unit time is estimated, there will be a recognition that a sunk cost has been incurred and an opportunity cost needs to be minimised. There will be a consensus that a different approach is required to avoid long feature lead times.

Dual Value Streams is a common countermeasure to failure when optimising for robustness. For each technology value stream in situ, there will actually be two different value streams. The feature value stream will retain all the advertised pre-production risk management practices, and will take weeks or months to complete. The fix value stream will strip out most if not all pre-production activities, and will take days to complete.

At Fruits-U-Like, that means a 12 week feature value stream from code to production and a 5 day fix value stream from failure start to end 2.

Dual Value Streams signify Discontinuous Delivery, but they also show potential for Continuous Delivery. The fix value stream indicates the lead times that can be accomplished when people have a shared sense of urgency, actively collaborate on releases, and omit the risk management theatre.

1 In The DevOps Handbook by Patrick Debois et al telemetry is defined as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

2 Measuring Continuous Delivery details why deployment failure recovery time should include development time and deployment lead time should not. Deployment failure recovery time is measured from failure start to failure end, while deployment lead time is measured from master commit to production deployment

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

The cost and theatre of Optimising For Robustness

Why do so many organisations optimise their IT delivery for robustness? What risk management practices are normally involved, and do their capabilities outweigh their costs?

This is part of the Resilience as a Continuous Delivery enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

The tradition of robustness

As software continues to eat the world, organisations must position IT at the heart of their business strategy. The speed of IT delivery needs to be capable of satisfying customer demand, and at the same time the reliability of IT services must be ensured to protect daily business operations. In Practical Reliability Engineering, Patrick O’Connor and Andre Kleyner define reliability as “the probability that [a system] will perform a required function without failure under stated conditions for a stated period of time”, or as a function of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). When an organisation has unreliable IT services its business operations are left vulnerable to IT outages, and the cost of downtime could prove ruinous if market conditions are unfavourable.
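
One common way to relate the two measures, offered here as a standard reliability engineering convention rather than a quotation from Practical Reliability Engineering, is steady-state availability. The figures below are illustrative.

```python
# Steady-state availability in terms of MTBF and MTTR, with illustrative
# figures: a failure every 30 days, repaired in 4 hours
mtbf_hours = 30 * 24
mttr_hours = 4

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.4%}")  # 99.4475%
```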

Many organisations have a lack of confidence in their IT services, and an ingrained fear of failure. There is often a simultaneous belief that failures are preventable, based on the assumption that IT services are predictable and failures are caused by isolated changes. In such circumstances an organisation will traditionally Optimise For Robustness. It will focus on maximising the ability of its IT services to “resist change without adapting [their] initial stable configuration”, by implicitly favouring a higher MTBF over a lower MTTR. It will use robustness-centric risk management practices in its technology value streams to reduce the risk of future failures, such as 1:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time derived from market conditions

Consider a fictional Fruits-U-Like organisation, with development teams working to 2 week iterations and a quarterly release cycle. Fruits-U-Like has optimised itself for robustness ever since a 24 hour website outage 5 years ago. Each release goes through 6 weeks of End-To-End Testing with the testing team, a 2 week Change Advisory Board, and 1 week of preparation with the operations team. There are also several 4 week Change Freezes throughout the year, to coincide with marketing campaigns.

The costs and theatre of robustness

Robustness is a desirable capability of an IT service, but optimising for robustness invariably means spending too much time for too little risk reduction. The risk management practices used will be far more costly and less valuable than expected.

If the next Fruits-U-Like release was estimated to be worth £50K per day in new revenue, the 12 week lead time would create a total opportunity cost of £4.2 million. This would include the handover delays between the development, testing, and operations teams due to misaligned priorities. If a Change Freeze delayed the deployment by another 4 weeks the opportunity cost would increase to £5.6 million.
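
The opportunity cost follows directly from the estimated daily value and the lead time.

```python
# Opportunity cost of delayed revenue at £50K per day
value_per_day = 50_000
lead_time_days = 12 * 7           # 12 week release lead time
change_freeze_days = 4 * 7        # optional 4 week Change Freeze

print(value_per_day * lead_time_days)                         # £4.2 million
print(value_per_day * (lead_time_days + change_freeze_days))  # £5.6 million
```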

These risk management practices are what Jez Humble calls Risk Management Theatre. They are based on the misguided assumption that preventative controls on everyone will prevent anyone from making a mistake. Furthermore, they actually increase risk by ensuring a large batch size and a sizeable amount of requirements/technology changes per service version 2. They impede knowledge sharing, restrict situational awareness, create enormous opportunity costs, and doom organisations to a state of Discontinuous Delivery.

1 Other practices include manual regression testing, segregation of duties, and uptime incentives for operators

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Discontinuous Delivery

Continuous Delivery is a set of principles and practices to improve the stability and throughput of a release process. But what does it mean to be practising Continuous Delivery? What comes beforehand, what comes afterwards, and how many deploys a day do you actually need?

Measuring Continuous Delivery describes how to guide the adoption of Continuous Delivery, using stability and throughput measurements. The book introduces a new term into the lexicon of Continuous Delivery – Discontinuous Delivery.

Discontinuous Delivery is when an organisation has a release process that lacks the stability and speed required to satisfy business demand

An organisation that cannot release product increments sufficiently reliably or quickly for its customers is in a state of Discontinuous Delivery. By applying the principles and practices of Continuous Delivery to its unique circumstances and constraints, an organisation can continuously improve the stability and throughput of its release process until it is in a state of Continuous Delivery.

The definition of Discontinuous Delivery leads to some interesting conclusions:

  1. Business demand must be understood before success criteria for Continuous Delivery can be defined
  2. Continuous Delivery does not ask for a fixed number of deploys per unit time – 3 deploys a day might be too slow, 1 deploy a month might be too fast
  3. It is possible to move from Discontinuous Delivery to Continuous Delivery and vice versa multiple times, depending on market conditions

Measuring Continuous Delivery contains more detailed information on Discontinuous Delivery, and how to use the Improvement Kata within the context of an organisation to successfully adopt Continuous Delivery principles and practices.

Aim for Operability, not DevOps as a Cult

The DevOps Handbook describes an admirable DevOps as a Philosophy based on flow, feedback, continual learning and experimentation. However, a near-decade of naivety, confusion, and profiteering surrounding DevOps has left the IT industry with DevOps as a Cult, and the benefits of Operability are all too often overlooked.

Why is DevOps as a Philosophy a laudable ideal? Why is DevOps as a Cult the unpleasant reality? Why should organisations instead focus on Operability as an enabler of Continuous Delivery?

Introduction

When an IT organisation has separate Development and Operations departments it will inevitably suffer from a serious conflict of interest. The Development teams will be told to keep pace with the market and incentivised by features, while the Operations teams will be told to provide reliability and incentivised by uptime. This creates a troubled relationship in which one party tries to maximise production changes, and the other tries to minimise them.

This conflict of interest has a devastating impact on the stability, throughput, and quality of IT services. It produces unstable, unreliable, and insecure services vulnerable to costly outages. It ensures production changes are delayed by days, weeks, or even months due to endless coordination between teams, convoluted change approvals, and fear of failure. It results in significant amounts of functional and operational rework, and constant firefighting just to keep systems up and running. It means the organisation loses out in the marketplace, due to the high opportunity costs incurred and the high attrition rate of employees.

DevOps As A Philosophy

In 2008, Patrick Debois and Andrew Schafer discussed the application of Agile practices to infrastructure at Agile 2008. In 2009, John Allspaw and Paul Hammond shared their famous “10 Deploys per Day: Dev and Ops Cooperation at Flickr” story at Velocity 2009, and Patrick Debois subsequently created the first DevOpsDays conference. The DevOps philosophy of collaboration between Development and Operations had begun.

In 2016, the DevOps Handbook was published by Gene Kim, Jez Humble, Patrick Debois, and John Willis. The DevOps Handbook builds on the Phoenix Project novel by Gene Kim, Kevin Behr, and George Spafford in 2013, and it describes how the Three Ways of DevOps can help organisations to succeed:

  • The First Way: The Principles of Flow – create a continuous flow of value-add from Development to Operations
  • The Second Way: The Principles of Feedback – create a constant flow of feedback from Operations to Development
  • The Third Way: The Principles of Continual Learning and Experimentation – create a culture of ever-increasing knowledge within Development and Operations

The DevOps Handbook advocates long-lived product teams frequently deploying changes during normal business hours, using ubiquitous monitoring to quickly resolve errors, and building a shared culture of Continuous Improvement. It is a seminal work that describes what DevOps should be – DevOps As A Philosophy.

DevOps As A Cult

Unfortunately, there were 7 years between the creation of the DevOps meme and the publication of The DevOps Handbook. In the meantime, a different kind of DevOps has emerged that is entirely distinct from DevOps As A Philosophy yet regrettably popular within the IT industry. This bastardisation of DevOps is a cult based on confusion, naivety, and profiteering.

There has been a great deal of confusion about what DevOps actually is, and many organisations have unwittingly increased their disorder by attempting to adopt DevOps without understanding it. For example, there is now the notion of a DevOps Engineer, in which DevOps is equated with an Infrastructure As Code specialist and any need for further change is ignored. Another example is the DevOps Team, in which a team of DevOps Engineers or similar is inserted between Development and Operations teams and becomes yet another delivery impediment. As Jez Humble has remarked, “creating another functional silo that sits between Dev and Ops is clearly a poor (and ironic) way to try and solve these problems”.

Many people have naively latched onto DevOps via misinformation and with little appreciation of their organisational complexity and context. In a complex, adaptive system every individual has limited information, the cause and effect of an event cannot be predicted, and the system must be probed for insights. One common error is to literally assume the conflict of interest between Development and Operations is always the key constraint, when other emerging conflicts can be equally ruinous such as between separate Development and Testing departments. Another error is to assume large enterprise organisations need some kind of Enterprise DevOps roadmap, despite the ineffectuality of blueprints in a complex system and Dave Roberts pointing out “flow and continuous improvement are equally applicable to a large enterprise as they are to an agile web startup”.

Finally, the lack of clarity on DevOps has led to unabashed profiteering from some recruitment firms and vendors. This can be seen when recruitment firms rebrand sysadmins as DevOps Engineers, or when vendors market their automation tools as DevOps tools. DevOps certification has even been launched by the DevOps Institute, which sells one interpretation of a complex cultural movement and of which Sam Newman complained “aside from perhaps three practitioners, the rest of the group are either professional trainers or sales and marketing people”.

Many organisations that have attempted to adopt DevOps still suffer from short-lived project teams infrequently deploying changes out of business hours, manual regression testing without telemetry, and an antagonistic culture with minimal knowledge sharing. The application of confusion, naivety, and profiteering to the DevOps meme has resulted in what DevOps should not be – DevOps As A Cult.

Aim for Operability, not DevOps As A Cult

The rise of DevOps coincided with the rise of Continuous Delivery, which is explicitly focussed on the improvement of IT stability and throughput to satisfy business demand. Continuous Delivery does not need DevOps As A Philosophy but they can be thought of as complementary, due to their shared emphasis on fast feedback loops, cultural change, and task automation. DevOps As A Cult has no such standing, as shown by Dave Farley stating that “DevOps rarely says enough about the goal of delivering valuable software… this is no place for cargo-cultism”.

Continuous Delivery requires operational excellence to be built into organisations. If a service is unstable, a high level of throughput is impossible to sustain as the rework incurred during periods of instability will restrict the delivery of new features. This means Operability is of critical importance to Continuous Delivery, as throughput is dependent upon the ability of the organisation to maintain safe and reliable systems according to its operational requirements.

Both Continuous Delivery and DevOps As A Philosophy advocate the following operational practices to improve Operability:

  • Prioritisation of operational requirements – plan and prioritise work on configuration, infrastructure, performance, security, etc. alongside new features
  • Automated infrastructure – automate production infrastructure and build a self-service provisioning capability for on-demand pre-production environments
  • Deployment health checks – incorporate system health checks and functional smoke tests into pre-production and production deployments (see the smoke test sketch after this list)
  • Pervasive telemetry – establish a logging/monitoring platform for the aggregation, visualisation, anomaly detection, and alerting of business-level, application-level, and operational-level events
  • Failure injection – introduce simulated errors under controlled conditions into production systems, and rehearse incident response scenarios
  • Incident swarming – encourage people to work together to identify and resolve production incidents as soon as they occur
  • Blameless post-mortems – hold post-incident reviews to understand the context, cause and effect, and remediation of a production incident, and propose countermeasures for the future
  • Shared on-call responsibilities – ensure all team members are on rotation for production incidents, and empowered to handle incidents when they occur
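
As a sketch of the deployment health checks above, the following smoke test probes a health endpoint and a business-critical page after a deployment, failing the pipeline if any check fails. The URLs are hypothetical placeholders.

```python
import urllib.request

# A minimal post-deployment smoke test. The URLs are hypothetical placeholders.
CHECKS = [
    "https://checkout.example.com/health",
    "https://checkout.example.com/basket",
]

def smoke_test(urls, timeout=5):
    failures = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status >= 400:
                    failures.append((url, response.status))
        except OSError as error:  # covers HTTP errors, timeouts, and DNS failures
            failures.append((url, str(error)))
    return failures

if __name__ == "__main__":
    failed = smoke_test(CHECKS)
    if failed:
        raise SystemExit(f"deployment health checks failed: {failed}")
    print("deployment health checks passed")
```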

Teams need to adopt a “You Build It, You Run It” culture, in which everyone contributes to operational practices and everyone is responsible for Operability. This means teams will need guidance on how to build, deploy, and run services plus how to create the operational toolchain to support those services. For this reason operability engineers should be embedded into teams, to share their expertise on the delivery of operational requirements and coach other team members on architecting for resilience, establishing a telemetry platform, adopting a mindset of operational excellence, etc. If there are more teams than available operability engineers then every team should have an operability engineer assigned in a liaison role.

Conclusion

In many organisations the conflict of interest between Development and Operations is enormously damaging, and DevOps As A Philosophy as described in the DevOps Handbook is an admirable model for improving organisations via fast flow, fast feedback, and a culture of learning and experimentation. However, the confusion, naivety, and profiteering surrounding DevOps has led to DevOps As A Cult within the IT industry, and unfortunately its popularity is matched only by its inability to improve organisations.

An organisation that wishes to improve its time to market should adopt Continuous Delivery and aim for Operability. That means operability engineers working on teams to teach others how to adopt an operational mindset and build the necessary tools. Continuous Delivery needs operability, and by achieving operational excellence an organisation can improve its throughput and obtain a strategic competitive advantage in the marketplace.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, John Clapham, and Martin Jackson for their feedback
