On Tech

Author: Steve Smith

Continuous Delivery and Cost of Delay

Use Cost of Delay to value Continuous Delivery features

When building a Continuous Delivery pipeline, we want to value and prioritise our backlog of planned features to maximise our return on investment. The time-honoured, ineffective IT approach of valuation by intuition and prioritisation by cost is particularly ill-suited to Continuous Delivery, due to its focus upon one-off infrastructure improvements to enable product flow. How can we value and prioritise our backlog of planned pipeline features to maximise economic benefits?

To value our backlog, we can calculate the Cost of Delay of each feature – its economic value over a period of time if it was immediately available. Described by Don Reinertsen as “the golden key that unlocks many doors”, Cost of Delay can be calculated by quantifying the value of change or the cost of the status quo via the following economic benefit types:

  • Increase Revenue – improve profit margin
  • Protect Revenue – sustain profit margin
  • Reduce Costs – reduce costs currently incurred
  • Avoid Costs – reduce costs potentially incurred

Cost of Delay allows us to quantify the opportunity cost between a feature being available now or later, and using money as the unit of measurement transforms stakeholder conversations from cost-cutting to delivering value. Calculation accuracy is less important than the process of collaborative information discovery, with assumptions and probabilities preferably co-owned by stakeholders and published via an information radiator.

Cost of Delay = economic value over time if immediately available

To prioritise our backlog, we can use Cost of Delay Divided By Duration (CD3) – a variant of the Weighted Shortest Job First scheduling policy. With CD3 we divide Cost of Delay by duration, with a higher score resulting in a higher priority. This is an effective scheduling policy as the duration denominator promotes batch size reduction.

CD3 = Cost of Delay / Duration
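As a minimal sketch of CD3 scheduling in Python – the Cost of Delay figures come from the worked example below, while the duration estimates are hypothetical, as the article’s estimates appeared in an image:

    def cd3(cost_of_delay_per_day, duration_in_days):
        # Cost of Delay Divided by Duration: higher scores are scheduled first
        return cost_of_delay_per_day / duration_in_days

    backlog = [
        ("Support Oranges application", 6700, 10),   # hypothetical 10 day estimate
        ("Support test framework", 10848, 12),       # hypothetical 12 day estimate
        ("Support database migrator", 27500, 15),    # hypothetical 15 day estimate
    ]

    # Sort the backlog by descending CD3 score to produce the work queue
    for name, cod, days in sorted(backlog, key=lambda f: cd3(f[1], f[2]), reverse=True):
        print(f"{name}: CD3 = {cd3(cod, days):.0f}")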

As the goal of Continuous Delivery is to decrease cycle time by reducing the transaction cost of releasing software, a pipeline feature will likely yield an Avoid Cost or Reduce Cost benefit intrinsically linked to release cadence. We can therefore calculate the Cost of Delay as one of the below:

  1. Reduce Cost: Automate action(s) to decrease wait times within release processing time

    = (wait time in minutes / cycle time in days) * minute price in £

  2. Avoid Cost: Automate action(s) to decrease probability of repeating release processing time due to rework

    = (processing time in minutes / cycle time in days) * minute price in £ * % cost probability per year
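As a minimal sketch of these two calculations – the function names are mine, and the £10000 minute price matches the worked example that follows:

    def reduce_cost_cod(wait_time_mins, cycle_time_days, minute_price):
        # Reduce Cost: value per day of removing wait time from release processing
        return (wait_time_mins / cycle_time_days) * minute_price

    def avoid_cost_cod(processing_time_mins, cycle_time_days, minute_price, annual_probability):
        # Avoid Cost: value per day of avoiding a probable repeat of processing time
        return (processing_time_mins / cycle_time_days) * minute_price * annual_probability

    # e.g. the Oranges application below: 3 wait times of 20 minutes, 90 day cycle time
    print(reduce_cost_cod(20 + 20 + 20, 90, 10000))  # ≈ £6667 per day (£6700 after rounding)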

For example, consider an organisation building a Continuous Delivery pipeline to support its Apples, Bananas, and Oranges applications by fully automating its release scripts. The rate of business change is variable, with an Apples cycle time of 1 month, a Bananas cycle time of 2 months, and an Oranges cycle time of 3 months. Our pipeline has already fully automated the deploy, stop, and start actions for our Apples and Bananas applications but lacks support for our Oranges application, our test framework, and our database migrator.
[Figure: Application estate]

Once our development team have provided their cost estimates, how do we determine which feature to implement next without resorting to intuition?

[Figure: Backlog durations]

We begin by agreeing with our pipeline stakeholders an arbitrary price of £10000 for a minute of our time, and calculate the Cost of Delay for supporting the Oranges application as:
Support Oranges application

= (wait time / cycle time) * minute price
= ((20 + 20 + 20) / 90) * 10000
= 0.67 * 10000
= £6700 per day

Given that the test framework has failed twice in the past year, each time causing a repeat of release processing time specifically due to its lack of pipeline support, the Cost of Delay is:
Support test framework

= (100 / months in a year) * occurrences
= (100 / 12) * 2
≈ 16% cost probability per year

= (processing time / cycle time) * minute price * % cost probability
= ((100 / 30) + (100 / 60) + (160 / 90)) * 10000 * 16%
= 6.78 * 10000 * 16%
= £10848 per day (£5328 Apples, £2672 Bananas, £2848 Oranges)

The Cost of Delay for supporting the database migrator is:

Support database migrator

= (wait time / cycle time) * minute price
= ((45 / 30) + (45 / 60) + (45 / 90)) * 10000
= 2.75 * 10000
= £27500 per day (£15000 Apples, £7500 Bananas, £5000 Oranges)

Now that we have established the value of the planned pipeline features, we can use CD3 to produce an optimal work queue. CD3 confirms that support for the database migrator is our most urgent priority:

[Figure: Backlog prioritised by CD3]

This example shows that using Cost of Delay and CD3 within Continuous Delivery validates Mary Poppendieck’s argument that “basing development decisions on economic models helps the development team make good tradeoff decisions”. As well as learning that support for the database migrator is twice as valuable as any current alternative, we can offer new options to our pipeline stakeholders – for example, if an Apples-specific database migrator required only 5 days, it would become our most desirable feature (£15000 per day / 5 days = CD3 score of 3000).

No Projects

Projects kill flow and teams. Focus on products, not projects

Since the Dawn of Computer Time, enormous sums of money and embarrassing amounts of time have been squandered upon software projects that have delivered little or no return on investment, with projects floundering between segregated Business and IT divisions squabbling over overestimated value-add and underestimated delivery dates. Given Grant Rule’s assertion that “studies too numerous to mention show that software projects are challenged or fail”, why are software projects so prone to failure and why do they persist?

To answer these questions, we must understand what constitutes a software project and why its delivery model is incongruent with product development. If we start with the PRINCE2 project definition of “a temporary organization that is needed to produce a unique and predefined outcome or result at a pre-specified time using predetermined resources”, we can offer a concise definition as follows:

A project is a fixed amount of time and money assigned to deliver value-add

The key characteristic of a software project appears to be its fixed end date, which as a delivery model has been repeatedly debunked by IT practitioners such as Allan Kelly denouncing “endless, pointless discussions about when it will be done… successful software doesn’t have a pre-specified end date” and Marc Lankhorst arguing that “over 80% of IT spending in large organisations is on maintenance”. However, the fixed end date of a software project is invariably a consequence of its requirement for a collection of value-adding features to be simultaneously delivered, suggesting an augmented definition of:

A project is a fixed amount of time and money assigned to deliver a large batch of value-add

Once we view software projects as large batches of value-add, we can apply The Principles Of Product Development Flow by Don Reinertsen and better understand why so many projects fail:

  1. Increased cycle time – a project might not be deliverable on a particular date unless either demand is throttled or capacity is increased, e.g. artificially reduce user demand or increase staffing levels
  2. Increased variability – a project might be delayed due to unpredictable blockages in the value stream, e.g. testing of features B and C blocked while testing of feature A takes longer than expected
  3. Increased feedback delays – a project might incur significant costs due to slow feedback on bad design decisions and/or defects increasing rework, e.g. failures in feature C not detected until features A and B have passed testing
  4. Increased risk – a project might have an increased probability and cost of failure due to increased requirements/technology change, increased variation, and increased feedback delays
  5. Increased overheads – a project might endure development inefficiencies due to increased requirements/technology change, e.g. feature C development time increased by the need to understand the complexity of features A and B
  6. Increased inefficiencies – a project might encounter increased transaction costs due to increased requirements/technology change, e.g. feature A slow to release as features B and C are also required for release
  7. Increased irresponsibility – a project might suffer from diluted responsibilities, e.g. staff member has responsibility for delivery of feature A but is unincentivised to participate in delivery of features B or C

Don also provides a compelling explanation as to why the project delivery model remains prevalent: large batches can become institutionalised as they “appear to have scale economies that increase efficiency [and] appear to reduce variability”. Software projects might indeed appear efficient due to perceived value stream inefficiencies and the counter-intuitiveness of batch size reduction, but from a product development standpoint the project is an inefficient, ineffective delivery model that impedes value, quality, and flow.

There is a compelling alternative to the project delivery model – product development flow, in which we apply economic theory to Lean product development practices in order to flow product designs through our organisation. Product development flow emphasises the benefits of batch size reduction and encourages a one piece continuous flow delivery model, in order to reduce costs and improve return on investment.

Discarding the project delivery model in favour of product development flow requires an entirely different mindset, as epitomised by Grant urging us to “accommodate the ideas of flow production and lean systems thinking” and Allan affirming that “BAU isn’t a dirty word… enhancing products is Business As Usual, we should be proud of that”. On that basis the No Projects movement was conceived by Joshua Arnold to promote the valuation of products over projects, and anointed as:

Projects kill flow and teams. Focus on products, not projects

Application antipattern: Serialisation

Serialisation increases batch size and cycle time

When designing applications for Continuous Delivery, our goal is to grow an architecture that minimises batch size and facilitates a low cycle time. However, architectural decisions are often local optimisations that value efficiency over effectiveness and compromise our ability to rapidly release software, and a good example is the use of object serialisation and pseudo-serialisation between consumer/producer applications.

Object serialisation occurs when the producer implementation of an API is serialised across the wire and reused by the consumer application. This approach is promoted by binary web services such as Hessian.

[Figure: Object serialisation]

Pseudo-serialisation occurs when the producer implementation of an abstraction encapsulating the API is reused by the consumer application. This approach often involves auto-generating code from a schema and is promoted by tools such as JAXB and WSDL Binding.

[Figure: Pseudo-serialisation]

Both object serialisation and pseudo-serialisation impede quality by creating a consumer/producer binary dependency that significantly increases the probability of runtime communication failures. When a consumer is dependent upon a producer implementation of an API, even a minor syntax change in the producer can cause runtime incompatibilities with the unchanged consumer. As observed by Ian Cartwright, serialising objects over the wire means “we’ve coupled our components together as tightly as if we’d just done RPC”.

A common solution to combat this increased risk of failure is to couple consumer/producer versioning, so that both applications are always released at the same version and at the same point in time. This strategy is enormously detrimental to Continuous Delivery as it inflates batch size and cycle time, with larger change sets per release resulting in an increased transaction cost, an increased risk of release failure, and an increased potential for undesirable behaviours.

[Figure: Producer and consumer versions]

For example, when a feature is in development and our counterpart application is unchanged, it must still be released simultaneously. This overproduction of application artifacts increases the amount of inventory waste in our value stream.

[Figure: Wasteful versions]

Alternatively, when a feature is in development and our counterpart application is also in development, the release of our feature will be blocked until the counterpart is ready. This delays customer feedback and increases our holding costs, which could have a considerable economic impact if our new feature is expected to drive revenue growth.

[Figure: Blocked versions]

The solution to this antipattern is to understand that an API is a contract, not an object, and that document-centric messaging is consequently a far more effective method of continuously delivering distributed applications. By communicating context-neutral documents between consumer and producer, we eliminate shared code artifacts and allow our applications to be released independently.

While document-centric messaging reduces the risk of runtime incompatibilities, a new producer version could still introduce an API change that would adversely affect one or more consumers. We can protect consumer applications by implementing the Tolerant Reader pattern and leniently parsing a minimal amount of information from the API, but the producer remains unaware of consumer usage patterns and as a result any incompatibility will remain undetected until integration testing at the earliest.
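As a minimal sketch of the Tolerant Reader pattern in Python – the document format and field names are hypothetical – a consumer leniently parses only the fields it needs and ignores everything else:

    import json

    def read_order(document):
        # Parse only the minimal fields this consumer needs; tolerate
        # missing optional fields and ignore unknown ones
        data = json.loads(document)
        return {
            "id": data["orderId"],                    # required field
            "status": data.get("status", "UNKNOWN"),  # optional field
        }

    # A newer producer version can add fields without breaking this consumer
    print(read_order('{"orderId": 42, "status": "SHIPPED", "newField": true}'))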

A more holistic approach is the use of Consumer Driven Contracts, where each consumer supplies the producer with a testable specification defining its expectations of a conversation. Each contract self-documents consumer/producer interactions and can be plugged into the producer commit build to assert it remains unaffected by different producer versions. When a change in the producer codebase introduces an API incompatibility, it can be identified and assessed for consumer impact before the new producer version is even created.
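A minimal sketch of a consumer-driven contract, assuming a hypothetical contract format and a producer that returns dictionaries – the consumer publishes its expectations as an executable check that runs in the producer commit build:

    # The consumer's expectations of the conversation, expressed as data
    CONSUMER_CONTRACT = {"orderId": int, "status": str}

    def check_contract(producer_response, contract=CONSUMER_CONTRACT):
        # Assert a candidate producer version still satisfies this consumer
        for field, field_type in contract.items():
            assert field in producer_response, f"missing field: {field}"
            assert isinstance(producer_response[field], field_type), f"wrong type: {field}"

    # Run in the producer commit build before a new producer version is created
    check_contract({"orderId": 42, "status": "SHIPPED", "extra": "ignored"})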

By using document-centric messaging and Consumer Driven Contracts, we can continuously deliver distributed applications with a low batch size and a correspondingly low cycle time. The impact of architectural decisions upon Continuous Delivery should not be underestimated.

Release more with less

Continuous Delivery enables batch size reduction

Continuous Delivery aims to overcome the large delivery costs traditionally associated with releasing software, and in The Principles of Product Development Flow Don Reinertsen describes delivery cost as a function of transaction cost and holding cost. While transaction costs are incurred by releasing a product increment, holding costs are incurred by not releasing a product increment and are proportional to batch size – the quantity of in-flight value-adding features, and the unit of work within a value stream.

[Figure: Economic batch size, after Reinertsen]

The above graph shows that a reduction in transaction cost alone will not dramatically impact delivery cost without a corresponding reduction in batch size, and this mirrors our assertion that automation alone cannot improve cycle time. However, Don also states that “the primary controllable factor that enables small batches is low transaction cost per batch”, and by implementing Continuous Delivery we can minimise transaction costs and subsequently release smaller change sets more frequently, obtaining the following benefits:

  1. Improved cycle time – smaller change sets reduce queued work (e.g. pending deployments), and due to Little’s Law cycle time is decreased without constraining demand (e.g. fewer deployments) or increasing capacity (e.g. more deployment staff) – see the sketch after this list
  2. Improved flow – smaller change sets reduce the probability of unpredictable, costly value stream blockages (e.g. multiple deployments awaiting signoff)
  3. Improved feedback – smaller change sets shrink customer feedback loops, enabling product development to be guided by Validated Learning (e.g. measure revenue impact of new user interface)
  4. Improved risk – smaller change sets reduce the quantity of modified code in each release, decreasing both defect probability (i.e. less code to misbehave) and defect cost (i.e. less complex code to debug and fix)
  5. Improved overheads – smaller change sets reduce transaction costs by encouraging optimisations, with more frequent releases necessitating faster tooling (e.g. multi-core processors for Continuous Integration) and streamlined processes (e.g. enterprise-grade test automation)
  6. Improved efficiency – smaller change sets reduce waste by narrowing defect feedback loops, and decreasing the probability of defective code harming value-adding features (e.g. user interface change dependent upon defective API call)
  7. Improved ownership – smaller change sets reduce the diluted sense of responsibility in large releases, increasing emotional investment by limiting change owners (e.g. single developer responsible for change set, feedback in days not weeks)

Despite the business-facing value proposition of Continuous Delivery, there may be no incentive from the business team to increase release cadence. However, the benefits of releasing smaller change sets more frequently – improved feedback, risk, overheads, efficiency, and ownership – are also operationally advantageous, and this should be viewed as an opportunity to educate those unaware of the power of batch size reduction. Such a scenario is similar to the growth of Continuous Integration a decade ago, when the operational benefits of frequently integrating smaller code changes overcame the lack of business incentive to increase the rate of source code integration.

Business requirements = minimum release cadence
Operational requirements = maximum release cadence

A persistent problem with increasing release cadence is Eric Ries’ assertion that “the benefits of small batches are counter-intuitive”, and in organisations long accustomed to a high delivery cost it seems only natural to artificially constrain demand or increase capacity to smooth the value stream. For example, our organisation has a 28 day cycle time of which 7 days are earmarked for release testing. In this situation, decreasing cadence to a 36 day cycle time appears less costly than increasing cadence to a 14 day cycle time, as release testing will ostensibly decrease to 19% of our cycle time rather than increase to 50%. However, this ignores both the holding cost of constraining demand and the long-unimplemented optimisations we would be compelled to introduce to achieve a higher release cadence (e.g. increased level of test automation).
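Spelling out the release testing arithmetic:

= 7 days testing / 36 day cycle time = 19% of cycle time
= 7 days testing / 14 day cycle time = 50% of cycle time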

Improving cycle time is not just about using Continuous Delivery to reduce transaction costs – we must also be courageous, and release more with less.

Build Continuous Delivery in

Building Continuous Delivery into an organisation requires radical change

While Continuous Delivery has a well-defined value proposition and a seminal book on how to implement a deployment pipeline, there is a dearth of information on how to transform an organisation for Continuous Delivery. Despite its culture-focussed principles and an adoption process described by Jez Humble as “organisational-focussed rather than tools-centric”, many Continuous Delivery initiatives fail to emphasise an organisational model in which software is always releasable. This contravenes Lean Thinking and the Deming 95/5 Rule – that 95% of problems are attributable to system faults, while only 5% are due to special causes of variation. Building an automated deployment pipeline can eliminate the 5% of special causes of variation in our value stream (e.g. release failures), but it cannot address the remaining 95% of problems caused by our organisation structure (e.g. wait times between silos). From this we can infer that:

Continuous Delivery = 95% organisation, 5% automation

Establishing a Continuous Delivery culture requires a change management programme more challenging, time-consuming, and valuable than any technology-based efforts. Donella Meadows recommended that to effect change we “arrange the structures and conditions to reduce the probability of destructive behaviours and to encourage the possibility of beneficial ones”, and we can achieve this by using the change patterns of Linda Rising and Mary Lynn Manns within the change management supermodel of Jurgen Appelo:

  • Dance with the System
  • Mind the People
  • Stimulate the Network
  • Change the Environment

To dance with the system, we propose a made-to-order Continuous Delivery programme, with a tailor-made business case that emphasises reduced transaction costs and/or increased customer value according to the needs of our organisation. We must identify a Local Sponsor to support our efforts and a Corporate Angel to increase awareness, and we should communicate successful case studies to our stakeholders as External Validation.

To mind the people, we construct a collaborative, bottom-up change programme that encourages participation. We need to Involve Everyone from the outset, and apply a Personal Touch with each individual stakeholder to pitch Continuous Delivery in terms of their incentives. We should use Corridor Politics to promote our change initiative, Just Say Thanks to our contributors, and highlight value stream waste without dispute – as Morgan Wootten said, “a lighthouse doesn’t blow a horn, it shines a light”.

To stimulate the network, we emulate the Diffusion of Innovations theory of Everett Rogers and exploit the social network that comprises our organisation. We must encourage Innovators to spark an interest in our change initiative, and then form a group of Early Adopters to offer us early feedback. We need to Ask For Help from Connectors to evangelise to their peers on our behalf, and by Staying In Touch with our supporters we can work towards an Early Majority invested in Continuous Delivery.

To change the environment, we focus upon changing our organisation structure and processes to instil a culture of Continuous Delivery. We need to radiate our value stream In Your Space to raise awareness of cycle time, lead times, and wait times using Just Enough repackaged Lean terminology (e.g. “average time to market” instead of cycle time). We must work as Bridge Builders between different siloed teams to reduce our communications burden, and we should develop our pipeline Step By Step to encourage the good practices and discourage the bad (e.g. enforcing the decoupling of deployment from release in a user interface).

Building Continuous Delivery into an organisation can be achieved by automating a deployment pipeline and implementing a change management programme, but we should remember Jurgen Appelo’s advice that changing people “is hard to do without an expensive operating table”. Our change programme must be tailored to business requirements, personalised for each stakeholder, and focussed upon improving the environment – and we should always remember:

Building a Continuous Delivery pipeline is easy. Building a Continuous Delivery organisation is hard

Continuous Delivery != DevOps

Continuous Delivery and DevOps are interdependent, not equivalent

Since the publication of Dave Farley and Jez Humble’s seminal book on Continuous Delivery in 2010, its rise within the IT industry has been paralleled by the growth of the DevOps movement. While Continuous Delivery has an explicit goal of optimising for cycle time and an established set of principles and practices, DevOps is a more organic philosophy, defined as “aligning development and operations roles and processes in the context of shared business objectives” and still gradually codifying into principles and practices. Continuous Delivery and DevOps possess a shared background in agile methods and Lean Thinking, and a shared desire to eliminate Waterscrumfall silos – but what is the nature of their relationship?

In Continuous Delivery, practitioners such as Jez Humble have warned that organisations require “a culture that enables collaboration and understanding between the functional groups that deliver IT services”, which refers to the culture-centric principles – Continuous Improvement, Done Means Released, and Everybody Is Responsible – that reduce handover delays between siloed teams. DevOps provides an implementation strategy for these principles – its emphasis upon “the integration of Agile principles with Operations practices” aligns Development and Operations working practices and encourages cooperation. However, these principles can also be implemented independently of DevOps – for example, an organisation might forgo a QA team in favour of mandatory Development support for production releases, as at Facebook.

In DevOps, one of the four key areas described by Patrick Debois is Extend Delivery To Production. The intention is for the delivery mechanism to act as a focal point for collaboration between Development and Operations, resulting in improved speed/reliability of releases and a sense of shared responsibility for production systems. Continuous Delivery offers an implementation strategy for this key area – a deployment pipeline provides a shared one-button workflow, encourages the emergence of a shared codebase and toolchain, and facilitates a release cadence that minimises change sets and the risk of failure. However, it should be noted that Extend Delivery To Production could be accomplished without Continuous Delivery – for example, a push-based Continuous Deployment mechanism might underpin the value stream instead of a pull-based pipeline, as at IMVU.

From the above we can surmise that Continuous Delivery and DevOps are interdependent, but the inherent fuzziness of the DevOps philosophy allows different interpretations of the relationship. For example, Jeff Sussna recently contended that “delivering software as service makes operations an explicit part of the customer value proposition… customers view functionality and operability as inseparable aspects of service” and that by defining DevOps “not in terms of how IT structures itself, but rather in terms of what customers expect” we can say “DevOps IS Continuous Delivery”. While it is an interesting approach to couple DevOps to customer expectations, the commonly accepted definitions focus upon internal organisational change in order to meet business objectives, which may or may not include operability as a first-class concept. It is evident that SaaS customers will have explicit operability requirements, but for many organisations the reality is that customers explicitly expect functionality and timeliness while implicitly expecting operability. For example, Jeff uses a restaurant review metaphor to describe the combined value of functionality and operability (“the food was great but the service was terrible”), but restaurant customers cannot observe back-of-house operability and will likely only comment upon front-of-house operability if it impacts upon functionality and/or timeliness.

Jeff also makes a comparison of nomenclature, suggesting that “for agile development and Continuous Delivery the name describes the value… in the case of DevOps, the name describes the implementation, not the desired outcome”. Surely the desired outcome of DevOps is expressed in the portmanteau – Development and Operations teams seamlessly working together to deliver value-adding features to the customer.

Optimal cycle time strategy

How should you try to optimise cycle time from idea to customer? How can you optimise accessible constraints, and radiate the inaccessible?

The goal of Continuous Delivery is to optimise for cycle time, so that we can reduce lost opportunity costs and improve our time-to-market. However, how do we construct a cycle time strategy, and how might it be implemented without a comprehensive change mandate? A study of Continuous Delivery experience reports and Lean Thinking suggests some common impediments to optimising cycle time:

  1. Excessive rework
  2. Long lead times
  3. Incongruent organisation structure

From the above we can therefore form an ideal cycle time strategy:

Optimise cycle time = optimise product integrity + optimise lead times + optimise organisation

Optimising product integrity is essential as rework has a pernicious influence upon delivery cadence, highlighted by David Anderson stating that “unplanned rework due to bugs lengthens lead times… and greatly reduces throughput”. By using practices such as Acceptance Test Driven Development and root cause analysis as well as applying Continuous Delivery principles such as Build Quality In and Repeatable Reliable Process, we can trim our defect waste and gradually remove rework from the value stream.

Optimising lead times encourages us to recognise that unreleased product increments are valueless inventory, and that we should accelerate our pathway to production until we obtain a First Mover Advantage over our competitors. By introducing Work In Progress limits to reduce batch sizes and employing the Continuous Delivery principles of Automate Almost Everything and Bring Pain Forward, we can curtail our inventory waste and deliver value-adding features to our customers faster.

Optimising an organisation offers both the greatest challenge and the greatest potential for cycle time optimisations, particularly in siloed organisations. Described by Jez Humble as a “response to the historical expense of computing resources and the high transaction cost of putting out a release [that results in] lower software quality, lower production stability, and less frequent releases”, the siloed organisation remains a prevalent model despite its inherent coordination costs. By restructuring our organisation into product-centric, cross-functional teams and instilling the Continuous Delivery principles of Everybody Is Responsible and Continuous Improvement, we can eliminate our wait waste and obtain a significant cycle time reduction.

At the outset of our Continuous Delivery programme, a value stream mapping and analysis of product defects will likely indicate our expected cycle time impediments, and we should present these findings to our stakeholders along with our ideal cycle time optimisation strategy. However, the ambitious scope of our strategy means that without executive sponsorship our change mandate is unlikely to extend to such radical notions as establishing cross-functional teams. In this situation we should use the confines of our mandate to derive an organisation-specific optimal cycle time strategy:

Optimise cycle time = optimise product integrity + optimise lead times + optimise organisation

Rather than being discouraged by the limitations of our mandate, we can use it to guide our optimisation efforts according to constraint accessibility. If we cannot optimise the organisation, we optimise lead times. If we cannot optimise lead times, we optimise product integrity. After each successful change is implemented, we communicate to our stakeholders both the net gain in cycle time and the larger, inaccessible potential improvements:

Optimise the accessible, radiate the inaccessible

In this manner we can gradually build confidence in our Continuous Delivery programme, until our change mandate is broadened to encompass the comprehensive change required to dramatically improve both our cycle time and our product revenues.

The Strangler Pipeline – Autonomation

The Strangler Pipeline is grounded in autonomation

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges
  3. The Strangler Pipeline – Scaling Up
  4. The Strangler Pipeline – Legacy and Greenfield

The introduction of Continuous Delivery to an organisation is an exciting opportunity for Development and Operations to Automate Almost Everything into a Repeatable Reliable Process, and at Sky Network Services we aspired to emulate organisations such as LMAX, Springer, and 7Digital by building a fully automated Continuous Delivery pipeline to manage our Landline Fulfilment and Network Management platforms. We began by identifying our Development and Operations stakeholders, and establishing a business-facing programme to automate our value stream. We emphasised to our stakeholders that automation was only a step towards our end goal of improving upon our cycle time of 26 days, and that the Theory Of Constraints warns that automating the wrong constraint will have little or no impact upon cycle time.

Our determination to value cycle time optimisation above automation in the Strangler Pipeline was soon justified by the influx of new business projects. The unprecedented growth in our application estate led to a new goal of retaining our existing cycle time while integrating our greenfield application platforms, and as our core business domain is telecommunications not Continuous Delivery we concluded that fully automating our pipeline would not be cost-effective. By following Jez Humble and Dave Farley’s advice to “optimise globally, not locally”, we focussed pipeline stakeholder meetings upon value stream constraints and successfully moved to an autonomation model aimed at stakeholder-driven optimisations.

Described by Taiichi Ohno as one of “the two pillars of the Toyota Production System”, autonomation is defined as automation with a human touch. It refers to the combination of human intelligence and automation where full automation is considered uneconomical. While the most prominent example of autonomation is problem detection at Toyota, we have applied autonomation within the Strangler Pipeline as follows:

  • Commit stage. While automating the creation of an aggregate artifact when a constituent application artifact is committed would reduce the processing time of platform creation, it would have zero impact upon cycle time and would replace Operations responsibility for release versioning with arbitrary build numbers. Instead the Development teams are empowered to track application compatibilities and create aggregate binaries via a user interface, with application versions selectable in picklists and aggregate version numbers auto-completed in order to reduce errors.
  • Failure detection and resolution. Although creating an automated rollback or self-healing releases would harden the Strangler Pipeline, we agreed that such a solution was not a constraint upon cycle time and would be costly to implement. When a pipeline failure occurs it is recorded in the metadata of the application artifact, and we Stop The Line to prevent further use until a human has logged onto the relevant server(s) to diagnose and correct the problem.
  • Pipeline updates. Although the high frequency of Strangler Pipeline updates implies value in further automation of its own Production release process, a single pipeline update cannot improve cycle time and we wish to retain scheduling flexibility – as pipeline updates increase the probability of release failure, it would be unwise to release a new pipeline version immediately prior to a Production platform release. Instead a Production request is submitted for each signed off pipeline artifact, and while the majority are immediately released the Operations team reserve the right to delay if their calendar warns of a pending Production platform release.

Autonomation emphasises the role of root cause analysis, and after every major release failure we hold a session to identify the root cause of the problem, the lessons learned, and the necessary counter-measures to solve it permanently. At the time of writing our analysis shows that 13% of release failures were caused by pipeline defects, 10% by misconfiguration of TeamCity Deployment Builds, and the majority originated in our siloed organisational structure. This data provides an opportunity to measure our adoption of the principles of Continuous Delivery according to Shuhari:

  • shu – By scaling our automated release mechanism to manage greenfield and legacy application platforms, we have implemented Repeatable Reliable Process, Automate Almost Everything, and Keep Everything In Version Control
  • ha – By introducing combinational static analysis tests and a pipeline user interface to reduce our defect rate and TeamCity usability issues, we have matured to Bring The Pain Forward and Build Quality In
  • ri – Sky Network Services is a Waterscrumfall organisation where Business, Development, and Operations work concurrently on different projects with different priorities, which means we sometimes fall foul of Conway’s Law and compete over constrained resources to the detriment of cycle time. We have yet to achieve Done Means Released, Everybody Is Responsible, and Continuous Improvement

An example of our organisational structure impeding cycle time would be the first release of the new Messaging application 186-13, which resulted in the following value stream audit:

[Figure: Messaging 186-13 value stream]

While each pipeline operation was successful in less than 20 seconds, the disparity between Commit start time and Production finish time indicates significant delivery problems. Substantial wait times between environments contributed to a lead time of 63 days, far in excess of our average lead time of 6 days. Our analysis showed that Development started work on Messaging 186-13 before Operations ordered the necessary server hardware, and as a result hardware lead times restricted environment availability at every stage. No individual or team was at fault for this situation – the fault lay in the system, with Development and Operations working upon different business projects at the time with non-aligned goals.

With the majority of the Sky Network Services application estate now managed by the Strangler Pipeline, it seems timely to reflect upon our goal of retaining our original cycle time of 26 days. Our data suggests that we have been successful, with the cycle time of our Landline Fulfilment and Network Management platforms now 25 days and our greenfield platforms between 18 and 21 days. However, examples such as Messaging 186-13 remind us that cycle time cannot be improved by automation alone, and we must now redouble our efforts to implement Done Means Released, Everybody Is Responsible, and Continuous Improvement. By building the Strangler Pipeline we have followed Donella Meadows’ change management advice to “reduce the probability of destructive behaviours and to encourage the possibility of beneficial ones” and given all we have achieved I am confident that we can Continuously Improve together.

My thanks to my colleagues at Sky Network Services

The Strangler Pipeline – Legacy and greenfield

The Strangler Pipeline uses the Stage Strangler pattern to manage legacy and greenfield applications

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges
  3. The Strangler Pipeline – Scaling Up

When our Continuous Delivery journey began at Sky Network Services, one of our goals was to introduce a Repeatable, Reliable Process for our Landline Fulfilment and Network Management platforms by creating a pipeline deployer to replace the disparate Ruby and Perl deployers used by Development and Operations. The combination of a consistent release mechanism and our newly-developed Artifact Container would have enabled us to Bring The Pain Forward from failed deployments, improve lead times, and easily integrate future greenfield platforms and applications into the pipeline. However, the simultaneous introduction of multiple business projects meant that events conspired against us.

While pipeline development was focussed upon improving slow platform build times, business deadlines for the Fibre Broadband project left our Fibre, Numbering, and Providers technical teams with greenfield Landline Fulfilment applications that were compatible with our Artifact Container and incompatible with the legacy Perl deployer. Out of necessity those teams dutifully followed Conway’s Law and created deployment buttons in TeamCity housing application-specific deployers as follows:

  • Fibre: A loathed Ant deployer
  • Numbering: A loved Ant deployer
  • Providers: A loved Maven/Java deployer

Over a period of months, it became apparent that this approach was far from ideal for Operations. Each Landline Fulfilment platform release became a slower, more arduous process as the Perl deployer had to be accompanied by a TeamCity button for each greenfield application. Not only did these extra steps increase processing times, but the use of a Continuous Integration tool ill-suited to release management also introduced symptoms of the Deployment Build antipattern, and errors started to creep into deployments.

While Landline Fulfilment releases operated via this multi-step process, a pipeline deployer was developed for the greenfield application platforms. The Landline Assurance, Wifi Fulfilment, and Wifi Assurance technical teams had no time to spare for release tooling and immediately integrated into the pipeline. The pipeline deployer proved successful and consequently demand grew for the pipeline to manage Landline Fulfilment releases as a single aggregate artifact – although surprisingly Operations requested the pipelining of greenfield applications first, due to the proliferation of per-application, per-environment deployment buttons in TeamCity.

A migration method was therefore required for pipelining the entire Landline Fulfilment platform that would not increase the risk of release failure or incur further development costs, and with those constraints in mind we adapted the Strangler pattern for Continuous Delivery as the Stage Strangler pattern. First coined by Martin Fowler and Michael Feathers, the Strangler pattern describes how to gradually wrap a legacy application in a greenfield application in order to safely replace existing features, add new features, and ultimately replace the entire application. By creating a Stage Interface for the different Landline Fulfilment deployers already in use, we were able to kick off a series of conversations with the Landline Fulfilment technical teams about pipeline integration.
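As a minimal sketch of the Stage Interface idea – the interface and class names are hypothetical – each existing deployer is wrapped behind a common contract so the pipeline can delegate deployments without caring which tool performs them:

    from abc import ABC, abstractmethod

    class DeployerStage(ABC):
        # The Stage Interface: one contract for every legacy or greenfield deployer
        @abstractmethod
        def deploy(self, application, version, environment):
            ...

    class PerlDeployerStage(DeployerStage):
        def deploy(self, application, version, environment):
            print(f"legacy Perl deployer: {application} {version} -> {environment}")

    class NumberingDeployerStage(DeployerStage):
        def deploy(self, application, version, environment):
            print(f"Numbering Ant deployer: {application} {version} -> {environment}")

    # The pipeline selects a stage per application, and stages are strangled over time
    stages = {"numbering": NumberingDeployerStage(), "legacy": PerlDeployerStage()}
    stages["numbering"].deploy("numbering", "1.0", "test")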

We began the Stage Strangler process with the Fibre application deployer, as the Fibre team were only too happy to discard it. We worked together on the necessary changes, deleting the Fibre deployer and introducing a set of version-toggled pipeline deployment buttons in TeamCity. The change in release mechanism was advertised to stakeholders well in advance, and a smooth cutover built up our credibility within Development and Operations.

[Figure: Deploying Fibre]

While immediate replacement of the Numbering application deployer was proposed due to the Deficient Deployer antipattern causing per-server deployment steps for Operations, the Numbering team successfully argued for its retention as it provided additional application monitoring capabilities. We updated the Numbering deployer to conform to our Stage Interface and eliminate the Deficient Deployer symptoms, and then wrote a Numbering-specific pipeline stage that delegated Numbering deployments to that deployer.

[Figure: Deploy Numbering]

The Providers team had invested a lot of time in their application deployer – a custom Maven/Java deployer with an application-specific sign-off process embedded within the Artifactory binary repository. Despite Maven’s Continuous Delivery incompatibilities, build numbers being polluted by release numbers, and the sign-off process triggering the Artifact Promotion antipattern, the Providers team resolutely wished to retain their deployer due to their sunk costs. This resulted in a long-running debate over the relative merits of the different technical solutions, but the Stage Strangler helped us move the conversation forward by shaping it around pipeline compatibility rather than technical uniformity. We wrote a Providers-specific pipeline stage that delegated Providers deployments to that deployer, and the Providers team removed their sign-off process in favour of a platform-wide sign-off process managed by Operations.

[Figure: Deploy Providers]

As all greenfield applications have now been successfully integrated into the pipeline and the remaining Landline Fulfilment legacy applications are in the process of being strangled, it would be accurate to say that the Stage Strangler pattern provided us with a minimal cost, minimal risk method of integrating applications and their existing release mechanisms into our Continuous Delivery pipeline. The use of the Strangler pattern has empowered technical teams to make their own decisions on release tooling, and a sign of our success is that development of new pipeline features continues unabated while the Numbering and Providers teams debate the value of strangling their own deployers in favour of a universal pipeline deployer.

[Figure: Deploy Anything]

The Strangler Pipeline – Scaling up

The Strangler Pipeline scales via an Artifact Container and Aggregate Artifacts

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges

While Continuous Delivery experience reports abound from organisations such as LMAX and Springer, the pipelines described tend to be focussed upon applying the Repeatable, Reliable Process and Automate Almost Everything principles to the release of a single application. Our Continuous Delivery journey at Sky Network Services has been a contrasting experience, as our sprawling application estate has led to significant scalability demands in addition to more common challenges such as slow build times and unrepeatable release mechanisms.

When pipeline development began 18 months ago, the Sky Network Services application estate consisted of our Network Inventory and Landline Fulfilment platforms of ~25 applications, with a well-established cycle time of monthly Production releases.

However, in a short period of time the demand for pipeline scalability skyrocketed due to the introduction of Fibre Broadband, Landline Assurance, Wifi Fulfilment, Wifi Realtime, and Wifi Assurance.

In under a year our application estate more than doubled in size to 6 platforms of ~65 applications with the following characteristics:

  • Different application technologies – applications are Scala or Java, built by Ant/Maven/Ruby, with Spring/Yadic application containers and Tomcat/Jetty/Java web containers
  • Different platform owners – the Landline Fulfilment platform is owned by multiple teams
  • Different platforms for same applications – the Orders and Services applications are used by both Landline Fulfilment and Wifi Fulfilment
  • Different application lifecycles – applications may be updated every day, once a week, or less frequently

To attain our scalability goals without sacrificing cycle time we followed the advice of Jez Humble and Dave Farley that “the simplest approach, and one that scales up to a surprising degree, is to have a [single] pipeline”, and we built a single pipeline based upon the Artifact Container and Aggregate Artifact pipeline patterns.

For the commit stage of application artifacts, the pipeline provides an interface rather than an implementation. While a single application pipeline would be solely responsible for the assembly and unit testing of application artifacts, this strategy would not scale for multi-application pipelines. Rather than incur significant costs in imposing a common build process upon all applications, the commit interface asks that each application artifact be fully acceptance-tested, provide associated pipeline metadata, and conform to our Artifact Container. This ensures that application artifacts are readily accessible to the pipeline with minimal integration costs, and that the pipeline itself remains independent of different application technologies.
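A minimal sketch of such a commit interface check, with hypothetical metadata field names:

    REQUIRED_METADATA = {"name", "version", "platforms", "acceptance_tested"}

    def conforms_to_container(metadata):
        # An artifact is admissible if it carries the required pipeline metadata
        # and declares itself fully acceptance-tested, however it was built
        return REQUIRED_METADATA <= metadata.keys() and metadata["acceptance_tested"]

    print(conforms_to_container({"name": "orders", "version": "317",
                                 "platforms": ["wifi-fulfilment"],
                                 "acceptance_tested": True}))  # True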

For the creation of platform artifacts, the pipeline contains a commit stage implementation that creates and persists aggregate artifacts to the artifact repository. Whereas an application commit is automatically triggered by a version control modification, a platform commit is manually triggered by a platform owner specifying the platform version and a list of pre-built constituent application artifacts. The pipeline compares constituent metadata against its aggregate definitions to ensure a valid aggregate can be built, before creating an aggregate XML file to act as a version manifest for future releases of that platform version. The use of aggregate artifacts provides a tool for different teams to collaborate on the same platform, different platforms to share the same application artifacts, and for different application lifecycles to be encapsulated behind a communicable platform release version.
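A minimal sketch of a platform commit, assuming hypothetical aggregate definitions and manifest element names – validate the constituents against the aggregate definition, then write an XML version manifest:

    from xml.etree.ElementTree import Element, SubElement, tostring

    AGGREGATE_DEFINITIONS = {"wifi-fulfilment": {"orders", "services"}}

    def create_aggregate(platform, version, constituents):
        # Reject the platform commit if the constituents do not form a valid aggregate
        expected = AGGREGATE_DEFINITIONS[platform]
        if set(constituents) != expected:
            raise ValueError(f"invalid aggregate: expected {expected}")
        manifest = Element("aggregate", name=platform, version=version)
        for application, app_version in constituents.items():
            SubElement(manifest, "constituent", name=application, version=app_version)
        return tostring(manifest, encoding="unicode")

    print(create_aggregate("wifi-fulfilment", "1.0", {"orders": "317", "services": "192"}))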

While the Strangler Pipeline manages the release of application artifacts via a Repeatable Reliable Process akin to a single application pipeline, the use of the Aggregate Artifact pattern means that an incremental release mechanism is readily available for platform artifacts. When the release of an aggregate artifact into an environment is triggered, the pipeline inspects the metadata of each aggregate constituent and only releases the application artifacts that have not previously entered the target environment. For example, if Wifi Fulfilment 1.0 was previously released containing Orders 317 and Services 192, a release of Wifi Fulfilment 2.0 containing Orders 317 and Services 202 would only release the updated Services artifact. This approach reduces lead times and by minimising change sets reduces the risk of release failure.
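A minimal sketch of that incremental release mechanism, using the Wifi Fulfilment example above (the data shapes are hypothetical):

    def incremental_release(aggregate_constituents, environment_state):
        # Release only the application artifacts whose versions have not
        # previously entered the target environment
        return {app: version for app, version in aggregate_constituents.items()
                if environment_state.get(app) != version}

    # Wifi Fulfilment 1.0 previously released Orders 317 and Services 192
    environment = {"orders": "317", "services": "192"}
    wifi_fulfilment_2_0 = {"orders": "317", "services": "202"}

    print(incremental_release(wifi_fulfilment_2_0, environment))  # {'services': '202'}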

A good heuristic for pipeline scalability is that a state of Authority without Responsibility is a smell. For example, we initially implemented a per-application configuration whitelist as a hardcoded regex within the pipeline. That might have sufficed in a single application pipeline, but the maintenance cost in a multi-application pipeline became a painful burden as different application-specific configuration policies evolved. The problem was solved by making the whitelist itself configurable, which empowered teams to be responsible for their own configuration and allowed configuration to change independent of a pipeline version.
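A minimal sketch of the configurable whitelist – the patterns and application names are hypothetical – with each application owning its configuration policy as data rather than a regex hardcoded into the pipeline:

    import re

    # Whitelists live in configuration owned by each team, so policies can
    # change without releasing a new pipeline version
    WHITELISTS = {
        "orders":   [r"^orders\..*"],
        "services": [r"^services\..*", r"^shared\.db\..*"],
    }

    def is_permitted(application, configuration_key):
        patterns = WHITELISTS.get(application, [])
        return any(re.match(pattern, configuration_key) for pattern in patterns)

    print(is_permitted("services", "shared.db.url"))  # True
    print(is_permitted("orders", "shared.db.url"))    # False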

In hindsight, while the widespread adoption of our Artifact Container has protected the pipeline from application-specific behaviours impeding pipeline scalability, it is the use of the Aggregate Artifact pattern that has so successfully enabled scalable application platform releases. The Strangler Pipeline has the ability to release application platform versions containing a single updated application, multiple updated applications, or even other application platforms themselves.
