How far is your team from continuous deployment in 2022?

Editor's note: We often hear the term "continuous deployment", but what does continuous deployment actually involve, and how can we achieve it? This article walks through the meaning of continuous deployment and a path to implementing it, layer by layer.

Planning and editing | Yachun

In the era of cloud-based R&D, the mainstream form of release has become service-oriented, and this form of release provides a practical basis for continuous release. The precondition for a release is deploying the artifacts to be released to the production environment, so the precondition for continuous release is continuous deployment.

The 4 requirements of continuous deployment

Continuous deployment requires continuously providing a stable, predictable system service. Sometimes a release involves downtime: the system is unavailable while it is shut down and updated. That is not a sustainable form of deployment.

We want continuous deployment to be:

First, accurate: the deployment result is accurate and predictable;

Second, reliable: online services are not affected at any point during deployment;

Third, continuous: there is a steady supply of software increments that can be deployed;

Fourth, low-cost: the continuous deployment process is cheap and efficient.

How do we achieve these 4 points?

1. Accurate, predictable deployment results

Accurate deployment depends on three prerequisites: clearly defined artifacts and configuration to be released, a clearly defined runtime environment, and a clearly defined release process and release strategy.

Here is a simple release example:

The release starts with a well-defined image, that is, the build artifact from upstream. It also includes a number of configurations, such as the startup configuration and the container configuration. Then there is the environment: we configure k8s in the deployment tool, this configuration ultimately forms an environment, and that environment is used in the DevOps process. Finally, we release the artifact and configuration to this environment, and the release is complete.

So releasing is the process of applying a set of artifacts and configurations to a set of environments. First there must be well-defined artifacts to release and a well-defined runtime environment. Then the artifacts, configuration, and environment are described explicitly to form the release content before moving to the next step.
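As a minimal sketch of "applying artifacts and configuration to environments", the release content can be written down as plain data before anything is executed. The class names, fields, and the image reference used in the usage note are illustrative assumptions, not part of any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Build product from upstream, e.g. a container image reference."""
    image: str


@dataclass(frozen=True)
class Environment:
    """A named target such as a cluster namespace (hypothetical fields)."""
    name: str
    namespace: str


@dataclass(frozen=True)
class ReleaseContent:
    """Everything a release needs, stated explicitly before anything runs."""
    artifact: Artifact
    config: dict
    environments: tuple

    def plan(self):
        """Return one (environment name, image, config) step per environment."""
        return [(env.name, self.artifact.image, dict(self.config))
                for env in self.environments]
```

Calling `plan()` on a `ReleaseContent` with two environments yields two steps that share the same image and configuration, which is the "set of artifacts and configurations applied to a set of environments" in its simplest form.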

The simplest release is kubectl apply, but this kind of release has some problems.

First, the result is uncertain. After kubectl apply, the pod may fail to start, the deployment may not work, the service may fail, or you may find that there are not enough pods or no resources available. None of this is known in advance, so whether the release succeeds is uncertain and the outcome cannot be predicted.

Second, the status is not visible. A release is not instantaneous; it is a gradual process. How much has been released, how many problems have occurred, how much traffic has been switched: none of this is known.

Third, the process is uncontrollable. In this kind of release, a command cannot be withdrawn once it has been issued.

If the version has a problem, such as a serious bug that makes all traffic drop to zero, there is no way back, which is very dangerous. So a real release process needs a way to intervene. For example, when I find that switching traffic causes a large drop in availability, I need to be able to stop the release immediately.
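One way to picture a release that is predictable, visible, and controllable is as an explicit state machine. The sketch below is an assumption-laden illustration: the state names and the allowed transitions are invented here and do not describe kubectl or any specific release tool.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    SUCCEEDED = "succeeded"
    ROLLED_BACK = "rolled_back"


# Which transitions an operator or the tool may request (hypothetical policy).
ALLOWED = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.SUCCEEDED, State.ROLLED_BACK},
    State.PAUSED: {State.RUNNING, State.ROLLED_BACK},
    State.SUCCEEDED: {State.ROLLED_BACK},
    State.ROLLED_BACK: set(),
}


class Release:
    def __init__(self, version):
        self.version = version
        self.state = State.PENDING
        self.history = [State.PENDING]  # visible status: every step is recorded

    def move_to(self, new_state):
        """Change state, refusing transitions the policy does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.history.append(new_state)
```

With this shape, "stop the release immediately" is a move from RUNNING to PAUSED, and the history list is what a status page would display.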

Whatever deployment method is used, we want to minimize the impact on online services. The limit of minimizing that impact is a deployment process that does not affect online services at all. This is our second principle.

2. The deployment process does not affect online services

Not affecting online services involves 4 requirements:

First, rolling deployment

Take a grayscale approach: deploy to most of the service in rolling batches and switch traffic only once it is confirmed that there is no problem, so that the online service is never interrupted. Rolling can go too fast, so make sure the interval between batches is long enough for monitoring to detect problems and to collect enough data for a judgment.

Second, observable deployment

The deployment itself may trigger alarms. For example, a deployment may cause the load level of some service nodes to drop without the load level of the whole service dropping. So deployment and monitoring need to be connected: first, to avoid meaningless alarms; second, so that monitoring detects problems caused by the deployment in time. For example, after deploying two nodes: how is the traffic? How is the service? Has latency increased? All of these need to be monitored.

Third, intervention at any time

Many unforeseen problems can appear suddenly during deployment, and some intervention is needed, such as diverting or switching traffic, to avoid affecting the whole system.

Fourth, rollback at any time

If intervention does not solve the problem quickly, you need to roll back. You must be able to roll back at any time, because some failures during deployment are very expensive to repair in place. A fast rollback ensures the service is not affected.
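The four requirements above can be sketched as a small batch loop: deploy a batch, observe it, and roll everything back if the batch looks unhealthy. The `deploy`, `healthy`, and `rollback` callables are hypothetical hooks supplied by the caller; they stand in for a real deployment tool and monitoring system.

```python
def rolling_deploy(nodes, batch_size, deploy, healthy, rollback):
    """Deploy to nodes batch by batch; undo everything on the first bad batch.

    deploy(node), healthy(batch) and rollback(node) are caller-supplied
    callables (assumed hooks, not a real tool's API).
    Returns True if every batch was deployed and judged healthy.
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            deploy(node)
            done.append(node)
        # In a real rollout the observation interval sits here, long enough
        # for monitoring to collect data before the next batch starts.
        if not healthy(batch):
            for node in reversed(done):
                rollback(node)
            return False
    return True
```

A pause for manual intervention would fit at the same point as the health check; it is left out to keep the sketch short.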

Examples of common release patterns

Here are some common release strategies.

(1) Grayscale release

The common architecture of a grayscale release is as follows. There is a load balancer, and the service version behind it is currently V1. To release the new version V2, you can pick one node out of the pool, so that one fifth of the traffic goes to V2.

In this case all the original pods are in Deployment1, while the new pod is in Deployment2. When traffic is routed from the load balancer to the Service, part of it is routed to the new Deployment2.

Sometimes, for finer-grained traffic control, an ingress or a service mesh is used to route specific traffic to Deployment2, for example 5% of the traffic, or traffic whose cookie carries a gray marker.
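The routing rule described here can be sketched as a pure function. This is only an illustration of the decision logic, not the configuration syntax of any ingress controller or mesh; the cookie name `gray`, its value `"1"`, and the hash-bucket scheme are assumptions made for the example.

```python
import hashlib


def route(user_id, cookies, gray_percent):
    """Return "deployment2" for gray traffic, otherwise "deployment1".

    A request is gray if it carries the (hypothetical) cookie gray=1, or if
    the user's stable hash bucket falls below gray_percent (0 to 100).
    """
    if cookies.get("gray") == "1":
        return "deployment2"
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return "deployment2" if bucket < gray_percent else "deployment1"
```

Hashing the user id rather than drawing a random number keeps each user on the same version for the whole release, which makes problems easier to reproduce.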

We expect Deployment2 to gradually replace Deployment1, with Deployment1's traffic slowly taken over until it goes offline. Throughout the process users notice nothing, requests are handled normally, and all monitoring (basic monitoring, application monitoring, business monitoring) stays normal. That is what we expect.

The most common way to do a grayscale release is to create a new deployment associated with the new version's pods. For a period of time two deployment versions exist side by side, and the grayscale is achieved by continuously adjusting the number of pods on each side. This is the most common deployment strategy and its cost is relatively low. The drawback is that it cannot do very fine-grained traffic control, but it is worth considering when the service volume is not large.
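The granularity limit of pod-count grayscale is easy to see with a little arithmetic. The sketch below assumes traffic is spread evenly across pods, which is a simplification; the function name and return shape are made up for the illustration.

```python
def pod_split(total_pods, new_percent):
    """Split a fixed pod budget between the old and the new deployment.

    Returns (old_pods, new_pods, actual_percent). The new side is rounded up,
    so any non-zero request gets at least one pod.
    """
    if total_pods <= 0 or not 0 <= new_percent <= 100:
        raise ValueError("total_pods must be positive, new_percent in 0..100")
    new_pods = (total_pods * new_percent + 99) // 100  # integer ceiling
    return total_pods - new_pods, new_pods, new_pods * 100 / total_pods
```

With 5 pods, asking for 5% gray traffic gives `(4, 1, 20.0)`: the smallest possible step is one pod, or 20%. With 100 pods the same request gives `(95, 5, 5.0)`. Finer control on a small service needs traffic-level routing instead.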

This form of release places requirements on the service. First, for a given service, at most one release can be in progress at a time, because traffic has to be switched continuously during verification.

Second, for one service, only one new deployment version may be rolling out at a time; two concurrent rollouts are not allowed.

Third, during the whole process there are two versions of the deployment, so two versions of the service are being offered. Both versions must be able to serve correctly: whatever is upstream and whatever is downstream, they must handle business requests correctly.

Fourth, the release process must not interrupt service. For ordinary short-lived connections, make sure a session is not dropped or broken because of the release. For long-lived connections, make sure the connection can be migrated automatically to the new service.

Finally, the release process must not cause user requests to fail. There should be a graceful shutdown mechanism, so that an instance finishes the requests it is processing and accepts no new ones. Only then can the desired grayscale effect be guaranteed.
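The graceful shutdown requirement above can be sketched as a small bookkeeping class: once draining begins, new requests are refused and the instance may stop only when nothing is in flight. The class and method names are hypothetical, and a real server would also need thread safety and a drain timeout.

```python
class Instance:
    """Tracks in-flight requests so an instance can drain before it stops."""

    def __init__(self):
        self.in_flight = 0
        self.draining = False

    def accept(self):
        """Try to take a new request; refuse once draining has started."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as completed."""
        if self.in_flight == 0:
            raise RuntimeError("no request in flight")
        self.in_flight -= 1

    def begin_drain(self):
        """Stop accepting new work; existing requests keep running."""
        self.draining = True

    def can_stop(self):
        return self.draining and self.in_flight == 0
```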

So a grayscale release is not just a matter of release tooling. The release strategy has its requirements, and the application itself must meet many requirements as well, in order to achieve a really smooth grayscale release.

Based on this, we have summarized some practical recommendations for grayscale releases for your reference.

First, we recommend that the application stay compatible with the previous version (or the previous several versions). How many versions it must be compatible with depends on the application's situation online: sometimes several versions are online at the same time, and compatibility with all of them must be ensured.

Second, create a new deployment that offers the same service, and carry out the grayscale by adjusting the number of pods or the ingress traffic. Controlling traffic allows finer control, so we recommend controlling traffic.

Third, define the grayscale batches, along with the proportion and observation time of each batch. Batches should be designed sensibly so that the interval between batches is long enough to find and handle problems. If the interval is too short, the release may move on to the next, larger batch before monitoring has had time to raise an alarm, which can carry great risk.

Fourth, in addition to basic monitoring and application monitoring, pay attention to business monitoring data. Monitoring is a large field, but from a release perspective the ultimate goal is to avoid business losses caused by the release. A release may make the business unavailable or cause business errors. More seriously, it may cause large changes in business indicators, such as an abnormal user conversion rate or number of successful logins. Such anomalies should be detected in time, and the release paused immediately.

Fifth, when the release finishes, switch the traffic and observe first. Do not rush to clean up the old pods, so that a later rollback is more efficient. If the old pods are still there, traffic can be switched back quickly, shortening the time that online services are affected.

Sixth, record the released versions to make rollback easier. Besides the specific version, we also need to know where it has been deployed. With the versions recorded, and if compliance checks are well automated, a one-click rollback becomes possible.

Seventh, rollback is different from re-release. The rollback strategy differs from the release strategy: it cannot use very small batches the way a release does. To resolve the problem, reduce the number of batches, shorten the time, and roll back fast.

Finally, if the system supports multi-tenancy, we recommend isolating and managing traffic by tenant, which makes A/B testing in particular convenient.
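The third recommendation (batch proportions and observation times) lends itself to a simple automated check before a release starts. The sketch below is illustrative: the plan format and the `alert_delay_s` parameter, meaning the time monitoring needs before it can raise an alarm, are assumptions rather than features of a specific tool.

```python
def validate_plan(batches, alert_delay_s):
    """Check a grayscale plan given as a list of (percent, observe_seconds).

    Returns a list of human-readable problems; an empty list means the plan
    passes these (deliberately minimal) checks.
    """
    problems = []
    previous = 0
    for i, (percent, observe_s) in enumerate(batches, start=1):
        if not previous < percent <= 100:
            problems.append(
                f"batch {i}: percent {percent} must exceed {previous} and be <= 100")
        if observe_s < alert_delay_s:
            problems.append(
                f"batch {i}: observes {observe_s}s, alerts need {alert_delay_s}s")
        previous = max(previous, percent)
    if not batches or batches[-1][0] != 100:
        problems.append("last batch must reach 100 percent")
    return problems
```

For example, `validate_plan([(5, 60), (100, 600)], 300)` reports that the first batch is observed for less time than an alert needs to fire, which is exactly the risk described in the third recommendation.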

(2) Blue-green deployment

Another common deployment method is blue-green deployment:

Blue-green deployment is similar to grayscale but needs more resources. Whether it fits depends on how the software is deployed and how many machine resources are available. Blue-green places lower requirements on the software than grayscale does. It ensures that everything is deployed before traffic is switched, though that by itself does not amount to being able to deploy continuously. Blue-green is also relatively risky: once something goes wrong, the problem is global.

Beyond the deployment strategy, not affecting online services raises other problems too. For example, the software may be only half developed, or a service may only be usable as a complete system once it has been deployed together with other services. In those cases we need feature switches.

A feature switch is essentially a special kind of configuration, usually delivered as dynamic configuration. Code can be deployed continuously as usual with the switch kept off, and once the client or front end has been released, the switch is turned on. Strictly speaking, turning on a feature switch is itself a release, so feature switches also need version management.
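A minimal sketch of versioned feature switches follows. It keeps every change as a new snapshot so that turning a switch on can be treated, and undone, like a release. The class name, the in-memory storage, and the switch name in the usage note are assumptions; a real system would distribute the snapshots through a dynamic configuration service.

```python
class FeatureSwitches:
    """Versioned on/off switches; every change produces a new version."""

    def __init__(self):
        self._versions = [{}]  # version 0: no switch is on

    @property
    def version(self):
        return len(self._versions) - 1

    def set(self, name, enabled):
        """Publish a new snapshot with one switch changed; return its version."""
        snapshot = dict(self._versions[-1])
        snapshot[name] = enabled
        self._versions.append(snapshot)
        return self.version

    def is_on(self, name):
        """Switches that were never set count as off."""
        return self._versions[-1].get(name, False)

    def rollback(self, version):
        """Re-publish an earlier snapshot as a new version."""
        self._versions.append(dict(self._versions[version]))
        return self.version
```

For instance, after `set("new_checkout", True)` the switch is on at version 1, and `rollback(0)` publishes version 2 with it off again, leaving the full history intact.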

The ultimate goal is that anyone can release software at any time. That means your service can be released at any time, anyone can release it safely, the release operation is very simple and needs no special skills, no major problems arise after the release, and any problem that does arise can be resolved quickly.

So our vision is: any application can be released online at any time.

For Alibaba, this can be expressed as: no release freeze for Double 11; release whenever you need to, even during Double 11. In fact there are many urgent releases during Double 11, and they need very thorough technical safeguards to keep releases safe and reliable, because a problem at that moment may become a failure that draws public attention. It is also exactly at such times that avalanche effects are most likely, causing a chain of faults and problems that finally brings the whole system down.

3. Sustainably deployable software increments

In continuous deployment, being continuous is the key. In the upper row of the picture, the Mona Lisa is drawn one small piece at a time until the 5th piece completes it, but stages 1, 2, 3, and 4 are each incomplete. In the lower row, stages 1, 2, 3, 4, and 5 are each complete, with detail being added continuously. What it expresses is:

(1) A software increment should correspond to a clear point of requirement value, so that each increment delivers a clearly defined piece of value.

(2) A software increment should be complete: a unit that can be released independently.

(3) A software increment should be independently verifiable.

Kent Beck has a saying:

In other words, integration is very important, because in the vast majority of collaborative software development we split a problem up, solve the parts, and then integrate them.

Integration has three steps: commit the code, package and deploy, and verify. These 3 steps are very simple.

The purpose of integration is to verify completeness: to verify that the combined code builds and passes the corresponding functional tests, which helps us identify risks as early as possible. So we should integrate as early as possible, and each integration batch should be very small.

There are two kinds of unit: from the deployment side, the deployable unit; from the integration side, the integrable unit.

The deployable unit runs from testable to releasable and reflects the requirements perspective. An increment is a requirement, a feature, something users can see and use. The integrable unit consists of a number of buildable units that can logically be built together, pass unit tests, and then be verified at the code level. This is the code perspective.

After code is committed, code analysis and compilation both check the quality of the code. A successful compile is often our first-hand feedback, and the compiler helps find some problems while the code is being written. If the build is fast, programmers especially like to use it: as soon as it fails to compile, they know something is wrong. Then come unit tests, integration tests, and functional tests, and finally the code enters the ready-to-release state. So the first half, the blue part, is continuous integration, whose purpose is to reach the ready-to-release state.
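The stage sequence in this paragraph can be sketched as an ordered list of checks that stops at the first failure, so the committer gets feedback from the earliest stage that breaks. The stage names in the usage note and the check callables are placeholders, not a real CI system's API.

```python
def run_pipeline(stages):
    """Run (name, check) stages in order; check() returns True on success.

    Stops at the first failing stage.
    Returns ("ready_to_release", None) or ("failed", failing_stage_name).
    """
    for name, check in stages:
        if not check():
            return "failed", name
    return "ready_to_release", None
```

A pipeline would pass stages such as code analysis, build, unit test, integration test, and functional test, in that order; later stages never run if an earlier one fails.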

4. Low-cost, efficient deployment and release

Once there are artifacts ready for release, how do we deploy and release them efficiently and at low cost?

First, let's look at some common problems. The most common is deferred integration, for example integrating once a month and submitting one batch a month. The second is accumulated debt: the trunk stays unstable, has many problems, and never passes.

The third is the lack of test automation: testing relies entirely on manual work, or automated tests exist but are unstable and cannot be relied on, so it is entirely up to people to judge whether the tests are OK. The fourth is rework: because of quality problems or defects, release activities are often repeated, wasting a lot of time.

Another is time-consuming activities, such as manual code checks, manual approval at each stage, and manual quality review at each stage. These take a lot of time and make the whole release inefficient. When the software finishes one piece of work it enters a state, and if the transition to the next state is judged manually, a long time is spent waiting for feedback. As a result the release as a whole is hard to do efficiently.

The figure above shows release statistics for two applications: application A on top and application B below. Each dot is one release: a green dot is a successful release, a yellow dot a cancelled release, and a red dot a failed release. The vertical axis is the duration of the release, and the horizontal axis is the date on which the release finished.

In fact neither application is doing very well. The first application's problem is a very low release frequency, mostly once or twice a month, though its release success rate is relatively high. The second application releases more frequently, perhaps once every few days, but its failure rate is very high: failures far outnumber successes.

So both have problems, and the total release time for both applications is long, often 24 hours or more. A release that takes more than 8 hours cannot be finished in one working day and requires overtime. Because releasing is high-risk, many companies require someone to keep watch during a release and not leave until it is done. If a release takes more than a day, that means, say, two people taking turns in 12-hour shifts.
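The three signals read off the charts (release frequency, success rate, and duration) can be computed from a plain list of release records. The record format below is an assumption, and the numbers in the usage note are made up for illustration; they are not the data behind the figure.

```python
from statistics import median


def release_stats(releases, days):
    """Summarize releases given as (status, hours) tuples.

    status is "success", "cancelled" or "failed"; days is the length of the
    observed window. Frequency is normalized to a 30-day month.
    """
    if not releases:
        return {"per_month": 0.0, "success_rate": None, "median_hours": None}
    successes = sum(1 for status, _ in releases if status == "success")
    return {
        "per_month": len(releases) * 30 / days,
        "success_rate": successes / len(releases),
        "median_hours": median(hours for _, hours in releases),
    }
```

Four invented releases over 60 days, two successful and two failed, with durations of 26, 30, 20, and 24 hours, give 2.0 releases per month, a success rate of 0.5, and a median duration of 25.0 hours.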

For example, many companies schedule integration for Tuesday and release for Thursday, because a release is assumed to involve overtime. If it cannot be finished on Thursday it continues on Friday, and if it is still going on Friday, people work overtime on Saturday. In many cases, even when the release starts on Thursday, it is not done until Friday night or Saturday, and quite a few releases run into Sunday.

From these two charts we see that application A releases very rarely, only once in a long while, and that many of application B's releases fail. The corresponding risks are that releases take a long time, are error-prone, and are hard to do on demand.

Look closely at application B: when several red dots are followed by a green dot within a short time, it often means that consecutive releases failed and an urgent fix was needed before the release succeeded. This means the software was probably running at risk while the urgent fix was under way. To release continuously, quickly, with high quality, and with confidence at any time, we need to improve our integration and release capability.

The means of fast integration are reducing batch size and keeping the flow smooth. Batch granularity and resource utilization are related: small batches have a relatively short cycle time and large batches a relatively long one, while resource utilization is similar either way and is not a big concern. So reducing batch size shortens cycle time as far as possible. Shorter cycles mean higher frequency, higher frequency means faster feedback, and fixes or responses to problems become faster. Second, keep the flow smooth: once the problems inside the process are solved, keep the road clear.

First, reduce batch size. As mentioned above, in terms of both requirement granularity and code granularity, the releasable unit, the testable unit, the buildable unit, and the unit-testable unit of code should each be as small as possible.

For keeping the flow smooth there are many practices. The most common is automation of all kinds: test automation, build automation, deployment automation, automation of the whole process, automated state transitions, and so on.

Second, manage exceptions, that is, avoid pile-ups during the release process: fix the exception first, then let the process flow again. When the release pipeline has a problem, stop first. Do not check in, do not trigger new runs, and solve the problem first. Get the trunk healthy and only then continue with the remaining integration work. Some companies require that whoever committed the code is responsible for fixing the problem; if it cannot be fixed within half an hour or some other time limit, the system automatically removes that code so that those behind can continue integrating.

Another is reducing dependencies. If the integration and release process depends heavily on external parties, that also causes congestion, because dependencies lead to waiting.

Another is built-in quality. Downstream speed can only be guaranteed once quality is built in upstream; if upstream does not solve its problems, downstream is bound to be blocked. We should look for problems upstream as early as possible and think from the upstream side about how to make the downstream better, earlier.

Timely feedback works the same way: when there is a problem, feed it back accurately and promptly to the specific person. Avoid junk feedback: do not disturb developers with too much useless or irrelevant information, which leads developers to lose trust in the whole feedback mechanism.

Finally, reuse: reuse as much as you can and avoid reinventing the wheel.
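The exception-handling rule mentioned above, where code that keeps the trunk broken past a time limit is removed automatically, can be sketched as a small policy function. The input format, the minute-based clock, and the default 30-minute limit are assumptions for illustration; actually reverting a commit is left to whatever tooling the team uses.

```python
def commits_to_revert(broken_since, now, limit_minutes=30):
    """Pick commits whose trunk breakage has outlasted the time limit.

    broken_since maps commit id -> minute at which it broke the trunk;
    now is the current minute on the same clock.
    Returns commit ids ordered by oldest breakage first.
    """
    overdue = [(since, commit) for commit, since in broken_since.items()
               if now - since >= limit_minutes]
    return [commit for since, commit in sorted(overdue)]
```

With breakages recorded at minutes 0, 20, and 5 and the clock at minute 40, only the first and third commits have been broken for at least 30 minutes, so only those two are returned.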

The above are the 4 principles of continuous deployment as we see them, along with practical recommendations.

To roll out continuous deployment at scale in an enterprise, good tool support is needed. So in the next installment we will share how to put the above practices into tooling with a continuous release pipeline, so that they can spread to the whole team.