Building robust and opinionated CI/CD pipelines
Introduction
Continuous Integration, Continuous Delivery, and related practices are hot topics in the DevOps space. They have many benefits for both large and small teams. However, with great power comes the ability to make a massive mess of your infrastructure if these practices are not implemented correctly. As a distributed software engineer over the past decade, I’ve worked directly and indirectly on complex CI/CD pipelines at startups and mid-size companies. I’ve worked with AWS CodePipeline, Travis CI, GitLab, GitHub Actions and, more recently, Buildkite and Argo CD. This article explores how to make consistent and informed plans for CI/CD pipelines that support you throughout your entire SDLC.
Overview
Most companies that have tried implementing CI/CD have done so unevenly and inconsistently across their teams, projects and technologies. Each team has its own stack of bash scripts or projects containing YAML files that define build configurations. These artifacts are often stored in different locations and may be managed differently depending on the project or team. The result is resources scattered all over your infrastructure that lack consistency and are difficult to manage at scale. An organization in this state will not realize the full potential benefits of CI/CD.
On the other hand, effective CI/CD pipelines can significantly increase your software development velocity. At my previous company, by migrating from old Jenkins pipelines to modern, standardized GitLab pipelines on Kubernetes, we were able to increase automated test coverage by 50% and to increase our deployment frequency drastically, by approximately 1000%.
Pipeline Design
The first step, obviously, is design. At my previous company, we had a rule that allowed engineers to spend 20% of their time on inner source projects, so management decided to hand the design of our new CI/CD pipelines to multiple working groups. The idea was to ask a large group of engineers to use their 20% time to design pipelines based on their needs, and then share the newly created pipelines with the whole company to gather feedback. Initially the idea sounded great and everybody was excited about it, but in practice it was not as great as we thought it would be.
In my opinion there were two simple reasons for that. First, designing CI/CD pipelines is not a part-time job; it’s a full-time job, so having engineers spend 20% of their time in working groups on the new pipelines was not sufficient. Second, to have consistency we wanted the new pipelines to be opinionated, and like everything else in life it was impossible to keep everybody happy. As a result, the feedback the working groups received wasn’t always helpful and, interestingly enough, in some cases was mutually exclusive.
The good news was that the solution to those problems was very simple: give the whole project to one single strong team. We already had a DevOps team working mostly on our cloud infrastructure, so we asked them to take over the project. Although the number of engineers on the DevOps team was a fraction of what we had in the working groups, the end result was significantly better. They started small by supporting only backend ECS services, which at that time covered roughly 70% of our deployments, and step by step they added support for other types of deployments, such as Lambda functions and SSR applications. Since they were laser-focused, they were able to move much faster in identifying and implementing the critical features.
Categorization
After forming a dedicated team, the second step is to categorize your deployments. In my experience, most modern applications fall into three main categories. The first and most popular category is “services and workers”. These applications are deployed to a container orchestration platform like ECS or Kubernetes. The second most popular category is “mobile applications”. These applications are developed for the iOS or Android operating systems, and the process of building, testing, and deploying them can be entirely different from the first category. The third category is FaaS (function as a service). These applications are deployed to AWS Lambda, Google Cloud Functions or Azure Functions. Although this category may look very similar to the first one, in reality the tools for building and deploying FaaS applications are quite different from those for ECS or Kubernetes services. For instance, the Serverless CLI is one of the most popular tools for building and deploying AWS Lambda functions, which makes the process very different from ECS or K8s.
Bear in mind that in this post I focused on the most popular categories, but there are other relatively popular ones. For example, you might need separate pipelines for scheduled tasks, cron jobs, building base Docker images or Helm charts, or even updating your Terraform infrastructure.
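To make the categorization step concrete, here is a minimal Python sketch that assigns a repository to one of the three categories based on the files found at its root. The marker file names, the rule order and the function name are assumptions made for illustration, not something taken from a real pipeline; your own repositories will likely need different signals.

```python
from enum import Enum
from typing import Iterable


class Category(Enum):
    SERVICE_OR_WORKER = "services and workers"
    MOBILE = "mobile applications"
    FAAS = "function as a service"
    UNKNOWN = "unknown"


# Hypothetical marker files (assumption): adjust to what your repositories contain.
# Order matters: the first matching rule wins, so more specific markers come first.
RULES = [
    (Category.FAAS, {"serverless.yml", "serverless.yaml"}),
    (Category.MOBILE, {"Podfile", "AndroidManifest.xml"}),
    (Category.SERVICE_OR_WORKER, {"Dockerfile"}),
]


def categorize(file_names: Iterable[str]) -> Category:
    """Return the first category whose marker files appear in file_names."""
    names = set(file_names)
    for category, markers in RULES:
        if names & markers:
            return category
    # Anything else (cron jobs, base images, Terraform, ...) needs its own rule.
    return Category.UNKNOWN


if __name__ == "__main__":
    print(categorize(["Dockerfile", "main.go"]).value)
    print(categorize(["serverless.yml", "handler.py"]).value)
```

Repositories that end up as UNKNOWN are a useful list in their own right: they are the candidates for the additional pipeline types mentioned above.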
Compliance
There is an interesting tendency among application developers that I’ve been observing since I started working more closely on infrastructure projects. If application developers see a “non-critical” CI/CD job fail and they don’t know how to fix it, they either ignore it for as long as they can or, even worse, immediately disable it to unblock themselves. To be fair, that behavior is totally understandable, since most application developers are constantly dealing with product initiatives and product deadlines. I remember that on one occasion a few of our teams disabled some of our security jobs because they didn’t have the bandwidth to update their projects’ dependencies. That’s dangerous behavior, and that’s where compliance frameworks can save you. The beauty of compliance frameworks is that nobody has permission to modify the compliance configuration and compliance jobs except for a handful of people on the infrastructure team. Eventually, all of your critical jobs, including security, quality, testing, SOX, and QA jobs, should be moved to compliance frameworks, meaning that if you’re using a standard pipeline you don’t have permission to disable dependency or security checks in your repository. Just to clarify, you only want to be strict about your critical jobs; the other steps in your pipelines should be fully customizable.
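The core idea, critical jobs that a repository cannot change next to jobs it can freely customize, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the job names, the dictionary layout and the function effective_pipeline are hypothetical and do not correspond to the configuration format of any specific CI product.

```python
from copy import deepcopy

# Hypothetical compliance jobs (assumption), owned by the infrastructure team.
COMPLIANCE_JOBS = {
    "dependency-scan": {"stage": "security", "allow_failure": False},
    "secret-detection": {"stage": "security", "allow_failure": False},
}


def effective_pipeline(project_jobs: dict) -> dict:
    """Merge a project's jobs with the compliance jobs.

    Project jobs are kept as they are, except that any project job reusing
    a compliance job's name is dropped, so a repository cannot disable or
    weaken a compliance job by redefining it.
    """
    merged = {
        name: deepcopy(job)
        for name, job in project_jobs.items()
        if name not in COMPLIANCE_JOBS
    }
    # Compliance jobs are always added from the protected definition.
    merged.update(deepcopy(COMPLIANCE_JOBS))
    return merged


if __name__ == "__main__":
    project = {
        "unit-tests": {"stage": "test"},
        # Attempt to make the security job optional; it will be ignored.
        "dependency-scan": {"stage": "security", "allow_failure": True},
    }
    for name, job in effective_pipeline(project).items():
        print(name, job)
```

The design choice worth noting is that the merge happens outside the repository. As long as the project can only supply project_jobs, the strict part stays strict and everything else remains customizable.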
Migrations
Now that you have all the newly redesigned, standardized, and consistent pipelines, how do you actually migrate over to them without causing a huge disruption? The short answer is that it depends on your company, and companies handle migrations differently, but I think there are normally two main approaches. Let me use an analogy to explain them. Let’s say your AC is broken. What do you do? Do you call some technicians, or do you fix it yourself with some help from your family members or friends? If you fix it yourself it will be cheaper but very time-consuming. The AC unit is heavy, so you definitely need one or two people to help you. Since you’re not an expert, it’s possible that you will make some mistakes along the way. On the other hand, if you call some technicians it will be more expensive, but they will do it quickly and they will bring all the required tools and extra workers if needed. Plus, since they’re subject-matter experts and have fixed hundreds of AC units before, they most likely won’t make any mistakes.
I think you get the idea. If you have enough resources on your infrastructure team, or you can afford to hire contractors, it makes a lot of sense to use your subject-matter experts rather than delegating the migration process to the teams themselves. At one of my previous jobs, we didn’t have enough resources on our infrastructure team and we didn’t have the budget to hire contractors, so we documented all the steps thoroughly and then asked each and every team to do the migrations themselves. We had a few hundred active repositories and many excited and committed engineers. In addition, product managers had agreed to give engineers more bandwidth to work on the pipelines instead of product initiatives. Since everyone was on board, we estimated that the whole migration process would take around 12 months. To me, it was an ambitious goal but totally doable.
However, a few months into the process we realized things were not moving as fast as we thought. Although all the steps were documented, engineers were struggling. You might ask why. Exactly as in the AC analogy, those engineers weren’t subject-matter experts. If something went wrong they didn’t know how to troubleshoot it, and they had to ask the subject-matter experts on our infrastructure team. As a result, a year and a half later we had an overwhelmed infrastructure team and a bunch of frustrated engineers, and the migration process had become super slow and more or less halted. The sad part is that although management was unhappy, they didn’t provide any solutions. Three years later we had only 65% of our repositories migrated, and I think it would probably take two more years for the company to be completely off the legacy pipelines. The lesson here is that if you can afford to build an army of technicians who can go into each and every project and update the pipelines, do it and don’t hesitate. Trust me, it’s totally worth it. Just don’t forget to document everything for your application developers so that they can debug future issues more easily.
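As a rough sanity check on the numbers above, the following Python sketch extrapolates the remaining migration time from the progress so far. It assumes the migration keeps moving at its average historical rate, which the story above suggests may be optimistic; the function name and the linear model are illustrative assumptions, not a planning method used at the company.

```python
def months_remaining(fraction_done: float, months_elapsed: float) -> float:
    """Linear projection of the months left until 100% is migrated.

    Assumes the average rate observed so far stays constant (assumption).
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    if months_elapsed <= 0:
        raise ValueError("months_elapsed must be positive")
    # Equivalent to (1 - fraction_done) / (fraction_done / months_elapsed).
    return (1 - fraction_done) * months_elapsed / fraction_done


if __name__ == "__main__":
    # 65% of repositories migrated after three years (36 months).
    print(round(months_remaining(0.65, 36), 1))
```

With 65% done after 36 months, the linear projection comes out at about 19 months, which is in the same range as the two-year estimate and far from the original 12-month plan for the whole migration.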
Conclusion
I cannot emphasize enough how beneficial robust CI/CD pipelines are. They can significantly improve your velocity, your productivity, and the number and frequency of your releases. If you have a distributed system, they can bring consistency to your microservices and your infrastructure. They can also help you be more disciplined by providing compliance checks that make sure everything is up to your security and quality standards. I understand that it’s not a trivial task. Building robust pipelines can be very involved: from designing to maintaining to doing the actual migrations, all of these steps require proper planning and resources. I hope that sharing my experience helps you be better prepared for the challenges.