With great power comes great responsibility: an age-old proverb that applies readily to Infrastructure-as-Code (IaC). Under a DevOps model, organisations need to promote effective CD/CI pipelines and avoid hampering their agility wherever possible. However, given how frequently code is deployed, there’s every possibility that DevOps teams view security as an afterthought at best and a burden at worst. On the one hand, IaC is revered by sysadmins and developers worldwide for its ease of use and speed of deployment. On the other, large codebases carry security risks, and even a small code change can be detrimental, so the implications can be profound.
Though we as a team were able to stand up a complete infrastructure environment using automated configuration management tools, there was trepidation every time we went through a code collation process. Firstly, we didn’t know whether the code would work as intended; secondly, there were many unknowns around the undesired effects on the infrastructure of continually committing code. CD/CI has long been ingrained in the application development space, but applying development test methods to an infrastructure environment calls for more than a cookie-cutter approach. This article explores some of the procedures we employed and rounds up various best practices from other organisations that are further along in their DevOps journeys.
Testing to automate
It’s not this article’s intention to examine the advantages of a CD/CI pipeline; there are several articles out there that do. Instead, we will look closely through the lens of automated testing, a core element of CD/CI, and consider how handling infrastructure code might differ from handling application code. Automated testing is typically divided into two categories: ‘unit’ and ‘integration’.
Unit Testing: Here we’re testing isolated parts of the code individually, without interacting with other components. In some cases unit testing proved to be a challenge, as infrastructure generally needs to be tested as a whole. Unit testing can certainly provide value in syntax validation but, in the absence of complex logic in an Ansible YAML file, for example, this testing is relatively primitive.
Integration Testing: In some respects, we can view infrastructure code as essentially being entirely ‘integration’: once executed, you’re changing the state of some infrastructure. A key point I want to emphasise is that you do not want to be interacting with live infrastructure every time you merge your code, as continual changes carry a high risk of corrupting the environment.
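As an illustration of a unit-level check that never touches infrastructure, the sketch below validates the text of a rendered resolv.conf file, confirming that every nameserver falls inside an allowed CIDR range. The function name, the sample file contents and the 10.20.0.0/16 range are all assumptions made up for this example; they do not come from any particular tool.

```python
import ipaddress


def nameservers_outside(resolv_conf_text, allowed_cidr):
    """Return the nameserver addresses that are not inside allowed_cidr."""
    network = ipaddress.ip_network(allowed_cidr)
    outside = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        # Only 'nameserver <address>' lines matter; comments and options are skipped.
        if len(parts) >= 2 and parts[0] == "nameserver":
            address = ipaddress.ip_address(parts[1])
            if address not in network:
                outside.append(str(address))
    return outside


# Hypothetical rendered output of a configuration template.
RENDERED = """\
# managed by configuration management
search example.internal
nameserver 10.20.0.2
nameserver 8.8.8.8
"""

print(nameservers_outside(RENDERED, "10.20.0.0/16"))  # ['8.8.8.8']
```

Because the check works on plain text, it can run on every commit in well under a second, which is where this kind of test earns its keep.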
Certainly, the emergence of IaC has not only drastically reduced the time to deploy but also placed more accountability on DevOps teams to protect their infrastructure, away from the traditional jurisdiction of designated security teams. What’s more, if organisations are mature enough in their DevOps practices, security teams will often be reviewing environment changes retrospectively, which speaks volumes about the rate of change in the industry today. With code being merged, and consequently executed, several times a day, there is a higher probability of corrupting an environment, e.g. by exhausting the API calls a server makes to an external service on start-up, or through the unintended consequences of continually rebooting a BMC node. Beyond that, even a simple configuration change can expose sensitive data.
IaC testing tools such as InSpec and Terratest certainly provide useful functionality for unit testing: for example, checking that a resolv.conf declaration in my Ansible playbook includes a DNS entry within a set CIDR range, or that Apache is running as a precursor to a subsequent playbook. These tests run in seconds and sit closest to the developer, so short feedback cycles are the key advantage here. These tools extend to ‘platform readiness’ tests too, where we may need to validate that particular URLs and ports are up and working for the environment to operate. Unfortunately, these types of tests are usually measured not in seconds but in minutes, especially with sizeable Terraform or CloudFormation templates, not to mention the multitude of external moving parts. Likewise, if we do choose to accept the risk and perform configuration updates on live systems, we need conditions for rolling back, for example when a service or port doesn’t come up within a certain timeframe. There are too many variables in a live production environment to bake ‘hard’ conditions into our testing: a database may have twice the load it did yesterday and require a few more seconds to come up today.
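One way to avoid a ‘hard’ condition in a readiness test is to poll with a time budget rather than check once. The following is a minimal sketch using only the Python standard library; the function name and the default timings are assumptions chosen for illustration, and a real pipeline would tune them per environment.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not accepting connections yet; give up only once the budget is spent.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A deployment step could call something like `wait_for_port("127.0.0.1", 8080, timeout_s=60.0)` (a hypothetical endpoint) and trigger the rollback only when it returns False, so a database that needs a few more seconds today does not fail the run.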
Certainly, in a DevOps operating model where developers have an augmented role in administering infrastructure, there is a danger in relying on a group that (arguably) has an instinctive preference for functionality over security. It becomes very easy to treat security as an afterthought when entire environments are defined in JSON or YAML files and deployed at the push of a button. It’s therefore not enough to include security as a checkpoint in the process model; secure practices need to be embedded in the CD/CI pipeline.
Versioning and blue/green deployment should really be at the heart of a CD/CI platform in our case. A single stack definition should be re-used to create and update each environment (staging, development, production); think of it as a ‘golden’ stack for infrastructure. Once new code is committed, the following process should be triggered:
- The CD/CI server detects the change and puts a copy of the new definition file (untested) in the shared repo
- The CD/CI server applies the definition version to the first environment
- Automated tests are run to ensure all services (critical and non-critical) are still operational (e.g. the environment can still connect to the AD server and required third-party packages can still be installed)
- If the above fails, the CD/CI server ceases testing and rolls back to the golden version of the definition stack. If successful, the CD/CI server promotes the new definition version to the next environment up the approval chain (model or production)
- A new ‘golden’ stack is rubber-stamped and taken as the new baseline. Its patching, OS level etc. can be uniquely identified by the image ID of that server.
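The steps above can be sketched as a small promotion loop. This is an illustrative model only, not the implementation of any particular CD/CI server: the `Pipeline` class, its field names and the environment names are hypothetical, and `apply` and `smoke_test` stand in for whatever tooling actually deploys and tests a definition. The sketch also assumes that a failure rolls back every environment the candidate has reached, so that all environments stay on one definition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    environments: List[str]              # ordered, lowest environment first
    apply: Callable[[str, str], None]    # (environment, version) -> None
    smoke_test: Callable[[str], bool]    # environment -> still operational?
    golden: str                          # last known-good definition version
    deployed: Dict[str, str] = field(default_factory=dict)

    def promote(self, candidate: str) -> bool:
        """Apply candidate to each environment in order; roll back on failure."""
        touched = []
        for env in self.environments:
            self.apply(env, candidate)
            self.deployed[env] = candidate
            touched.append(env)
            if not self.smoke_test(env):
                # Assumption: restore the golden version everywhere the candidate reached.
                for reached in touched:
                    self.apply(reached, self.golden)
                    self.deployed[reached] = self.golden
                return False
        # Every environment passed, so the candidate becomes the new baseline.
        self.golden = candidate
        return True


# In-memory stand-ins for real deployment and test tooling.
applied = []
pipeline = Pipeline(
    environments=["development", "staging", "production"],
    apply=lambda env, version: applied.append((env, version)),
    smoke_test=lambda env: True,
    golden="v1",
)
print(pipeline.promote("v2"), pipeline.golden)  # True v2
```

Keeping `deployed` as an explicit record is what gives the visibility into which definition version each environment is running at any point in time.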
This approach of carefully notarising environments (coupled with versioning) characterises the concept of immutable infrastructure (servers are never modified after they’ve been deployed). It gives visibility of what code was used for any environment at any point in time. Previously, we were hitting the infrastructure with incremental changes without any real concept of a baseline to roll back to, paving the way for the dreaded ‘configuration drift’. As a result, a lot of the team’s development effort was wasted on troubleshooting, because we didn’t know whether the last deployment or a preceding one was the root cause of a bug.
No silver bullet
However, even with a concerted security focus baked into our integration efforts, some other security mechanisms, such as vulnerability management, still fall by the wayside. This risk can be somewhat alleviated by performing a vulnerability assessment against the golden image. Naturally, we can have a higher degree of confidence in that assessment if we are able to guarantee that this ‘master’ is replicated, unmodified, across the entire environment.
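A simple way to check that the assessed image really is the one in use is to compare each server’s image ID with the golden one. The sketch below assumes an inventory is already available as a mapping of server name to image ID; the function name, server names and image IDs are invented for illustration. Note that a matching image ID only shows what a server was launched from, not that it has stayed unmodified since.

```python
def find_drift(golden_image_id, fleet):
    """Return the names of servers whose image ID differs from the golden one."""
    return sorted(
        name for name, image_id in fleet.items() if image_id != golden_image_id
    )


# Hypothetical inventory: server name -> image ID it was launched from.
fleet = {"web-1": "img-0a1", "web-2": "img-0a1", "db-1": "img-09f"}

print(find_drift("img-0a1", fleet))  # ['db-1']
```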
With that being said, some things can only be done in production, whether because a commercial constraint forces the use of different hardware or, most likely, because the sheer scale of production cannot be reproduced. This calls for scanning and monitoring tools like Qualys that continually scan for vulnerabilities and detect malware in real time via an API. Because these tools provide deep introspection into the system (and consequently require the necessary expertise), they only heighten the need for security personnel to be embedded in DevOps teams from the first day.
The above only further highlights the core traits of DevOps as a collaborative and dynamic practice that involves not just continual integration of the code itself, but a recurrent exercise of testing, security compliance and solution realignment (if necessary). In summary, I believe CD/CI in an infrastructure context still provides all the benefits of static validation that it does in the development world. However, there should be a strong emphasis on safeguards that not only protect the infrastructure but also ensure standardisation and consistency in the way it is rolled out between environments.