New Two-Stage Rate Limiting with NGINX Plus

NGINX Plus R17 has been generally available for some months now, and while the cornerstone feature of the release is support for TLS 1.3 (the latest version of the protocol securing Internet traffic), I wanted to dedicate this blog to one of the other features of the all-in-one web server, content cache and load balancer: rate-limiting.

Rate-limiting is by no means a novel capability of HTTP servers, and NGINX has a long history with it through its ngx_http_limit_req_module module. Rate-limiting allows us to constrain the number of HTTP requests a user can make over a given period of time. Viewed as a fundamental security mechanism, it is employed as a primary defense against DDoS attacks, protecting upstream application servers from being inundated.

Prerequisites
  • An NGINX Plus subscription (purchased or trial)
  • A supported operating system
  • root privilege
  • Your NGINX Plus certificate and public key (nginx-repo.crt and nginx-repo.key files), provided by email from NGINX, Inc.
  • I’m installing NGINX Plus on a CentOS7 machine, the reason being that we will be making use of Siege, a Linux program for HTTP stress testing.

NGINX has a detailed installation guide covering several operating systems here that should get you up and running within minutes.

Leaky Bucket

NGINX’s rate-limiting follows the industry-standard ‘leaky bucket’ algorithm, based on the analogy that if the average rate at which water is poured into a bucket exceeds the constant rate at which it leaks out, the bucket overflows and the excess is discarded. Notice the emphasis on the term ‘average’, as it is assumed the input rate can vary, which is why the concept of ‘bursting’ is introduced. The advantage of such an approach is that sequences of bursts are smoothed out by processing them at a constant rate. The leaky bucket enforces first-in, first-out (FIFO) processing. NGINX previously handled excessive requests (those exceeding the set rate) by either rejecting them immediately or queuing them to be processed later, within the limits of the defined rate.

Firstly, it’s important to understand the rejection logic of the limit_req directive that remains unchanged from previous NGINX Plus releases:

Zone – This sets the ‘shared’ memory zone in which the state of incoming requests is kept. Keeping it shared means the information is available across all NGINX worker processes. The $binary_remote_addr key holds the binary representation of the client’s IPv4 address; using this key means state information for approximately 16,000 IP addresses can be stored in every 1MB of zone memory. Following the ‘leaky bucket’ analogy, this means we are limiting each unique IP address to the request rate defined, generally in r/s or r/m.

Rate – Sets the maximum rate of requests over a given time interval. If we do not set a burst value, this rate is absolute. Say we have a defined rate of 1r/s, allowing one request every second for a given zone. When the first request matching the zone comes in, NGINX sets its ‘request-accepting’ flag to false while it processes that request. If another request arrives before the second has elapsed, it is rejected with a 503 status code.

Burst – This is an optional parameter that dictates how many incoming requests the server can accept over and above the base rate. Best viewed as an extension of the ‘leaky bucket’, NGINX issues tokens based on a counter of 1 (the absolute value of the rate) plus the burst value. Every time the timer ticks over (with 5r/s, for example, every 200 milliseconds), the burst counter is replenished by one if it is not already at its maximum. NGINX with a burst value set therefore accepts excessive requests while burst tokens are available, but only processes them once it has the capacity to do so (i.e. within the constraints of the rate limit). This is all best explained with the aid of an example, which will be demonstrated in the next steps of this blog.
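
Pulling these pieces together, the directives above look like the following in nginx.conf (the same zone, rate and burst values used in the full demo config at the end of this post):

limit_req_zone $binary_remote_addr zone=ip:10m rate=5r/s;

server {
    location /nodelay {
        limit_req zone=ip burst=12;
    }
}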

Evolution of the Delay

nodelay has been around pretty much since NGINX’s inception; it forces the burst queue to be processed immediately. When the burst is instead drained gradually, in line with the rate limit, this is often not very practical, as our website may appear slow. There may be a number of resources that need to be pulled down simultaneously, such as stylesheets, images and JS code. With nodelay set, further requests (up to the burst limit) are served immediately, as long as there is a slot available for them in the queue.
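
In configuration terms, nodelay is simply appended to the limit_req directive, for example:

limit_req zone=ip burst=12 nodelay;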

The R17 release takes this one step further with the introduction of the delay parameter, and with it two-stage rate-limiting in NGINX is born. This method ensures pages don’t load slowly while at the same time imposing more fine-grained throttling to prevent overload of the back end. Two-stage rate-limiting means excessive requests are initially delayed and then ultimately rejected if the rate limit is still exceeded.

If, for example, we know we never have more than 12 resources per page, we could include a configuration that allows bursts of up to 12 requests, where the first 8 are processed without delay. The delay parameter is the threshold, expressed as an absolute number of excessive requests, beyond which further excessive requests are delayed.
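
In the demo config at the end of this post, this translates to the following directive:

limit_req zone=ip burst=12 delay=8;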

I want to note that this is the same example referenced in NGINX’s documentation; however, I feel the accompanying explanation there is a little misguided. By employing the stress-testing program Siege, together with the diagrams I’ve put together below, I hope to bring some clearer light to the commentary around the demo by allowing us to look at a more granular, per-request view.

Let’s move on to visualize this scenario by emulating some HTTP traffic to our web server and take note of the differences in rate-limiting methods. 

To use a simple example, let’s say we have a rate of 5r/s, and a burst of 12. NGINX receives 15 requests at the same time:

  • The first one is accepted and processed
  • Because we allow 1 + 12 (one plus the burst value), the remaining 2 requests are immediately rejected with a 503 status code
  • The other 12 will be processed, one by one, but not immediately: they are handled at the rate of 5r/s to stay within the limit we set. You can see from the illustration below the span in which these requests are accepted, with one block representing one second.

All of this takes place unbeknownst to the upstream server(s), which only ever receive requests at the capped rate of 5r/s.

To further this visual representation, we can use Siege, a multi-threaded HTTP stress tester. This program allows us to simulate a configurable number of concurrent users executing requests against a web server.

In our example, we’re firing 15 parallel requests at a rate-limited endpoint (-c 15 spawns 15 concurrent users, -r 1 runs a single repetition each, and -b runs in benchmark mode with no internal delays). The output can be seen below:

[root@nginx-node nginx]# siege -b -r 1 -c 15 -d 1 http://localhost/nodelay:80
** SIEGE 4.0.2
** Preparing 15 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 503     0.00 secs:     197 bytes ==> GET  /nodelay:80
HTTP/1.1 503     0.00 secs:     197 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.40 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.60 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.80 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.40 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.60 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.80 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.40 secs:     612 bytes ==> GET  /nodelay:80

Transactions:                     13 hits
Availability:                  86.67 %
Elapsed time:                   2.40 secs
Data transferred:               0.01 MB
Response time:                  1.20 secs
Transaction rate:               5.42 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                    6.50
Successful transactions:          13
Failed transactions:               2
Longest transaction:            2.40
Shortest transaction:           0.00

From the Siege report above, we can see the pattern is consistent with what was illustrated in the diagram. Two requests (15 - (burst value + 1)) are immediately rejected, and the remaining 13 are handled in accordance with the set rate of 5r/s, or one request every 200 milliseconds.

But now we want to introduce some delay, because we want our user(s) to pull down all the resources of our login page without high latency. So let’s configure a delay of 8, which enforces a delay after 8 excessive requests while still maintaining the rejection policy from above: anything beyond the 12 excessive requests permitted by our burst parameter will be rebuffed.

With this configuration in place, and maintaining our previous request stream of 15 parallel requests, we can expect to see behavior as depicted in the time-series diagram below: 

Again, we notice the same number of rejected requests (2); however, the time frame in which the accepted requests are handled is significantly shorter than before:

  • The first eight are proxied without delay by NGINX Plus
  • Again, 2 requests are rebuffed with a 503 status code
  • The remaining 5 are processed, one by one, at a rate of 5r/s, which allows all of them to be accepted within roughly one second

The Siege report allows us to see what’s happening as time elapses between requests:

[root@nginx-node nginx]# siege -b -r 1 -c 15 -d 1 http://localhost/delay:80
** SIEGE 4.0.2
** Preparing 15 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.01 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.01 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 503     0.01 secs:     197 bytes ==> GET  /delay:80
HTTP/1.1 503     0.01 secs:     197 bytes ==> GET  /delay:80
HTTP/1.1 200     0.20 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.40 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.60 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.80 secs:     612 bytes ==> GET  /delay:80

Transactions:                     13 hits
Availability:                  86.67 %
Elapsed time:                   0.80 secs
Data transferred:               0.01 MB
Response time:                  0.16 secs
Transaction rate:              16.25 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                    2.55
Successful transactions:          13
Failed transactions:               2
Longest transaction:            0.80
Shortest transaction:           0.00

It should be noted that in both of these examples, the effect of completed requests freeing up slots in the burst queue is ignored, although that is exactly what we would see in reality: with each time increment, further excessive requests would be accommodated within the configured burst size.

Conclusion

When imposing a delay, we saw no difference in the number of rejected requests, but a change in the way incoming traffic is shaped. Instead of queuing every excessive request (those exceeding the rate limit), which is what happens when no delay is enforced, we’re able to deliver important web content immediately and ensure our users don’t experience high latency. This is a powerful tool if we know how many resources need to be pulled down, on a page-by-page basis, allowing us to be more surgical when it comes to handling traffic.

This may seem trivial in a small-scale demo; however, when social media and large e-commerce sites are designing their reverse-proxying infrastructure for massive load, the advanced features of NGINX rate-limiting pave the way for higher levels of precision as a rate-limit/traffic-policy enforcer. Otherwise, genuine security threats have every potential to harm the performance, and consequently the customer experience, of the web application.

This demo made use of the following simple NGINX config:

http {
    limit_req_zone $binary_remote_addr zone=ip:10m rate=5r/s;

    server {
        server_name localhost;
        root /usr/share/nginx/html;
        listen 127.0.0.1:80;
        access_log /var/log/nginx/access.log combined;

        location /delay {
            limit_req zone=ip burst=12 delay=8;
            try_files /index.html =404;
        }

        location /nodelay {
            limit_req zone=ip burst=12;
            try_files /index.html =404;
        }
    }
}

Infrastructure-as-Code: Securely Testing the CD/CI Pipeline

With great power comes great responsibility – an age-old proverb that we can aptly apply to Infrastructure-as-Code (IaC). Under a DevOps model, it’s important for organisations to promote effective CD/CI pipelines and to avoid hampering their agility as best they can. However, with the frequency at which code is deployed, there’s every possibility that security is viewed by DevOps teams as an afterthought at best, and a burden at worst. On one hand, IaC is revered by sysadmins and developers worldwide for its ease of use and speed of deployment; on the other, the security risks that come with managing large codebases, and the fact that even small code changes can be detrimental, mean the implications can be profound.

Though we as a team were able to stand up a complete infrastructure environment using automated configuration-management tools, there was trepidation every time we went through a code-collation process. Firstly, we didn’t know if the code would work as intended, and secondly, there were many unknowns around the undesired effects on the infrastructure of continually committing code. CD/CI has been ingrained in the application-development space for years; however, applying development test methods to an infrastructure environment should not be met with a cookie-cutter approach. This article explores some of the procedures we employed, as well as rounding up various best practices from organisations well advanced in their DevOps journeys.

Testing to automate

It’s not this article’s intention to examine the advantages of a CD/CI pipeline – there are several articles out there that do. Instead, we will look closely through the lens of automated testing, a core element of CD/CI, and consider how to handle infrastructure code in ways that might differ from application code. We typically talk about automated testing falling into two categories: ‘unit’ and ‘integration’.

Unit Testing: Here we’re testing isolated parts of the code individually, without interacting with other components. Unit testing in some cases proved to be a challenge, as infrastructure generally needs to be tested as a whole. It can certainly provide value in syntax validation, but in the absence of complex logic in, say, an Ansible YAML file, this testing is relatively primitive.

Integration Testing: In some respects, we can view infrastructure code as essentially being entirely ‘integration’ – once executed you’re changing the state of some infrastructure. A key point I want to emphasise is that every time you merge your code you do not want to be interacting with live infrastructure as continual changes engender a high risk of corrupting the environment.

Certainly, the emergence of IaC has not only drastically sped up time to deploy, but also placed more accountability on DevOps teams to protect their infrastructure, away from the traditional jurisdiction of designated security teams. What’s more, if organisations are mature enough in their DevOps practices, security teams will often be reviewing environment changes retrospectively – which speaks volumes about the rate of change in the industry today. Now, with code merges, and consequently executions, happening several times a day, there is not only a higher probability of corrupting an environment (e.g. exhausting API calls from a server to an external service that runs on start-up, or the unintended consequences of continually rebooting a BMC node), but even a simple configuration change can be susceptible to exposing sensitive data.

IaC testing tools such as InSpec and Terratest certainly provide useful functionality for unit testing – for example, checking that a resolv.conf declaration in my Ansible playbook includes a DNS entry from a set CIDR range, or that apache is running as a precursor to a subsequent playbook. These tests can be actioned in seconds and sit closest to the developer, so short feedback cycles are a key advantage here. Additionally, these tools extend to ‘platform readiness’ tests, where we may need to validate that particular URLs and ports are up and working for the environment to operate. Unfortunately, however, these types of tests are usually measured not in seconds but in minutes, especially if we’re talking about sizeable Terraform or CloudFormation templates, not to mention the multitude of external moving parts. So too, if we do choose to accept the risk and perform configuration updates against live systems, we have to have conditions to roll back if, for example, a service or port doesn’t come up within a certain timeframe. There are too many variables in a live production environment to bake ‘hard’ conditions into our testing – a database may have twice the load it did yesterday and require a few more seconds to come up today.
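
To make that concrete, here is a rough sketch of the kind of check described above, expressed as an Ansible play rather than InSpec or Terratest (the paths and values are purely illustrative):

- hosts: all
  gather_facts: false
  tasks:
    - name: resolv.conf must contain a nameserver from the expected range
      command: grep -E '^nameserver 10\.0\.' /etc/resolv.conf
      changed_when: false

    - name: apache must be running before the next playbook is applied
      command: systemctl is-active httpd
      changed_when: false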

Certainly, in a DevOps operating model where developers have an augmented role in administering infrastructure, there is a danger in relying on a group that (arguably) has an instinctive preference for functionality over security. It becomes very easy to treat security as an afterthought when entire environments are defined in JSON or YAML files and deployed at the push of a button. It’s therefore not enough to include security as a checkpoint in the process model; secure practices must be embedded in the CD/CI pipeline itself.

The approach

Versioning and blue/green deployment should really be at the heart of a CD/CI platform for our case. A single stack definition should be re-used to create and update each environment (staging, development, production) – think of it as a ‘golden’ stack for infrastructure. Once new code is committed the following process should be triggered:

  1. The CD/CI server detects the change and puts a copy of the new definition file (untested) in the shared repo
  2. The CD/CI server applies the definition version to the first environment
  3. Automated tests are run to ensure all services (critical and non-critical) are still operational (i.e. can still connect to the AD server, required third-party packages can still be installed etc.)
  4. If the above fails, the CD/CI server ceases testing and rolls back to the previous ‘golden’ definition stack. If successful, the CD/CI server promotes the new definition version to the next environment up the approval chain (model or production)
  5. A new ‘golden’ stack is rubber-stamped as the new baseline. Its patching, OS level etc. can be identified solely by the image ID of that server.

This approach of carefully notarising environments (coupled with versioning) characterises the concept of immutable infrastructure (servers are never modified after they’ve been deployed). It gives visibility of exactly what code was used for any environment, at any point in time. Previously, we were hitting the infrastructure with incremental changes without any real concept of a baseline to roll back to, paving the way for the dreaded ‘configuration drift’. This meant that a lot of the team’s development effort was wasted on troubleshooting, because we didn’t know whether the last deployment or a preceding one was the root cause of a bug.

No silver bullet

However, even with a concerted security focus baked into our integration efforts, some other security mechanisms still fall by the wayside, such as vulnerability management. This risk can be somewhat alleviated by performing vulnerability assessments against the golden image. Naturally, we can have a higher degree of confidence if we can guarantee this ‘master’ is replicated, unmodified, across the entire environment.

With that being said, some things can only be done in production – whether because a commercial constraint forces the use of different hardware or, in the most likely of scenarios, because the sheer scale of production cannot be reproduced. This calls for scanning and monitoring tools like Qualys that continually scan for vulnerabilities and detect malware in real time via an API. With these tools providing deep introspection into the system (and consequently requiring the necessary expertise), the need for security personnel to be embedded in DevOps teams from day one is only heightened.

Conclusion

The above only further highlights the core traits of DevOps as a collaborative and dynamic practice, one that involves not just continual integration of the code itself, but a recurrent exercise of testing, security compliance and solution realignment where necessary. In summary, I believe CD/CI in an infrastructure context still provides all the benefits of static validation that it does in the development world; however, there should be a strong emphasis on safeguards that not only protect the infrastructure but also ensure standardisation and consistency in the way it is rolled out between environments.

Publishing Service APIs with MuleSoft

It’s now imperative for organisations to seize the opportunities of participating in the API economy, as the success stories of tech giants like Google, Amazon and Facebook highlight the importance of public APIs in driving their business. At the very least, strategies to open up APIs internally will break down information silos and unlock data. By making APIs publicly available, however, companies can externalise innovation and turn every third-party developer into a product augmenter.

The evolution of cloud and IoT only increases the value of APIs, as every customer touchpoint will find itself interacting with APIs, with elasticity supporting the scaling of those cloud applications. MuleSoft provides an all-in-one platform to handle the entire life-cycle of APIs, from design to maintenance, and its Anypoint Studio allows us to build APIs simply, via a graphical interface and one-click access to an extensive library of pre-built connectors and templates.

TM Forum is a global industry association providing best-practice guidelines and industry research to its enterprise members, the majority of which operate in the telecommunications sector, though they also extend to digital service providers and software suppliers. One of TM Forum’s featured programs is its Open API initiative, a suite of over 50 REST APIs that provide a standardised structure for CSPs to manage services and integration between internal systems as well as externally with partners and customers.

In this guide, we will be using TM Forum’s Open API standard for Service Activation and Configuration (TMF640) as the framework for service creation – specifically, the creation of an NSX® Edge Services Gateway. For more information on VMware NSX® for vSphere® and the SDDC architecture, click here.

Prerequisites
  • A machine running Windows, OSx or Linux
  • An installed version of Postman. A REST client, Postman will play the role of the API consumer.
  • A VMware vSphere® environment including required credentials for the NSX® Manager. The NSX® Manager requires port 443/TCP for REST API requests.
Installing and Configuring Java

If you’re already up and running with Java 8, skip this step; otherwise, here are some basic steps for installing and configuring Java on Windows.

Click here and download the appropriate Java SE 8 for your operating system. 

You will need to edit the path in your system variables to accommodate Java.

  1. Go to Control Panel
  2. Click Advanced System Settings
  3. Click Environment Variables
  4. In the section System Variables, find the PATH environment variable and click Edit.
  5. In the Edit System Variable window, specify the value of the Java PATH (where the JDK software is located, for example, C:\Program Files\Java\jdk1.8.0_201)
Installing Anypoint Studio

Anypoint Studio can be downloaded from https://www.mulesoft.com/lp/dl/studio on a 30-day free trial. 

Note: Following installation of Anypoint Studio 7.x, I noticed there was an issue when importing the RAML specification that goes on to auto-create a Mule flow (detailed in Step 4 below). I have submitted a bug report with MuleSoft Support; however, for the time being I will be using an older version of Anypoint Studio (6.5) for this demonstration.

The Anypoint Studio application appears when the unzip operation completes. It may even be that you have set your path but Anypoint Studio still prompts you to install the JDK (shown below).

If this is the case, you can point Anypoint Studio directly at Java by updating the AnypointStudio.ini file, located in the same directory as the application. You do this by adding a “-vm” entry and including the path as per your system.

An example is shown below:
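
Following the standard Eclipse .ini convention, the -vm flag goes on its own line, with the path to javaw.exe on the line directly beneath it, above the -vmargs section (adjust the path to your own JDK install):

-vm
C:\Program Files\Java\jdk1.8.0_201\bin\javaw.exe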

Converting TM Forum Swagger 2.0 to RAML

A caveat of working with the TM Forum specification is that RESTful API Modelling Language (RAML) is not supported. As MuleSoft works with RAML or OAS when it comes to API design, we need to convert the TMF640 standard from Swagger 2.0 to RAML.

You can find the Swagger 2.0 source code on TM Forum’s Github repo here.

In terms of conversion methods, there are two online converters we could opt to use:

  1. APIMATIC’s online conversion tool (https://www.apimatic.io/transformer) – however, you must sign up with a valid email to use it
  2. MuleSoft’s own OAS RAML Converter (https://mulesoft.github.io/oas-raml-converter/)
Create a Project and Import the RAML Specification

We will want to create a new project and import the RAML specification we have just converted. 

  • Go to File > New > Mule Project to create a project, and set the field values in the project wizard:
    1. Enter a Project Name
    2. Under APIkit Settings, select the path of your RAML file
    3. Click Finish

The APIkit provided by Anypoint Studio allows us to create Mule flows from RAML definitions. This should be apparent when Anypoint Studio loads the definition file: you will notice that a message flow has been produced, reflecting the methods and resources defined in the imported schema.

Anypoint Studio offers a multitude of pre-built connectors that allow connection to almost any endpoint, including widely used third-party apps such as Salesforce and SAP. For the purposes of this demonstration, our focus will be limited to HTTP Connectors. The HTTP Connector has a Listener operation that receives requests over the HTTP or HTTPS protocol; receiving a request, among other events, initiates a Mule event. This may be one of the first things you notice in the message flow that has been created.

It is best to think of the HTTP Listener as the functionality provider: we will be exposing an API that a consumer (in our case, Postman) will call. However, our message flow so far only provides a front end – we are not calling any services – so we have to introduce an HTTP Requester that will make the outbound request to vSphere® to create our NSX® Edge Gateway. In essence, we are both an API provider and an API client here. A request will be received in the form of a consumer posting data shaped by our TM Forum schema (in JSON format), triggering a Mule flow that translates the incoming data into the format required by the outbound request to our external HTTPS URL.

Under the General tab of the HTTP Listener, we will define the host and port where the server will be set up, as well as the protocol to use: HTTP for plain connections or HTTPS for TLS connections. We will implement the former. 

Obtaining and translating the NSX® API XML

The NSX® REST API library is defined by a collection of XML documents. The elements defined in these schemas, along with their attributes and composition rules, represent the data structures of NSX® objects. The target of our requests will be the URL of the NSX® Manager communicating over port 443/TCP. 

Using the latest version (at the date of publishing) of the NSX® API Guide here, from page 256 we can find instructions on how to deploy an Edge Services Gateway (ESG). The appliance is stored in Open Virtualization Format (OVF); once the request is made, the API copies the NSX® Edge OVF from the NSX® Manager to the specified datastore and deploys an NSX® Edge in the given datacentre.

If we consider all of the configurable parameters for deploying the virtual appliance, we can make use of the XML request body detailed on page 257 of the above guide.

However, for this demonstration we’re only interested in deploying an appliance with a minimal, baseline configuration. Therefore, we can trim the request body down to the following as input for our straightforward deployment.

<edge>
  <name>NSX-Edge-45</name>
  <datacenterMoid>datacenter-1</datacenterMoid>
  <type>gatewayServices</type>
  <appliances>
    <applianceSize>compact</applianceSize>
    <appliance>
      <resourcePoolId>domain-1</resourcePoolId>
      <datastoreId>datastore-1</datastoreId>
    </appliance>
  </appliances>
  <vnics>
    <vnic>
      <index>0</index>
      <portgroupId>dvportgroup-10</portgroupId>
      <addressGroups>
        <addressGroup>
          <primaryAddress>10.10.10.100</primaryAddress>
          <subnetMask>255.255.255.0</subnetMask>
        </addressGroup>
      </addressGroups>
      <isConnected>true</isConnected>
      <macAddress>
        <edgeVmHaIndex>0</edgeVmHaIndex>
      </macAddress>
    </vnic>
  </vnics>
  <cliSettings>
    <userName>admin</userName>
    <password>password</password>
    <remoteAccess>false</remoteAccess>
  </cliSettings>
</edge>
Adding a Reference Flow to trigger the NSX® API

As mentioned earlier, we will be calling an external NSX® API targeted at a designated NSX® Manager, passing it the data captured by the HTTP Listener as defined by our TM Forum RAML schema. Therefore, we must identify which combination of method and resource is most suitable for our use case. Scrolling through, it makes most sense to invoke the Service POST block as our request entry point, so we will embed a Flow Reference here. For the moment, we simply need to capture a display name, which we will reference in the request flow we create next.

Creating a Request Flow for NSX® API

Now that we have created a “fork” in our flow, the incoming request can be relayed to a secondary flow that will comprise the external request to the NSX® Manager to create the ESG.

  1. From the Mule Palette tab on the right-hand side, search for the Transform Message element
  2. Drag and drop the element onto the Message Flow graphical view
  3. This will automatically create a parent flow too, which we will have to name to match the Flow Reference we created in the previous step
Transforming Input Data

The Transform Message component carries out transformations over the input data it receives. This element makes use of DataWeave, MuleSoft’s expression language for accessing and transforming data. We have the choice of either using the UI to build transformations implicitly via drag and drop or explicitly writing it out in DataWeave language. 

  1. Click the Transform Message element
  2. You will notice that our Input Payload is defined as per the TM Forum RAML specification, however we need to define our Output metadata to be mapped. Click Define metadata.
  3. Select Add
  4. Enter a new Type id and select Create type
  5. Change Type to Xml
  6. Change definition from Schema to Example
  7. Browse for the NSX® ESG POST XML document
  8. Click Select

Once the metadata has been auto-populated, we will map the relevant data fields. It is really your prerogative which fields from the TM Forum specification align most appropriately with the external API; we have designated the serviceCharacteristic subset as the appropriate fields for mapping. For the sake of simplicity, of the condensed NSX® request body (which includes all required fields), we will only map three fields for the consumer (Postman) to populate – the other required fields we will hard-code. This is somewhat reflective of an actual service-provisioning request, in that you may opt to restrict the selection of some fields on the customer’s behalf.

The final DataWeave output will look like the following:

%dw 1.0
%output application/xml
---
{
  "edge": {
    "datacenterMoid": "datacenter-1",
    "name": payload.serviceCharacteristic[2].value,
    "type": "gatewayServices",
    "appliances": {
      "applianceSize": payload.serviceCharacteristic[0].value,
      "appliance": {
        "resourcePoolId": "domain-1",
        "datastoreId": "datastore-1"
      }
    },
    "vnics": {
      "vnic": {
        "index": "0",
        "portgroupId": "dvportgroup-10",
        "addressGroups": {
          "addressGroup": {
            "primaryAddress": payload.serviceCharacteristic[1].value,
            "subnetMask": "255.255.255.0"
          }
        },
        "macAddress": { "edgeVmHaIndex": "0" },
        "isConnected": "true"
      }
    },
    "cliSettings": {
      "userName": "admin",
      "password": "password",
      "remoteAccess": "false"
    }
  }
}
Configuring the HTTP Requester

The HTTP Requester will make the external call to the NSX® Manager. In our case it will use a POST method, sending the message payload defined in the previous step as the body of the request. The minimum required setting for the request operation is a host URI, which can include a path; however, we will be adding a few additional configurations such as authentication, headers, TLS/SSL and timeout limits. (A rough curl equivalent of the final request is sketched after the steps below.)

  1. Adjoin an HTTP Connector as the subsequent step in the Mule flow
  2. On the General tab of the HTTP Connector, update the Display Name as appropriate
  3. Update the Method to POST and the URI path (the extension of the base path). Note: we will set the base path to ‘/api’ in the next steps, however the full path could easily be set here too.
  4. Add the two headers Content-Type and Authorization in the Parameters section. Authorization requires Base64 encoding.
  5. Select the Edit icon of the Connector Configuration field
  6. Update the URL Configuration parameters (Protocol and Port must be set to HTTPS and 443, respectively) and increase the Idle and Response Timeout values to 120000
  7. Next we will disable any TLS validations. You may have more stringent requirements in your environment.
  8. Using Basic Authentication, enter the login credentials of the NSX® Manager
  9. For good measure, we will include a Logger in our Mule flow (set at the default log level of INFO). This will aid troubleshooting by surfacing any error messages or status notifications.
  10. Additionally, we will add a Set Payload transformer in order to return the JSON response payload.
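
Outside of Anypoint Studio, the call the HTTP Requester ends up making is roughly equivalent to the following curl command (the edges path shown is the NSX-v endpoint for installing an edge – check the API guide for your NSX version; the hostname, credentials and XML file are placeholders):

curl -k -u admin:password \
     -X POST "https://nsx-manager.example.local/api/4.0/edges" \
     -H "Content-Type: application/xml" \
     -d @edge-request.xml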

Running the Mule project and executing the POST API call via Postman

We can run the project by right-clicking anywhere in the Message Flow area. As the project is compiling, the Console tab will show the Mule Runtime logs. Wait until the console shows the application has completed deployment.

Now we’re ready to call our API via Postman. In our JSON request body we’ve included a few additional fields to ‘pad it out’, but effectively we are only concerned with the values and arrangement of the items in the serviceCharacteristic list, as they are the ones being mapped. Recall that we referenced our values by index during the DataWeave transformation, so it’s important that we organise our list items accordingly. Enter the URL we set in our HTTP Listener configuration and hit Send!
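
As a minimal illustration (the characteristic names here are placeholders – only the positions matter, since the DataWeave mapping references the list by index), the body of the POST could look like this:

{
  "serviceCharacteristic": [
    { "name": "applianceSize", "value": "compact" },
    { "name": "primaryAddress", "value": "10.10.10.100" },
    { "name": "name", "value": "NSX-Edge-45" }
  ]
}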

You can see below that our call returned a 201 response, denoting our service has been provisioned.

When we log in to our vSphere® environment, we notice that, sure enough, an NSX® Edge VM has been spun up. For some reason, however, the NSX® API appends a ‘0’ to the virtual device’s name. I have tried searching for the root cause, to no avail, so if anyone is in the know, let us know in the comments below!

Conclusion

And there we have it: we’ve successfully created an end-to-end API that takes a best-practice industry standard set by TM Forum and transposes it into an external API call to VMware NSX®.

Obviously, this is a very simple configuration we have implemented in MuleSoft. Improvements in the way of error handling would certainly need to be put in place in a production environment, as error codes form the first and most important step towards not only notifying the user of a failure, but also jump-starting the error-resolution process.

Writing Custom Ansible Modules Integrating Redfish APIs

Redfish is a growing industry standard for RESTful, API-driven management of commodity hardware. In a cloud arena that is increasingly multi-vendor and hyper-converged, the need for a platform-agnostic tooling set is becoming ever more vital. The Redfish API makes use of resources expressed as an OData or JSON schema – the latter of which will be used in this guide. The three main categories of these objects are systems (CPU, memory), managers (the management subsystem) and chassis (racks, blades).

Red Hat’s Ansible has surged in recent years to become a market leader in configuration management and orchestration, significantly improving the scalability and reliability of data-centre infrastructure. From the automation of simple, everyday system-administration tasks to full-stack deployments, the ease with which these functions are executed, thanks to Ansible’s use of YAML, spells a flat learning curve.

Another attraction of Ansible is the plethora of “pre-baked” modules available through its library, Ansible Core, which is continuously augmented and updated by the open-source community. So too, due to its popularity, engineers and developers from notable vendors ensure that their contributions to the Ansible “Engine” are regular and reliable. However, that last point results in vendor-specific libraries of Ansible modules that only cater to a portion of our multi-vendor environments. Thus, there is a growing need to support the common services and interfaces shared across our infrastructure fleet.

In this guide, we will go through the steps to create a custom Ansible module comprising a Redfish API call that disables DHCP on the OOB management controller and designates a fixed IP address in its place.

Prerequisites
  • A Linux machine with Python 3 installed – I am using CentOS7. This will be used as your local Ansible server. Generally, we would utilise a server-host architecture whereby the Ansible playbooks are executed against a pool of designated hosts, but as we are interacting with the Out-Of-Band (OOB) controller of a bare-metal server, this architecture is not required.
  • A bare-metal server that supports Redfish functionality, i.e. an Alliance Partner of DMTF. Throughout this guide, we will make reference to the acronym, BMC (Baseboard Management Controller). BMC (accessible at Layer 3) provides the intelligence in the IPMI architecture, a standard interface protocol for the subsystem that facilitates management capabilities of the host’s CPU, firmware and OS. 
Installing Ansible

To begin using Ansible from a centrally controlled node, we need to install the Ansible engine on our Linux machine.

To get the latest version of Ansible for CentOS7, we need to install and enable the EPEL repository:

$ sudo yum install epel-release 

Once the latest version of EPEL has been installed, you can proceed to install Ansible with yum and accept the dependent packages.

$ sudo yum install ansible

We now have all the software we require installed and can begin to write Python code that will form the basis of our custom Ansible module.

Defining Ansible inventory

This is not so much a step as a comment on how we will handle Ansible’s concept of inventory in this guide. Typically, we would define the set of targets (hosts) that we intend to apply a configuration (play) against in Ansible’s inventory; by default, this file can be found at /etc/ansible/hosts. In this file we are able to compartmentalise our infrastructure into groups, which allows users to decide which systems we are controlling and for what purpose. Below is an example of a standalone host and a group of hosts:

mail.example.com

[webservers]
foo.example.com
bar.example.com

Ansible uses native OpenSSH for remote connections to the hosts nominated in a given playbook. Our case, however, is a little unusual: because our core functionality comprises a Redfish API call, the URL or IP address of the OOB controller must be passed as an argument rather than referenced as a host or group at the beginning of the playbook. This will make more sense as we go through Step 3 and observe how the API URL is captured as a variable in our Python code.

Instead, we will set the hosts: directive to localhost, which, put simply, ensures all tasks are executed on the local machine.

Writing a Python script as our Ansible module

By default, Ansible situates its configuration and inventory files under /etc/ansible. When scripting custom modules, I have found it best practice to keep modules and playbooks (the YAML-syntaxed “instruction manuals” based on the modules) together in this same directory. You may wish, however, to create these under your home directory if multiple developers are working on the same node. Just make sure Ansible can locate the custom module at run time – for example via a library/ directory alongside the playbook, or the library path set in ansible.cfg.

Firstly, create your Python file with an appropriate title. Ideally, you will want to give your module and playbook the same name for the sake of traceability and versioning. The module I am going to create will set DHCP to false for the BMC IP address, so that I can access OOB management via a static IP.

The following is a collection of micro-steps that together comprise the entire script, which is included at the bottom of this guide.

In our oob_set_dhcp_false.py, make sure to include a shebang line so that the script can be executed as a standalone executable:

#!/usr/bin/python

Ansible likes us to include DOCUMENTATION and EXAMPLES strings at the top of our module. It is good practice to include a brief synopsis of the functionality your module provides, including the expected structure and parameters of the Ansible playbook that will materialise from our Python script. We can either copy/paste this section from the YAML playbook file we will create in the next step or, seeing as we are scripting our Python now, essentially create our playbook here.

"""
# Sample YAML file
#
- hosts: localhost
  gather_facts: no
  tasks:
  - name: Disable DHCP on Out-Of-Band Management Controller
    oob_set_dhcp_false:
      leased_bmc_ip: 127.0.0.1
      fixed_bmc_ip: 10.10.10.10
      fixed_bmc_netmask: 255.255.255.0
      bmc_username: username
      bmc_password: password
    register: result

  - debug: var=result
#
# Return Values
#
# failed: one of True or False
# changed: False
# msg: "HTTP Response {{ result }}. DHCP on iLO {{ leased_bmc_ip }} has successfully been disabled. A reset is now required to update the changes."
#
"""

AnsibleModule is provided by the line from ansible.module_utils.basic import *; it must be imported with the *.

Import AnsibleModule along with the following Python modules:

from ansible.module_utils.basic import *
import requests
import json
import os
import urllib3

AnsibleModule provides lots of common code for handling returns, parses your arguments for you, and allows you to check inputs. All of which are important when our end-users are interacting with the front-end playbook element of our functionality.

After we have defined main() as the entrypoint into our module, we establish the accepted parameter type and mandatory/non-mandatory conditions:

module = AnsibleModule(
  argument_spec=dict(
    leased_bmc_ip=dict(type='str', required=True),
    fixed_bmc_ip=dict(type='str', required=True),
    fixed_bmc_netmask=dict(type='str', required=True),
    bmc_username=dict(type='str', required=True),
    bmc_password=dict(type='str', required=True),

  ),
  supports_check_mode=False,
)

leased_bmc_ip = module.params['leased_bmc_ip']
fixed_bmc_ip = module.params['fixed_bmc_ip']
fixed_bmc_netmask = module.params['fixed_bmc_netmask']
bmc_username = module.params['bmc_username']
bmc_password = module.params['bmc_password']

Now it’s time to embed the Redfish PATCH call into our script, whilst allowing the relevant arguments to be passed in from the resulting playbook. First, to keep things cleaner, we will disable warnings so that they do not appear in the playbook’s output.

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Then we will define the three essential components of the API call namely the URL, headers and payload. Notice that leased_bmc_ip will be passed in as an argument here.

url = 'https://%s/redfish/v1/Managers/1/EthernetInterfaces/1/' % leased_bmc_ip
headers = {'content-type': 'application/json'}
# Disable DHCPv4 on the OOB interface via the HPE (iLO) OEM extension
payload = {
    "Oem": {
        "Hpe": {
            "DHCPv4": {
                "Enabled": False
            }
        }
    }
}

Employing try and except blocks means execution of the try block is terminated the moment an error is raised. If no error is raised, the status code as well as the BMC IP address is returned successfully, like this:

module.exit_json(changed=True, something_else=12345)

Otherwise, the error results in a transfer down to the except block, and the failures are handled just as simply: 

module.fail_json(msg="Something fatal happened")

The raising of exceptions in this example is handled very primitively, using RequestException which is a base class for HTTPError, ConnectionError, Timeout, URLRequired and others. 

try:
        response=requests.patch(url, data=json.dumps(payload), headers=headers, verify=False, auth=(bmc_username,bmc_password), timeout=30)
        status=response.status_code
        module.exit_json(changed=False, status=status, fixed_bmc_ip=fixed_bmc_ip)
except requests.exceptions.RequestException as e:
        module.fail_json(changed=False, msg="%s" % (e))

We now have a working script that will form the basis of our Ansible playbook!

Writing the Ansible playbook

As has been mentioned several times but perhaps not explained entirely, playbooks are the shorthand for modules, designed to be human-readable via their YAML syntax. That’s essentially the beauty of Ansible: a set of instructions (plays) run against a group of targets (hosts) that can be picked up by virtually anyone with ease, thus promoting reusability.

Most certainly, we bore the brunt of the work when creating our Python script. All we have to do now is produce a simple YAML file reflective of the module we’ve created and that’s all there is to it!

Playbooks always begin with three YAML dashes (---).

Afterwards, defining a name for the playbook is always good practice and keeps playbooks reusable. Then we define a host or set of hosts (referenced as a group in our inventory file, as explained in Step 2); in our case, however, we will be running the playbook against a single ‘host’ (an Out-Of-Band Management controller). At the same indent level as the previous two lines goes the tasks: statement, which is where the designated plays (modules) are executed. As per YAML nesting, these plays are listed one indent deeper (I usually opt for a two-space indentation).

register is used to capture the output of a task into a variable, which in our case will be result. We then go on to make use of the built-in Ansible module debug to simply print the output; as the name suggests, it is most useful for debugging variables or expressions – particularly when you are running multiple plays in one playbook and do not want to abort the entire playbook. The last line, when, provides a conditional statement saying to only print the debug message in the instance of a successful result from the Redfish API call.

---
# My Ansible playbook
- name: Disable DHCP on Out-Of-Band Management Controller
  hosts: localhost
  gather_facts: false
  tasks:
  - oob_set_dhcp_false:
      leased_bmc_ip: 10.10.10.10
      fixed_bmc_ip: 10.10.10.20
      fixed_bmc_netmask: 255.255.255.0
      bmc_username: username
      bmc_password: password
    register: result

  - debug: msg="HTTP Response {{ result.status }}. DHCP on new OOB IP Address {{ result.fixed_bmc_ip }} has successfully been disabled. A reset is now required to update the changes."
    when: result is succeeded

We can now run our playbook:

$ ansible-playbook oob_set_dhcp_false.yml
Output:
[root@~ playbooks]# ansible-playbook oob_set_dhcp_false.yml

PLAY [Disable DHCP on Out-Of-Band Management Controller] ***************************************************

TASK [oob_set_dhcp_false] **********************************************************************************
 [WARNING]: Module did not set no_log for bmc_password

ok: [localhost]

TASK [debug] ***********************************************************************************************
ok: [localhost] => {
    "msg": "HTTP Response 200. DHCP on new OOB IP Address 10.10.10.20 has successfully been disabled. A reset is now required to update the changes."
}

PLAY RECAP *************************************************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0
Conclusion

And that’s it. Congrats! You’ve successfully created an Ansible module that can disable DHCP and assign a fixed IP Address to an Out-Of-Band Management Controller on a Redfish-supported server.

There’s certainly a lot of room for improvement with our module; including:
  • We could triage the varying error codes from the Redfish API. As it stands, we are handling every error generically, but an improvement would be to cater for the likes of 400 BAD REQUEST and 404 NOT FOUND (see the sketch after this list)
  • Naturally we would not want to store usernames and passwords as plaintext in our playbook, instead we would use Ansible Vault, a feature allowing encryption of passwords and keys. We will explore Ansible Vault in future posts. 
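
As a rough illustration of that first point (a hypothetical refinement, not part of the module as written), the status code returned by requests could be triaged before exiting:

# Hypothetical refinement: triage common Redfish error responses rather than
# treating every failure generically
if status == 400:
    module.fail_json(msg="400 Bad Request: the payload was rejected by the BMC")
elif status == 404:
    module.fail_json(msg="404 Not Found: the EthernetInterfaces path may differ on this BMC")
elif status >= 400:
    module.fail_json(msg="Redfish API returned HTTP %s" % status)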

Here is the final code for our module:
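
Stitching together the snippets from the steps above (with the standard entry-point guard added at the bottom), the complete module looks like this:

#!/usr/bin/python

"""
# Sample YAML file
#
- hosts: localhost
  gather_facts: no
  tasks:
  - name: Disable DHCP on Out-Of-Band Management Controller
    oob_set_dhcp_false:
      leased_bmc_ip: 127.0.0.1
      fixed_bmc_ip: 10.10.10.10
      fixed_bmc_netmask: 255.255.255.0
      bmc_username: username
      bmc_password: password
    register: result

  - debug: var=result
"""

from ansible.module_utils.basic import *
import requests
import json
import os
import urllib3


def main():
    module = AnsibleModule(
        argument_spec=dict(
            leased_bmc_ip=dict(type='str', required=True),
            fixed_bmc_ip=dict(type='str', required=True),
            fixed_bmc_netmask=dict(type='str', required=True),
            bmc_username=dict(type='str', required=True),
            bmc_password=dict(type='str', required=True),
        ),
        supports_check_mode=False,
    )

    leased_bmc_ip = module.params['leased_bmc_ip']
    fixed_bmc_ip = module.params['fixed_bmc_ip']
    fixed_bmc_netmask = module.params['fixed_bmc_netmask']
    bmc_username = module.params['bmc_username']
    bmc_password = module.params['bmc_password']

    # Keep certificate warnings from the BMC's self-signed cert out of the playbook output
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    url = 'https://%s/redfish/v1/Managers/1/EthernetInterfaces/1/' % leased_bmc_ip
    headers = {'content-type': 'application/json'}
    # Disable DHCPv4 on the OOB interface via the HPE (iLO) OEM extension
    payload = {
        "Oem": {
            "Hpe": {
                "DHCPv4": {
                    "Enabled": False
                }
            }
        }
    }

    try:
        response = requests.patch(url, data=json.dumps(payload), headers=headers,
                                  verify=False, auth=(bmc_username, bmc_password),
                                  timeout=30)
        status = response.status_code
        module.exit_json(changed=False, status=status, fixed_bmc_ip=fixed_bmc_ip)
    except requests.exceptions.RequestException as e:
        module.fail_json(changed=False, msg="%s" % (e))


if __name__ == '__main__':
    main()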