New Two-Stage Rate Limiting with NGINX Plus

NGINX Plus R17 has been generally available for some months now. While the cornerstone feature of the release is support for TLS 1.3 (the latest version of the protocol securing Internet traffic), I want to dedicate this blog to one of the other capabilities of the all-in-one web server, content cache and load balancer: rate limiting.

Rate limiting is by no means a novel capability of HTTP servers, and NGINX has a long history here with its ngx_http_limit_req_module module. Rate limiting allows us to constrain the number of HTTP requests a user can make over a given period of time. Viewed as a fundamental security mechanism, it is employed as a primary defense against DDoS attacks, protecting upstream application servers from being inundated.

Prerequisites
  • An NGINX Plus subscription (purchased or trial)
  • A supported operating system
  • root privilege
  • Your NGINX Plus certificate and public key (nginx-repo.crt and nginx-repo.key files), provided by email from NGINX, Inc.
  • I’m installing NGINX Plus on a CentOS 7 machine, because we will also be making use of Siege, a Linux program for HTTP stress testing

NGINX has a detailed installation guide covering several operating systems here that should get you up and running within minutes.
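Siege itself isn’t part of the NGINX Plus packages; on CentOS 7 it is typically pulled in from the EPEL repository. A minimal sketch of the commands I used (your package sources may differ):

sudo yum install -y epel-release
sudo yum install -y siege
siege --version    # confirm the install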

Leaky Bucket

NGINX’s rate limiting follows the industry-standard ‘leaky bucket’ algorithm, based on the analogy that if the average rate at which water is poured into a bucket exceeds the constant rate at which it leaks out, the bucket overflows and the excess is discarded. Notice the emphasis on the term ‘average‘: the input rate is assumed to vary, and this is why the concept of ‘bursting‘ is introduced. The advantage of such a scheme is that a sequence of bursts is smoothed out by processing the requests at a constant rate, in first in, first out (FIFO) order. Prior to this release, NGINX handled excessive requests (those exceeding the set rate) by either rejecting them immediately or queuing them until they could be processed later, within the limits of the defined rate.

First, it’s important to understand the rejection logic of the limit_req directive, which remains unchanged from previous NGINX Plus releases:

Zone – This defines the ‘shared’ memory zone in which state for all incoming requests is collected. Keeping it shared means the information is visible across all NGINX worker processes. The $binary_remote_addr key holds the binary representation of the client’s IPv4 address; using this key, state for approximately 16,000 IP addresses can be stored for every 1 MB of zone memory. Following the ‘leaky bucket’ model, this means each unique IP address is limited to the request rate, generally defined in r/s or r/m.

Rate – Sets the maximum rate of requests over a given time interval. If no burst is set, this rate is definitive. Say we define a rate of 1r/s, allowing one request every second for a given zone. When the initial request matching the zone comes in, NGINX processes it and effectively stops accepting further requests for that key until the interval elapses. If another request arrives before one second has passed, it is rejected with a 503 status code.

Burst – This is an optional parameter that dictates how many excessive requests the server will accept over the base rate. Best viewed as an extension of the ‘leaky bucket’, NGINX issues tokens based on a counter of 1 (the request allowed by the rate itself) plus the burst value. Every time the timer ticks over (with 5r/s, for example, every 200 milliseconds), a burst token is replenished if the counter is not already at its maximum. With a burst value set, NGINX therefore accepts excessive requests while burst tokens are available, but only processes them once it has capacity to do so (i.e. within the constraints of the rate limit). This is all best explained through the aid of an example, which will be demonstrated in the next steps of this blog (see the configuration sketch below).
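To make the three directives concrete, here is a minimal sketch of how zone, rate and burst fit together; the zone name ip and the 5r/s and burst=12 values simply mirror the full configuration at the end of this post:

http {
    # one state entry per client IP address, 10 MB of shared memory, 5 requests per second
    limit_req_zone $binary_remote_addr zone=ip:10m rate=5r/s;

    server {
        location /nodelay {
            # queue up to 12 excessive requests; anything beyond that is rejected with a 503
            limit_req zone=ip burst=12;
        }
    }
}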

Evolution of the Delay

The nodelay parameter has been around pretty much since NGINX’s inception; it forces the whole burst window to be processed immediately. Draining the burst only incrementally, in line with the rate limit, is often not very practical, as our website may appear slow: there may be a number of resources that need to be pulled down simultaneously, such as stylesheets, images and JavaScript code. With nodelay set, further requests (up to the burst limit) are served immediately, as long as there is a slot available for them in the queue.

The R17 release takes this one step further with the introduction of the delay parameter, and with it two-stage rate limiting in NGINX is born. This method ensures pages don’t load slowly while at the same time imposing more fine-grained throttling to prevent overloading the back end. Two-stage rate limiting means excessive requests are initially delayed and then ultimately rejected if the rate limit is still exceeded.

If, for example, we know we never load more than 12 resources per page, we could include a configuration that allows bursts of up to 12 requests, where the first 8 are processed without delay. The delay parameter specifies the absolute number of excessive requests after which further excessive requests become delayed.
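In configuration terms this is a single extra parameter on limit_req. The sketch below assumes the same ip zone defined above and mirrors the /delay location in the full config at the end of the post:

location /delay {
    # first 8 excessive requests are forwarded immediately,
    # the next 4 (burst - delay) are smoothed out to 5r/s,
    # anything beyond burst=12 is rejected with a 503
    limit_req zone=ip burst=12 delay=8;
}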

I want to note that this is the same example referenced in NGINX’s documentation; however, I feel the accompanying explanation there is a little misleading. By employing the stress-testing program Siege, together with the diagrams I’ve formulated below, my hope is to shed clearer light on the demo by looking at a more granular, per-request view.

Let’s move on and visualize this scenario by emulating some HTTP traffic to our web server, taking note of the differences between rate-limiting methods.

To use a simple example, let’s say we have a rate of 5r/s, and a burst of 12. NGINX receives 15 requests at the same time:

  • The first one is accepted and processed immediately
  • Because we allow 1 + 12 requests in total, the last 2 are immediately rejected with a 503 status code
  • The other 12 are processed, one by one, but not immediately: they are handled at the rate of 5r/s (one every 200 milliseconds) to stay within the limit we set. You can see from the illustration below the span in which these requests are accepted, with one block representing one second.

All this takes place unbeknownst to the upstream server(s), which only ever receive requests at the capped rate of 5r/s.

To further this visual representation, we can use Siege, a multi-threaded HTTP stress tester. This program allows us to simulate a configurable number of concurrent users making requests to a web server.

In our example, we’re firing 15 parallel requests at the rate-limited /nodelay endpoint; the output can be seen below:

[root@nginx-node nginx]# siege -b -r 1 -c 15 -d 1 http://localhost/nodelay:80
** SIEGE 4.0.2
** Preparing 15 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 503     0.00 secs:     197 bytes ==> GET  /nodelay:80
HTTP/1.1 503     0.00 secs:     197 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.40 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.60 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     0.80 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.40 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.60 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     1.80 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.00 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.20 secs:     612 bytes ==> GET  /nodelay:80
HTTP/1.1 200     2.40 secs:     612 bytes ==> GET  /nodelay:80

Transactions:                     13 hits
Availability:                  86.67 %
Elapsed time:                   2.40 secs
Data transferred:               0.01 MB
Response time:                  1.20 secs
Transaction rate:               5.42 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                    6.50
Successful transactions:          13
Failed transactions:               2
Longest transaction:            2.40
Shortest transaction:           0.00

From the Siege report above, we can see the pattern is consistent with what was illustrated in the diagram. Two requests (15 - (burst value + 1)) are rejected immediately, and the remaining 13 are handled in accordance with the set rate of 5r/s, i.e. one request every 200 milliseconds.

But now we want to introduce the delay parameter, because we want our users to pull down all the resources of our login page without high latency. So let’s configure delay=8, which defers only those excessive requests beyond the first 8, while still maintaining the rejection policy from above: anything over 12 excessive requests (prescribed via our burst parameter) is rebuffed.

With this configuration in place, and maintaining our previous stream of 15 parallel requests, we can expect to see the behavior depicted in the time-series diagram below:

Again, we notice the same number of rejected requests (2); however, the time-frame in which the accepted requests are handled is significantly shorter than before:

  • The first request plus the first 8 excessive requests (the delay value) are proxied without delay by NGINX Plus
  • Again, 2 requests are rebuffed with a 503 status code
  • The remaining 4 are processed, one by one, at a rate of 5r/s, which means the last of them is accepted after 0.8 seconds

The Siege report allows us to see what’s happening as time elapses between requests:

[root@nginx-node nginx]# siege -b -r 1 -c 15 -d 1 http://localhost/delay:80
** SIEGE 4.0.2
** Preparing 15 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.00 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.01 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.01 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 503     0.01 secs:     197 bytes ==> GET  /delay:80
HTTP/1.1 503     0.01 secs:     197 bytes ==> GET  /delay:80
HTTP/1.1 200     0.20 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.40 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.60 secs:     612 bytes ==> GET  /delay:80
HTTP/1.1 200     0.80 secs:     612 bytes ==> GET  /delay:80

Transactions:                     13 hits
Availability:                  86.67 %
Elapsed time:                   0.80 secs
Data transferred:               0.01 MB
Response time:                  0.16 secs
Transaction rate:              16.25 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                    2.55
Successful transactions:          13
Failed transactions:               2
Longest transaction:            0.80
Shortest transaction:           0.00

It should be noted that in both of these examples (and in the diagrams) I’ve ignored the fact that completed requests free up slots in the burst queue, which is behavior we would see in reality: with each tick of the rate timer, further excessive requests could be absorbed within the configured burst size.

Conclusion

When imposing a delay, we saw no difference in the number of rejected requests, but a change in how the incoming traffic is shaped. Instead of queuing all excessive requests (those exceeding the rate limit), which is what happens when no delay is configured, we’re able to deliver important web content immediately and ensure our users don’t experience high latency. This is a powerful tool if we know, on a page-by-page basis, how many resources need to be pulled down, allowing us to be more surgical when it comes to handling traffic.

This may seem trivial in a small-scale demo, but when social media and large e-commerce sites are designing their reverse-proxying infrastructure for massive load, the advanced features of NGINX rate limiting pave the way for higher levels of precision in enforcing rate limits and traffic policies. Without such controls, security threats have every potential to harm the performance, and consequently the customer experience, of the web application.

This demo made use of the following simple NGINX config:

http {
    limit_req_zone $binary_remote_addr zone=ip:10m rate=5r/s;

    server {
        server_name     localhost;
        root /usr/share/nginx/html;
        listen  127.0.0.1:80;
        access_log /var/log/nginx/access.log combined;

        location /delay {
            limit_req zone=ip burst=12 delay=8;
            try_files /index.html =404;
        }

        location /nodelay {
            limit_req zone=ip burst=12;
            try_files /index.html =404;
        }
    }
}
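
After editing the config, a quick syntax check and reload applies the changes (standard nginx commands; adjust for your environment):

sudo nginx -t          # validate the configuration
sudo nginx -s reload   # apply it without dropping connections

From there, the two Siege invocations shown earlier reproduce the nodelay and delay behavior respectively.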