Horizontal Autoscaling with Kubernetes HPA
Want to handle colossal traffic like a pro? Or just dreaming big? Either way, you need to think about scaling.
Does this mean I just buy more servers?
Yes and no. Or to be more specific: no, not in the traditional sense. You don't buy more servers, but you give your cloud service provider clearance to allocate more instances of your application as demand grows. And you don't do it manually, you automate it. The Kubernetes HPA is a great option for this. Let's take a look.
Your scaling options
In Kubernetes there are three flavors you can choose from when scaling your application:
- Cluster Autoscaler
- Vertical Pod Autoscaler (VPA)
- Horizontal Pod Autoscaler (HPA)
With the Cluster Autoscaler you'd set up a node pool with boundaries for the number of nodes. Scaling is then executed at the node level, based on your workloads' resource requests.
VPA and HPA, in contrast, are workload-level scalers. The Vertical Pod Autoscaler adjusts the resource requests of your pods based on their observed usage, within the bounds you define in a resourcePolicy. One way of utilizing VPA is to have it only recommend how to set your pods' resource requests. Note that VPA is not built into Kubernetes by default, but you can install it as an add-on; your favorite cloud provider will likely offer it as a managed option.
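To make that concrete, here's a minimal sketch of a VPA running in pure recommendation mode; the target Deployment name my-app and the resource bounds are made-up values for illustration:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment name
  updatePolicy:
    updateMode: "Off"       # only produce recommendations, never evict pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "1"
          memory: 512Mi
```

Once deployed, the recommendations show up in the object's status (e.g. via kubectl describe), so you can use them to tune your requests by hand.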
Since the complexity of Cluster Autoscaling can be abstracted away, and we assume you've already optimized and know your pods' resource requests, we'll focus on horizontal scaling today.
Enter the Kubernetes Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler manages the number of replicas (simplified: how many copies of your app are running in parallel) based on metrics you define. The most common metrics are the CPU and memory utilization of your workloads. The HPA gets these values from the Metrics API.
Example: If you tell the HPA to keep an eye on the CPU utilization of your pods, and they are starting to exceed a certain threshold on average, the HPA will spin up more instances of your app to account for the increased load. The same goes for scaling down when the load decreases. You can define the threshold and a lot more as per your requirements.
Cool hint for our control systems engineering friends: you can also configure a stabilization window, which acts like hysteresis, to avoid a "flapping" effect. This can be very helpful if your load fluctuates a lot. You can even define scaling rate limits.
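In the autoscaling/v2 API these knobs live under the HPA's spec.behavior field. A sketch with illustrative values (placeholders, not recommendations):

```yaml
# Excerpt of a HorizontalPodAutoscaler spec (autoscaling/v2), placed under spec:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 min of consistently lower load before scaling down
    policies:
      - type: Pods
        value: 1                      # remove at most 1 pod ...
        periodSeconds: 60             # ... per minute
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load spikes immediately
    policies:
      - type: Percent
        value: 100                    # at most double the replica count ...
        periodSeconds: 30             # ... every 30 seconds
```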
HPA Demonstration
Enough talk, let's see the HPA in action! To keep things simple, we'll choose an nginx Deployment for our load test example and flange-mount an HPA on it:
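The exact manifests aren't important, but a minimal sketch of such a setup could look like the following; the names, image tag and resource requests are illustrative, while the 1 to 5 replica range and the 50% CPU target are the values used in this test:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27          # illustrative tag
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m              # required: CPU utilization is measured against the request
              memory: 64Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50     # scale up once average CPU exceeds 50% of the request
```

Keep in mind the HPA relies on the Metrics API, so a metrics-server (or an equivalent) has to be running in the cluster, and CPU utilization is always measured relative to the pods' requests.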
We'll grant the HPA room to scale our workload up to 5 replicas, starting from only one for this test. But don't think you can cheap out on replicas and run only one replica in prod, please :) As always, we're applying GitOps principles, so all it takes to get this deployment into our cluster is a simple git push.
After that we wait a short moment for our controller to reconcile the changes, and... aha, here it is, happily up and running.
"Ok, what's next?" I hear you ask. You know what's next, it's time to load our request gun. Are you ready? Let's go:
Not bad, huh? Now let's inspect whether our deployment survived the blast and, if so, what the HPA did:
Observe how the wave of requests fills up the processing capabilities of the server up to the point where it asks for more. In our case, this is when the CPU utilization exceeds 50% on average across all pods. The HPA then starts to scale up the number of replicas to handle the load.
Observe the delay between the load and the CPU utilization, as well as the delay between the CPU utilization and the HPA scaling up. Also see how the HPA keeps the pods running for quite a bit after the load has decreased. Through configuration you can shape the system's response to some extent. Also note that scaling takes time, as containers have to be created, images have to be fetched (you can stream them to save time!), and so on.
Please note that the data was resampled and averaged in 5s bins (e.g. there is no such thing as a fraction of a replica), and the x-axis is non-linear. This is not an experiment you'd consider the "pure doctrine of science", so take the results with a grain of salt. But it should give us a good idea of how the HPA behaves in a real-world scenario.
I love it when a plan comes together!
Not only Colonel John "Hannibal" Smith loves it when a plan comes together; so do we. Kudos to the HPA, our app is now as flexible as a yoga master. It stretches and shrinks to match the load, keeping everything zen and efficient :) Now that you've seen the power of the HPA, go out and use it! Or wait, there's more we'd like to tell you...
More great benefits of HPA
Did you notice how in our experiment the number of replicas was capped at our configured maximum of 5? You know why this is a good thing? Unlike the wild west of serverless, where costs may grow beyond all measure, Kubernetes with the HPA lets you define hard limits for your resources. This means you can predict the cost of this part of your operations with the precision of a Swiss watch (almost). It should be noted, though, that "unlimited" spend can be a problem with every cloud service provider, as scaling is at best soft-capped by default (think billing alerts) and usually paired with a pay-as-you-go model.
But whatever, here's another benefit, just because we like you: rolling out new versions of your app is a breeze with HPA. Whenever you're pushing an update and you're running more than one replica, Kubernetes will gradually roll it out, so you can deploy without downtime drama. It's almost like having a personal assistant for your deployment roll-outs!
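To be precise, rolling updates are a feature of the Deployment itself; the HPA's job is to keep enough replicas around for them to be seamless. How aggressively pods are swapped during a rollout can be tuned via the update strategy, sketched here with illustrative values:

```yaml
# Excerpt of a Deployment spec, placed under spec:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # allow one extra pod above the desired count during the rollout
    maxUnavailable: 0    # never drop below the desired count, i.e. zero-downtime rollout
```

With maxUnavailable set to 0, old pods are only removed once their replacements are ready.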
Summary
That's all for now. We explained the different scaling options available in Kubernetes and focused on the Horizontal Pod Autoscaler. We showed you the benefits of scaling your app with the HPA and how it behaves under load. We hope you enjoyed this article and learned something new.
Until next time, happy scaling!
If you need help setting up scaling for your applications or increasing the availability of your deployments, feel free to contact us. We're here to help!