How I set up Autoscaling on Kubernetes and you can too!

Arjun Sunil
Jan 19, 2020
Bunny in the hole

TL;DR: Check the linked infographic below.

Kubernetes is one big rabbit hole, and anyone getting started with it can take a while to get around its concepts and abstractions. Although it seems daunting at first, once the concepts sink in, managing infrastructure with Kubernetes is a breeze.

Okay, first things first: a brief about me. I am an ML Engineer who recently started working full-time on MLOps. Like quite a few people out there, I read tonnes of blogs and learn by doing. This post covers everything I picked up while doing my first task: setting up autoscaling on Kubernetes.

To understand all of this in its entirety, here are a few prerequisites you need to know. For brevity, I will stick to one-line explanations and redirect you to the resources I used to understand the basics.

  • Nodes: Machines that make up the Kubernetes cluster
  • Pods: Tiny residents of the cluster. Each pod wraps one or more containers.
  • Containers: Your code resides here, inside Docker containers. These run inside the pods mentioned above.
  • Deployments: Kubernetes resources that manage most Pods on the cluster.
  • Resources/Requests: The minimum required CPU/Memory for a pod to run
  • Resources/Limits: The maximum allowed CPU/Memory for a pod to run

To read more about these, head over to this amazing Kubernetes 101 writeup with great visuals.

There are many benefits of using Kubernetes, but in this write-up I will mostly talk about how I set up autoscaling and how you can do it yourself too!

Kubernetes can easily be scaled manually. But for it to shine and show its true colors (and to take one overhead away from Ops), autoscaling is the way to go. When Kubernetes autoscaling is seen in action, it seems magical (unlike Jesse Pinkman here).


Must-haves for Kubernetes autoscaling

Well, before we get started let us ensure one thing.

Resource min/max bounds must be defined for all pods/deployments

Resource Bounds

Here’s what a normal deployment looks like when resource bounds aren’t defined:

resources: {}
  # limits:
  #   cpu: 128m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 64Mi

and, here’s what it looks like when resource bounds have been defined

resources:
  limits:
    cpu: 128m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 64Mi


Limits

Limits define the upper bound for the memory and CPU a single pod can use. If these are not set, the pod is free to consume as much CPU and memory as is available on the host node. This is dangerous: I have noticed many times that undefined limits cause a node to go into the NotReady state. It tends to happen when a pod gets overloaded and asks the host machine for more computational power than it has.


Requests

As the name suggests, requests are the amount of CPU and memory that a pod/deployment asks for.

The Kubernetes scheduler factors this in when placing a new pod: if a candidate node doesn’t have the requested memory and CPU available, the pod will not get scheduled to that node.
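The scheduling check described above can be sketched roughly as follows. This is a simplified illustration and not the actual scheduler code: the helper names (`parse_cpu`, `parse_memory`, `fits`) and the node capacities are made up for the example, and only the quantity suffixes used in this post are handled.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '100m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '64Mi' or '1Gi' to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)  # no suffix: plain bytes


def fits(node_free: dict, pod_requests: dict) -> bool:
    """True if the node's unreserved CPU and memory cover the pod's requests."""
    return (
        parse_cpu(pod_requests["cpu"]) <= parse_cpu(node_free["cpu"])
        and parse_memory(pod_requests["memory"]) <= parse_memory(node_free["memory"])
    )


pod = {"cpu": "100m", "memory": "64Mi"}        # the requests from the example above
roomy_node = {"cpu": "500m", "memory": "1Gi"}  # hypothetical free capacity
full_node = {"cpu": "50m", "memory": "1Gi"}    # hypothetical node with too little CPU left

print(fits(roomy_node, pod))  # True
print(fits(full_node, pod))   # False
```

Note that the check uses requests, not actual usage: a node can look idle and still refuse a pod if its capacity is already reserved by the requests of other pods.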

Why are Requests/Limits crucial?

To understand why limits and requests are important, think of CPU utilization as air. If hard limits aren’t set, the pod inside the node (in this case, the air inside the balloon) is free to expand its utilization as much as it wants. When that happens…

Node Status: NotReady

Well, the node goes to NotReady and all pods running on it go down. This is a serious issue, since your services go down with them. Deployments do ensure the desired number of pods stays running by creating a new pod for every pod that is no longer running, but it takes a while for a new pod to come up and reach a state where it can handle requests.

Now let’s consider a scenario where proper requests and limits are set. In this case, imagine the CPU utilization to be water and the hard glass walls to be the limits. The most CPU/memory a pod (the water) is allowed to consume is the hard boundary limit (in this example, the glass walls).

Setting hard limits.

Metrics-server needs to be installed on Kubernetes

Although autoscaling has its own resource definitions in Kubernetes, at the heart of autoscaling is metrics-server. The metrics server continuously polls the CPU/memory utilization of all nodes and pods, which is used to work out whether a pod is at peak load and needs to be scaled out.

To check whether metrics-server is installed, run the commands below and check if they return the CPU/memory utilization.

kubectl top pods
kubectl top nodes

If it returns an error as shown below

Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

follow the instructions here to get it running.

Autoscaling Concepts

Autoscaling in Kubernetes can refer to horizontal scaling of pods and nodes or vertical scaling of pods.

Pod Scaling

Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas running for a deployment. It generally scales based on a target threshold, which can be set on average or current CPU/memory utilization.

Horizontal Pod Autoscaler

How it works:

  1. HPA checks metrics every 30 seconds (the default here; the sync period can differ between Kubernetes versions)
  2. If CurrentMetricValue > targetThreshold: update ReplicaCount
  3. The Deployment Controller/Replication Controller rolls out the updated number of pods
  4. Repeat from Step 1.
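Step 2 above can be made concrete with the replica formula given in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The function below is a minimal sketch of that formula only. The name `desired_replicas` is made up for this example, and the min/max replica bounds and cooldown handling are left out. The real controller also skips scaling when the ratio is within a small tolerance of 1.0 (0.1 by default, as far as I know); the sketch defaults the tolerance to 0 to match the simplified steps above.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.0) -> int:
    """Replica count derived from the ratio of observed to target metric value."""
    ratio = current_metric / target_metric
    # Within the tolerance band, leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 81, 80))                 # 5: just over target, scale out
print(desired_replicas(4, 81, 80, tolerance=0.1))  # 4: inside the tolerance band
print(desired_replicas(4, 40, 80))                 # 2: half the target, scale in
```

The same formula drives scale-in as well as scale-out, which is why a deployment running well below its target gradually loses replicas.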

Factors to remember when using HPA

  • Cooldown period: The Kubernetes HPA has a cooldown period that allows metrics to stabilize before triggering the next scale-up or scale-down event. The cooldown periods are as follows:
      • 30-second default metrics check interval
      • 3-minute wait after the last scale-up
      • 5-minute wait after the last scale-down
  • HPA tends to work better with Deployment objects than with Replication Controllers.
  • By default HPA works on CPU/memory utilization, but it can be configured to use custom metrics such as queries per minute (for DB scaling) or requests per minute (for web server scaling).
  • Average CPU/memory utilization is available with apiVersion: autoscaling/v2beta1
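To picture how the cooldown windows listed above gate scaling events, here is a toy model. It only encodes the 3-minute and 5-minute waits from this list; the names `COOLDOWN` and `cooldown_over` are hypothetical, and the real controller's behavior depends on the Kubernetes version and its configuration.

```python
from datetime import datetime, timedelta

COOLDOWN = {
    "up": timedelta(minutes=3),    # wait after the last scale-up
    "down": timedelta(minutes=5),  # wait after the last scale-down
}


def cooldown_over(last_event: str, last_event_time: datetime, now: datetime) -> bool:
    """True once the wait for the most recent scaling event has elapsed."""
    return now - last_event_time >= COOLDOWN[last_event]


scaled_at = datetime(2020, 1, 19, 12, 0, 0)  # arbitrary timestamp for the example
print(cooldown_over("up", scaled_at, scaled_at + timedelta(minutes=2)))    # False
print(cooldown_over("up", scaled_at, scaled_at + timedelta(minutes=3)))    # True
print(cooldown_over("down", scaled_at, scaled_at + timedelta(minutes=4)))  # False
```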

Setting Utilization Targets

One important factor to consider when setting target utilization: the more conservative the target, the more headroom your cluster has while scaling up.

Let’s walk through an example:

targetCPUUtilizationPercentage: 80

The moment our deployment goes past this target (say, a utilization of 81%), it starts scaling out. The headroom available is 20%, which means the existing pods still have 20% CPU to use while the new pods get scheduled and become ready to handle requests.

If we stay conservative and set 70% as the target, there is more headroom (and therefore more time) for the cluster to keep serving requests while the pods get scaled up. The Nginx helm chart sets 50% as the default! But make sure to consider how long your pods take to come up and reach the Ready state once the replica count has been updated.
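A quick back-of-the-envelope calculation shows how much the target changes the headroom. The sketch assumes utilization is measured against a fixed per-pod capacity and that load is spread evenly across the existing pods; `load_growth_headroom` is a name made up for this example.

```python
def load_growth_headroom(target_utilization_pct: float) -> float:
    """Percent by which load can grow past the target before pods are saturated."""
    return (100 - target_utilization_pct) / target_utilization_pct * 100


for target in (80, 70, 50):
    print(target, round(load_growth_headroom(target), 1))
# 80 25.0
# 70 42.9
# 50 100.0
```

So at an 80% target the existing pods can absorb roughly 25% more load while new pods come up, whereas a 50% target lets load double before they are saturated, at the cost of running more replicas during normal traffic.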

Deciding upon the target is a balancing act between optimal cluster utilization and finding the most optimal

Vertical Pod Autoscaler (VPA)

VPA, as the name suggests, scales the pod’s resource requests. This means that if a pod is under load and crossing a threshold (let’s say 70% CPU based on the limit), VPA will automatically increase the requests to accommodate the higher CPU requirement.

Vertical Pod Autoscaler

How it works:

  1. VPA checks metrics every 10 seconds (default)
  2. If CurrentMetricValue > targetThreshold: update resources: requests
  3. Pods are restarted so the new resources take effect
  4. Repeat from Step 1

Factors to consider when using VPA

  • Resource changes won’t take effect in pods without pod restarts
  • VPA changes requests; it doesn’t change limits
  • It doesn’t work well if HPA is rolled out for the same deployment
  • VPA can be used alongside HPA if HPA uses custom/external metrics
  • Still in beta. To know more about its limitations, refer here, and watch this space for future work

Node Scaling

Cluster Autoscaler (CA)

Cluster autoscaler scales the number of nodes in a cluster based on the number of pods pending scheduling.

Cluster Autoscaler

How it works:

  1. CA checks for pods in the pending state
  2. CA asks the cloud provider to provision a node
  3. The node gets provisioned by the cloud provider
  4. The node joins the cluster
  5. Pending pods get scheduled to the newly provisioned node

Factors to consider when using CA

  • Pods can go into a pending scheduling state when the scheduler doesn’t find any node that has the resources requested by the pod.
  • When pods are horizontally scaled out, new replicas can also be in the pending scheduling state.
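To get a feel for how pending pods translate into new nodes, here is a rough first-fit estimate. It is only an illustration under simplifying assumptions: every new node has the same size, only CPU (millicores) and memory (MiB) requests matter, and the function name `nodes_needed` and the numbers are made up. The real Cluster Autoscaler simulates scheduling against the node groups configured for your cloud provider.

```python
def nodes_needed(pending_pods, node_cpu_m, node_mem_mi):
    """First-fit estimate of how many new nodes the pending pods require."""
    nodes = []  # each entry is [free_cpu_m, free_mem_mi] on a newly added node
    # Place the largest pods first so they are not squeezed out by small ones.
    for cpu_m, mem_mi in sorted(pending_pods, reverse=True):
        if cpu_m > node_cpu_m or mem_mi > node_mem_mi:
            raise ValueError("pod does not fit on an empty node of this size")
        for free in nodes:
            if cpu_m <= free[0] and mem_mi <= free[1]:
                free[0] -= cpu_m
                free[1] -= mem_mi
                break
        else:
            # No existing new node has room, so one more node is required.
            nodes.append([node_cpu_m - cpu_m, node_mem_mi - mem_mi])
    return len(nodes)


# (cpu request in millicores, memory request in MiB) for each pending pod
pending = [(500, 512), (500, 512), (1500, 1024), (250, 256)]
print(nodes_needed(pending, node_cpu_m=2000, node_mem_mi=4096))  # 2
```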

Getting hands dirty

Now that you know the concepts around autoscaling, head over to the repository mentioned to deploy your Docker container in Kubernetes with Autoscaling!

Create your own Kubernetes Deployment with HPA enabled, the easy way:

Disclaimer: Up until 3 months ago, I had never worked on Kubernetes. Sure, I had read about how it works, but I didn’t have any actual hands-on experience with the platform before I joined Fynd. Feel free to correct me in the comments section below if I got something wrong :)

Special thanks go to Pradeep Tiwari, Neeraj Shukla, Gaurav Gola for giving me a chance to work on this amazing platform and grooming me during the initial phase.



Arjun Sunil

I talk about real world experiences in Tech and Scaling Deep Learning based workloads | Reach out via @arjun921