How I set up Autoscaling on Kubernetes, and you can too!

Bunny in the hole
  • Nodes: Machines that make up the Kubernetes cluster
  • Pods: Tiny residents of the cluster. A pod wraps one or more containers, so for single-container workloads the two terms are often used interchangeably.
  • Containers: Your code resides here inside Docker containers. These run inside the pods I mentioned above.
  • Deployments: Kubernetes resources that manage most Pods on the cluster.
  • Resources/Requests: The minimum required CPU/Memory for a pod to run
  • Resources/Limits: The maximum CPU/Memory a pod is allowed to use

Must-haves for Kubernetes autoscaling

Well, before we get started, let us make sure two things are in place.

Resource min/max bounds must be defined for all pods/deployments

Resource Bounds

Here’s what a normal deployment looks like when resource bounds aren’t defined:

resources: {}
  # limits:
  #   cpu: 128m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 64Mi

And here’s the same block with the bounds defined:

resources:
  limits:
    cpu: 128m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 64Mi


Limits define the upper bound for the memory and CPU a single pod can use. If these are not set, the pod is free to consume as much CPU and memory as is available on the host node. This is dangerous: I’ve noticed many times that undefined limits cause a node to go into the NotReady state. It tends to happen when a pod gets overloaded and asks the host machine for more computational power than it has.

Requests, as the name suggests, are the amount of CPU and memory that a pod/deployment asks for. The scheduler uses them to pick a node with enough free capacity for the pod.

Why are Requests/Limits crucial?

To understand why requests and limits are important, think of the node as a balloon and a pod’s resource utilization as the air inside it. If hard limits aren’t set, the air is free to expand as much as it wants. When that happens, the balloon bursts:

Node Status: NotReady

Setting hard limits keeps the air from expanding past what the balloon can hold.
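
As a rough sketch of where these bounds live in a full manifest, here is a minimal Deployment with requests and limits set on its container. The name `web`, the labels and the `nginx` image are placeholders I made up for illustration; the CPU/memory values are the ones from the snippet above.

```yaml
# Hypothetical Deployment: name, labels and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25      # placeholder image
          resources:
            requests:            # what the scheduler reserves for the pod
              cpu: 100m
              memory: 64Mi
            limits:              # hard ceiling for the container
              cpu: 128m
              memory: 128Mi
```

Note that the bounds are set per container, not per pod; a pod's total is the sum over its containers.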

Metrics-server needs to be installed on Kubernetes

Although autoscaling has its own resource definitions in Kubernetes, at the heart of autoscaling is metrics-server. The metrics server continuously polls the CPU/Memory utilization of all nodes and pods, which is used to work out whether a pod is at peak load and needs to be scaled out.

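To check whether metrics are being collected, run:

kubectl top pods
kubectl top nodes

If no metrics source is available, the commands fail with an error such as:

Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)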

Autoscaling Concepts

Autoscaling in Kubernetes can refer to horizontal scaling of pods and nodes or vertical scaling of pods.

Pod Scaling

Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas running for a workload. It generally scales against a target threshold, which can be based on average or current CPU/memory utilization.

Horizontal Pod Autoscaler
  1. HPA checks metrics every 30 seconds (default)
  2. If CurrentMetricValue > targetThreshold: update ReplicaCount
  3. The Deployment controller/Replication Controller rolls out the updated number of pods
  4. Repeat from Step 1.
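
The loop above is driven by a HorizontalPodAutoscaler object. Here is a minimal sketch, assuming a Deployment named `web` (a placeholder) and a cluster recent enough to serve the `autoscaling/v2` API. The 70% target and the replica bounds are arbitrary example values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                      # placeholder name
spec:
  scaleTargetRef:                # the workload whose replica count is managed
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2                 # example lower bound
  maxReplicas: 10                # example upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # targetThreshold from step 2, in percent
```

Utilization here is measured as a percentage of the pods' CPU requests, which is one reason the resource bounds from earlier are a must-have.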

Factors to remember when using HPA

  • Cooldown period: The Kubernetes HPA has a cooldown period that allows metrics to stabilize before triggering the next scale-up or scale-down event. The default timings are as follows:
      • 30-second metrics check interval
      • 3-minute wait after the last scale-up
      • 5-minute wait after the last scale-down
    These defaults vary by Kubernetes version; recent releases check every 15 seconds and express the waits as a configurable stabilization window.
  • HPA tends to work best with Deployment objects over Replication Controllers.
  • HPA by default works on CPU/Memory utilization, and can be configured to use custom metrics like queries per minute (for DB scaling) or requests per minute (for web server scaling).
  • Average CPU/Memory utilization is available with apiVersion: autoscaling/v2beta1 (superseded by autoscaling/v2 in current releases).
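
To make the cooldown and custom-metric points a bit more concrete, here is a hedged sketch using the `behavior` field of `autoscaling/v2`. The Deployment name `web` and the metric name `requests_per_second` are hypothetical, and a `Pods` metric like this only works if a custom metrics adapter is installed to serve it. The stabilization windows approximate the waits listed above; they are not identical to a fixed cooldown.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-custom               # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 180   # roughly the 3-minute wait
    scaleDown:
      stabilizationWindowSeconds: 300   # roughly the 5-minute wait
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second     # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "100"           # example: 100 requests/s per pod
```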

Setting Utilization Targets

One important factor to consider when setting target utilization: the more conservative (lower) the target is, the more headroom your cluster has available while it scales up.
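
To see the effect of the target, it helps to run the numbers. Kubernetes documents the HPA calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The fragment below uses example values only and works through that formula in its comments.

```yaml
# Fragment of an HPA spec (autoscaling/v2); all values are examples.
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # conservative target
# With 4 replicas averaging 80% CPU:
#   target 50% -> ceil(4 * 80 / 50) = ceil(6.4)  = 7 replicas
#   target 75% -> ceil(4 * 80 / 75) = ceil(4.27) = 5 replicas
```

The lower target scales out earlier and further, at the cost of running more pods.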

Vertical Pod Autoscaler (VPA)

VPA, as the name suggests, scales the pod’s resource requests. This means that if a pod is under load and crossing a threshold (say, 70% CPU based on the limit), VPA will automatically increase the requests to accommodate the higher CPU requirements.

Vertical Pod Autoscaler
  1. VPA checks metrics every 10 seconds (default)
  2. If CurrentMetricValue > targetThreshold: update resources: requests
  3. Pods are restarted so the new resources take effect
  4. Repeat from Step 1
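
A minimal VerticalPodAutoscaler object for this flow might look like the sketch below. It assumes the VPA components are installed in the cluster (they are not part of core Kubernetes) and that they serve the `autoscaling.k8s.io/v1` API; older installs may use a beta version. The Deployment name `web` is a placeholder and the min/max values are arbitrary examples.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                  # placeholder name
spec:
  targetRef:                     # the workload whose pods are resized
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"           # VPA may evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"       # apply to every container in the pod
        minAllowed:              # example floor for recommendations
          cpu: 100m
          memory: 64Mi
        maxAllowed:              # example ceiling for recommendations
          cpu: "1"
          memory: 512Mi
```

Setting `updateMode: "Off"` instead makes VPA publish recommendations without restarting anything, which is a gentler way to try it out.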

Factors to consider when using VPA

  • Resource changes won’t take effect in pods without pod restarts
  • VPA changes requests; it doesn’t change limits
  • It doesn’t work well if HPA is rolled out for the same deployment on the same CPU/memory metrics
  • VPA can be used alongside HPA if HPA uses custom/external metrics
  • Still in beta, to know more about limitations refer here and watch this space for future work

Node Scaling

Cluster Autoscaler (CA)

Cluster autoscaler scales the number of nodes in a cluster based on the number of pods pending scheduling.

Cluster Autoscaler
  1. CA checks for pods in pending state
  2. CA requests a new node from the cloud provider
  3. The cloud provider provisions the node
  4. Node joins the cluster
  5. Pending pods get scheduled to newly provisioned node

Factors to consider when using CA

  • Pods can go into a pending scheduling state when the scheduler doesn’t find any node that has the resources requested by the pod.
  • When pods are horizontally scaled out, new replicas can also be in the pending scheduling state.
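
As an illustration of the first point, consider a pod like the hypothetical one below. If no node in the cluster has 2 CPUs and 4Gi of memory unreserved, the scheduler leaves it Pending, and that is the signal the Cluster Autoscaler acts on. The name and image are placeholders and the request sizes are arbitrary.

```yaml
# Hypothetical pod with requests large enough to stay unschedulable
# on a cluster of small nodes.
apiVersion: v1
kind: Pod
metadata:
  name: big-worker               # placeholder name
spec:
  containers:
    - name: worker
      image: nginx:1.25          # placeholder image
      resources:
        requests:
          cpu: "2"               # needs 2 full CPUs unreserved on one node
          memory: 4Gi
```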

Getting hands dirty

Now that you know the concepts around autoscaling, head over to the repository mentioned to deploy your Docker container in Kubernetes with Autoscaling!



I talk about real world experiences in Tech and Scaling Deep Learning based workloads | Reach out via @arjun921

Arjun Sunil