How to Scale Your Cluster to Over 2000 Nodes on AWS EKS

Reut Sharabani
Nanit Engineering
11 min read · Mar 11, 2021


In this article we’re going to go over the steps needed to scale an EKS cluster to more than 2k nodes.

Why?

Nanit’s cluster is growing with the company. The number of servers (nodes) we have during peak time has exceeded 2,000 for some time now. Our system knows how to handle the load increase and is designed to scale with our customers. But we ran into a different issue.

Our AWS Elastic Load Balancers have limitations in the way they work, and we recently hit a quota on the number of targets they can route traffic to. A brief overview of Elastic Load Balancers:

Classic Load Balancer (CLB): Layer 4/ Layer 7 load balancer. This is the default load balancer kubernetes provisions on AWS when you create a service resource with {type: LoadBalancer}. To use a different type you have to do more work.

Application Load Balancer (ALB): Layer 7 load balancer. Has many features since it is aware of the underlying protocol. ALB can be used in multiple ways. We use this type of LoadBalancer where we want to serve HTTP(S) traffic. This load balancer is created using the AWS Load Balancer Controller by provisioning Ingress resources pointing to internal Kubernetes Service resources.

Network Load Balancer (NLB): Layer 4 load balancer. Routes TCP/UDP. Has a leaner feature set than ALBs and CLBs since it operates purely at layer 4. We use this type of load balancer when the exposed service does not use HTTP for communication. This load balancer is created using the AWS Load Balancer Controller by annotating the Service resource.

Architecture:

Until today we used the default Classic Load Balancers.

The Classic Load Balancer routes traffic to all kubernetes nodes and relies on kube-proxy (usually using iptables/netfilter) to route traffic to pods. This means that it registers all nodes as “instances” (targets). If you have many nodes and reach your quota, the load balancer will not be provisioned, your service will show an error stating it cannot register any more targets, and you will not be able to route traffic to your services.

So routing looked like this:

Load Balancer (instance mode) ->
Node's kube-proxy (iptables/netfilter) ->
target pod

The ELB forwards requests to the node’s kube-proxy. kube-proxy is then responsible for finding the right pod and routing the request to it. The ELB holds the addresses of all nodes in the cluster.

Application and Network Load Balancers register targets in Target Groups. Each Target Group has a limit of 1,000 targets, similar to the Classic Load Balancer’s registered-instance limit (its hard limit is 2,000, which is the limit we hit). They are more sophisticated and can route by pod IP instead of relying on kube-proxy. This registers only the required pods as targets, which is a significantly lower number for us, and it solved our problem so we can continue scaling up our nodes and services.

So the routing is much simpler and looks like this:

Load Balancer (ip-mode) ->
target pod

This is one less network hop in the cluster, and it also does not rely on kube-proxy being updated by the API server with the routes to pods.

Installation:

To use Application and Network Load Balancers you need to deploy the AWS Load Balancer Controller. This is a simple deployment that creates a pod that listens to cluster events (e.g. the creation of an Ingress resource) and provisions AWS resources in response.

Controller: To deploy the controller we used a helm chart.

First we added the eks-charts repository:

helm repo add eks https://aws.github.io/eks-charts

And then we installed the chart:

helm upgrade -i aws-load-balancer-controller eks/aws-load-balancer-controller --set clusterName=$CLUSTER_NAME

Permissions: For the controller to work, it needs IAM permissions to manage Elastic Load Balancing resources, along with some other permissions. There is a guide on how to set the proper permissions using a Service Account (IAM Roles for Service Accounts), but it may vary with your setup. We did not use this.

Subnet Discovery: The controller needs to discover subnets for internal/public ELBs. In our case we only had to add the public tag to our subnets, {kubernetes.io/role/elb: 1}, since we use the ELBs for public traffic; if you use internal ELBs you should use {kubernetes.io/role/internal-elb: 1}. Subnets must also be tagged with {kubernetes.io/cluster/${cluster-name}: shared} (or owned). The cluster tag is required by the Kubernetes cloud-controller-manager, so it is highly likely you already have it and no change is needed.
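For reference, tagging a subnet can be done with the AWS CLI; a quick sketch (the subnet ID and cluster name below are placeholders):

aws ec2 create-tags \
  --resources subnet-0abc1234567890def \
  --tags Key=kubernetes.io/role/elb,Value=1 Key=kubernetes.io/cluster/my-cluster,Value=shared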

Heads Up: The controller has a broad list of features. Not all features are available in all regions. In our staging environment we hit some restrictions that prevented the controller from working. Luckily, the aws-load-balancer-controller pod logs are very helpful in debugging these problems. Just make sure that if it doesn’t work you check the pod logs first.

Note: Make sure you go over the readiness gate, as it is almost certainly needed in your cluster as well. Since it is not technically a requirement and is a bit more detailed, it is covered in the next section.

Pod Readiness Gate

One thing that was immediately obvious to us is that NLB target registration is slow. This seems to be a known issue rather than a misconfiguration, but the bigger problem it exposed is that kubernetes considers pods “ready” long before they are registered with the load balancers. If target registration is fast this is usually transparent, but if you think about how kubernetes rolls out a service, it is a major problem.

Kubernetes can roll out many pods while many of them (or even all of them) are not yet registered with the elb (especially an NLB, since registration is slow). If we have a deployment of 10 pods that takes 3 minutes to roll out, but registering a target with the elb takes 4 minutes per target, then when we roll out this deployment all the old pods will be replaced before any new pod is registered(!). This is not a problem when you target pods by instance (meaning through nodes, using kube-proxy), since the nodes stay up during the rollout and kube-proxy takes care of routing inside the cluster. But it is a problem when you want the elb to target a specific pod (ip-mode).

This is a complete show-stopper for most businesses.

Fortunately, there is a simple solution using a Pod Readiness Gate. A pod readiness gate adds an extra readiness condition to the pod, so that a pod that isn’t yet registered with the load balancer in a healthy state isn’t ready from kubernetes’ perspective either. This results in slower but stable rollouts, where a pod that isn’t actually registered with the elb isn’t considered ready.

To make use of the pod readiness gate, all that is needed is a label for the namespace(s) you’re using:

kubectl label namespace yournamespace elbv2.k8s.aws/pod-readiness-gate-inject=enabled

That’s it. You’re now using the Readiness Gate. What it does is inject the necessary readiness gate configuration into the pod spec via a mutating webhook during pod creation. This means that any pod that meets all of the following:

  • There exists a service matching the pod labels in the same namespace
  • There exists at least one target group binding that refers to the matching service
  • The target-type is IP

Will only be considered ready once it is actually registered with the elb.
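For illustration, the readiness gate the webhook injects into the pod spec looks roughly like this (the condition type suffix is derived from the generated TargetGroupBinding name, so the exact value will differ in your cluster):

spec:
  readinessGates:
    # condition managed by the aws-load-balancer-controller for this target group
    - conditionType: target-health.elbv2.k8s.aws/k8s-default-my-ingress-alb-b31903940e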

To see the readiness gates status you can add the `-o wide` flag when listing pods:

kubectl get po … -o wide

Initially the readiness gates should be set to “None” since no pod meets the criteria (we did not create target-type=ip load balancers yet).

Creating Load Balancers

You are now ready to create load balancers for your deployments. There are two types of load balancers (ALB/NLB) and two target-types (instance/ip).

For our use case we wanted ip-based routing directly to the pods to resolve our Classic Load Balancer target limit. Again, this means we point traffic directly to pods (via their IPs) and not to nodes (relying on the node’s kube-proxy).

Application Load Balancer

To provision an application load balancer we create an Ingress resource:
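The manifest embedded in the original post is not reproduced here, but a minimal sketch matching the description below might look like this (names and annotation values are placeholders; with the networking.k8s.io/v1 API, path: / with pathType: Prefix plays the role of the /* path):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    # let the AWS Load Balancer Controller handle this ingress
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    # register pod IPs directly instead of nodes
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80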

This is a minimal ingress that directs all traffic (/*) to a service named “my-service” on port 80. There are many other useful settings: you can do SSL termination, configure health-checks, limit allowed CIDRs, and more.

When you create this ingress resource the AWS Load Balancer Controller will listen to the ingress creation event and provision an Application Load Balancer on AWS. It will also create a TargetGroupBinding on kubernetes and a TargetGroup on AWS (EC2 dashboard).

You can interact with this ingress resource just like any other kubernetes resource:

kubectl get ingress

Just as it is with services, your load balancer address will be under the “address” column.

You should have a log entry in your aws-load-balancer-controller pod:

{"level":"info",
"ts":1612311779.026585,
"logger":"controllers.ingress",
"msg":"successfully deployed model",
"ingressGroup":"default/my-ingress"}

More informative log messages will follow (ingress created with spec, target group creation, load balancer creation, and listener creation).

You can also see the TargetGroupBinding resource created:

kubectl get targetgroupbinding

You can see the service and port it points to, the target type (ip) and if you use -o wide you will get its ARN as well.

And the log entry should be:

{"level":"info",
"ts":1612362232.167822,
"logger":"controllers.ingress",
"msg":"created targetGroupBinding",
"stackID":"default/my-ingress",
"resourceID":"default/my-ingress-alb:80",
"targetGroupBinding":{"namespace":"default",
"name":"k8s-default-my-ingress-alb-b31903940e"}}

Network Load Balancer

To provision a network load balancer we created a Service resource:
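The manifest from the original post is not reproduced here, but a minimal sketch matching the description below might look like this (the service name is a placeholder; the nlb-ip annotation value is how the controller version current at the time requested an IP-mode NLB, while newer controller versions use aws-load-balancer-type: external together with aws-load-balancer-nlb-target-type: ip):

apiVersion: v1
kind: Service
metadata:
  name: my-nlb-service
  annotations:
    # ask the AWS Load Balancer Controller for an NLB that targets pod IPs
    service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
spec:
  type: LoadBalancer
  selector:
    somelabel: somevalue
  ports:
    - port: 80         # port the NLB listens on
      targetPort: 8080 # port the pods listen on
      protocol: TCP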

This is a minimal Network Load Balancer deployment. It will create a network load balancer that sends traffic directly to the pods, routing all traffic on port 80 to pods with the proper selector (somelabel=somevalue) on port 8080.

Network Load Balancers use mandatory passive TCP health-checks. You cannot disable, configure, or monitor them (via AWS). This means they will try to open a socket (nothing more) on the health-check port (which defaults to the traffic port), and if they succeed they count the check as passed. This, however, is a problem if you do not respond very quickly. You should make sure the response to passive TCP health-checks is minimal. You can also close the socket immediately upon connection.

Some of the settings are strict (see “Health check settings” in the AWS documentation), and your aws-load-balancer-controller may report an error if you use an invalid value. One example is the health-check interval: it can only be 10 or 30 seconds, and since the health checker uses a distributed consensus mechanism it queries the service much more often than the configured interval. See “HealthCheckIntervalSeconds” under “Health check settings” in the linked page.

Once you have created the service, you should very shortly have a Network Load Balancer on AWS, a TargetGroupBinding resource on kubernetes, and a TargetGroup on AWS.

You can query them like any other service resource.

You will also see pretty similar logs in your AWS Load Balancer Controller to those you saw when the Application Load Balancer was created.

On both Application and Network Load Balancers you should see the status of the readiness gates when you list the pods with `-o wide` (or describe the pod):
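Illustrative output (the pod name, addresses and node are made up; the relevant part is the READINESS GATES column on the right):

kubectl get po -n yournamespace -o wide

NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                          NOMINATED NODE   READINESS GATES
my-service-7d5bf6c9d4-x2kvq   1/1     Running   0          4m    10.0.42.17   ip-10-0-42-100.ec2.internal   <none>           1/1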

You should also make sure the pods for your deployments are not considered ready before they are registered to their TargetGroup on AWS in “healthy” state. This is crucial for Network Load Balancers where registration is pretty slow.

proxy protocol v2 / headers on socket open

Some of our deployments require the client’s address to operate. To get the client address we use the proxy protocol. Classic Load Balancers have an option to use proxy protocol v1, but Network Load Balancers (which are the ones where we sometimes need client addresses) only support proxy protocol v2.

To enable the proxy protocol headers we add an annotation to our Service resource:

service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true

You can alternatively add:

service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: “*”

But it may be nicer to have all target group attributes on one entry.

If you added this annotation to your Service resource, the load balancer will send the proxy protocol v2 header, which includes the client’s address information, on each connection. Make sure you process it accordingly.

Lazy headers

One thing to note is that headers are pushed lazily. This means that if no data is sent from the client after the connection is made — no headers will be sent. To change this behaviour we had to contact AWS since there is an internal limitation. We also had to set:

proxy_protocol_v2.client_to_server.header_placement=on_first_ack

For the Target Group to have the headers sent immediately on connection.

You can specify Target Group attributes using the annotation `aws-load-balancer-target-group-attributes`.

This is highly recommended, as the aws-load-balancer-controller will create the Target Group with the right attributes and you don’t have to set them yourself:
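As a sketch, both attributes from this section can go into that single annotation on the Service resource, comma-separated (as noted above, the header_placement attribute only took effect for us after contacting AWS):

metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true,proxy_protocol_v2.client_to_server.header_placement=on_first_ack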

These two attributes will enable proxy protocol and make sure it’s sent eagerly even if no data was sent from the client.

Monitoring and Alerting

Just like any other production system, our system has to be up. If something happens and customers can’t use our service — we want to know. This means we have to monitor all of our services and alert when something bad happens.

To monitor our services we use Grafana and Prometheus. While the AWS Load Balancer Controller does export some metrics, it is far from what is needed to reliably monitor our systems. The information we need is in CloudWatch. You could do monitoring and alerting in CloudWatch directly, but one of its downsides is that it does not seem to be as powerful as Prometheus.

To pull information from CloudWatch into our prometheus server we used the Prometheus Cloudwatch Exporter. This is a deployment (installation details are in the repo) that does what the name suggests: it exports metrics from CloudWatch to Prometheus by querying CloudWatch at a configurable interval. It takes a YAML configuration file, runs the specified queries, and exposes the results to prometheus just like any other pod exposes metrics.

One metric we’re interested in is the percentage of healthy hosts. To get this metric we need the number of healthy hosts divided by the number of total hosts. We tried using the CloudWatch integration but gave up since prometheus seems much more suitable for these tasks.

To pull the relevant metrics from CloudWatch you can use this configuration:
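The configuration from the original post is not reproduced here; a minimal sketch (the region and intervals are placeholders, and the resulting Prometheus metric names follow the exporter’s own naming conventions):

region: us-east-1
period_seconds: 60
delay_seconds: 300
metrics:
  # healthy/unhealthy host counts for Application Load Balancer target groups
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: HealthyHostCount
    aws_dimensions: [TargetGroup, LoadBalancer]
    aws_statistics: [Average]
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [TargetGroup, LoadBalancer]
    aws_statistics: [Average]
  # the same counts for Network Load Balancer target groups
  - aws_namespace: AWS/NetworkELB
    aws_metric_name: HealthyHostCount
    aws_dimensions: [TargetGroup, LoadBalancer]
    aws_statistics: [Average]
  - aws_namespace: AWS/NetworkELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [TargetGroup, LoadBalancer]
    aws_statistics: [Average]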

This will instruct the exporter to collect and export the specified metrics to the prometheus server that scrapes it.

Once you have these metrics you can use the following query to know the percentage of healthy hosts:
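The exact query is not reproduced here; a sketch of the idea, assuming the exporter configuration above (the metric and label names are the exporter’s defaults, and label_replace is only used to copy the target group dimension into a common label so ALB and NLB series can be summed together):

100 *
sum by (tg) (
    label_replace(aws_applicationelb_healthy_host_count_average offset 15m, "tg", "$1", "target_group", "(.+)")
      or
    label_replace(aws_networkelb_healthy_host_count_average offset 15m, "tg", "$1", "target_group", "(.+)")
)
/
sum by (tg) (
    label_replace(aws_applicationelb_healthy_host_count_average offset 15m
      + aws_applicationelb_un_healthy_host_count_average offset 15m, "tg", "$1", "target_group", "(.+)")
      or
    label_replace(aws_networkelb_healthy_host_count_average offset 15m
      + aws_networkelb_un_healthy_host_count_average offset 15m, "tg", "$1", "target_group", "(.+)")
)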

It formats the metrics nicely using “label_replace” and aggregates Network and Application Load Balancers together, so you can set a single alert for both if the value falls below a certain threshold. The 15-minute offset accounts for the exporter’s delay.

Summary

As your business grows, you’ll run into scale issues. These types of problems are not always expected, and it’s best to deal with them before things break. This post shows how we allow more than 2,000 nodes in our cluster using Amazon’s newer types of elastic load balancers. It also goes over the issues we ran into so far and how we resolved them (readiness gates, pushing proxy protocol headers eagerly). Hopefully others will find it useful.

There are many features to EKS in general and to load balancers specifically. We do much more with them than what we’ve shown here for this specific issue. Getting to know their limits, as well as their benefits, can help your cluster stay healthy and grow steadily.
