A few weeks ago, I went through how I started setting up some of my services to scale from 0 pods. It actually works super well overall, and I'm pretty happy with it except for one thing: it took about 3 to 3.5 seconds to serve the first request, and I was really wondering if I could do better. If I had to guess, a page loading in 3 seconds would seem totally normal to most people, especially since once the page is loaded, everything else operates very quickly. But to me, 3 seconds was a little too long.
That being said, my experiments here didn’t actually pan out very well, but I still wanted to share what I learned. Not every newsletter post has to be a success, right? Especially not when I try to remind myself that this newsletter is more like a public journal.
So what’s the plan? How do I make sure that I scale from 0 more quickly?
Step 1: Make sure that there’s some compute already available
Most of my applications are configured to request only 100 millicores of CPU, which is just 1/10th of a single core. Keeping a little headroom around helps guarantee that the existing pod allocations leave enough room on a node to quickly burst up a new pod. I use the kubernetes-sigs/descheduler to help consolidate and bin pack pods, and since I try to keep nodes as close to 90% utilization as possible, making sure we don't scale in too aggressively is big for minimizing scale-from-0 time. If there were no room at all, the new pod would have to wait for a new node, which would take a few minutes.
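As an aside, the bin packing piece comes from the descheduler's HighNodeUtilization strategy, which evicts pods from underutilized nodes so the scheduler can pack them onto fuller ones. This isn't my exact config, but a policy sketch looks roughly like this (v1alpha1 schema, with illustrative thresholds):
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  HighNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes using less than these thresholds are treated as
        # underutilized; their pods get evicted so they can be
        # rescheduled onto fuller nodes.
        thresholds:
          cpu: 50
          memory: 50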
So what do I do? I schedule some headroom. The concept is extremely simple: create a PriorityClass in your Kubernetes cluster with a low priority value, then schedule an empty pod that requests the amount of compute you want reserved. When someone makes a request to my scale-from-0 service, the new pod gets scheduled immediately and the headroom pod, with its lower priority, gets preempted and rescheduled at the cluster's earliest convenience. It's basically preallocating the CPU, and doing it in a way that makes sure it's accounted for in any bin packing logic. Here's the exact headroom configuration that I use:
apiVersion: scheduling.k8s.io/v1
description: Priority class used by headroom.
globalDefault: false
kind: PriorityClass
metadata:
  name: headroom
value: -1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: headroom
spec:
  replicas: 1
  selector:
    matchLabels:
      run: headroom
  template:
    metadata:
      labels:
        run: headroom
    spec:
      containers:
        - image: k8s.gcr.io/pause
          name: reserve-resources
          resources:
            requests:
              cpu: 1
              memory: 2Gi
      priorityClassName: headroom
This is extremely simple. The pause container is an application that does nothing except refuse to exit, so it holds the requested resources without doing any work. And since pods without an explicit priority class default to priority 0, which outranks the headroom class's -1, real workloads will always preempt it. Defining a priority class is extremely straightforward too.
Step 2: Pre-pull your images
One of the steps in scheduling a pod from 0 is pulling the new image to the node that's going to run it. Depending on your image size, this can take a while. So one thing I experimented with was pulling images BEFORE actually needing to schedule any pods, while making sure that the imagePullPolicy on my containers was set to IfNotPresent so the pre-pulled copy actually gets reused.
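For example, an illustrative container spec (the tag here is made up):
# With IfNotPresent, the kubelet reuses an image that's already on the
# node instead of pulling it again on every pod start.
containers:
  - name: sudokurace
    image: 911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace:abc123 # hypothetical tag
    imagePullPolicy: IfNotPresent
Worth noting: for images pinned to a specific tag other than latest, IfNotPresent is already the Kubernetes default, so this is mostly about being explicit.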
So I did some internet searching to find ways to pull images. I had the idea of running a Docker-in-Docker DaemonSet that would pull images. I'm pretty sure that would work, but it felt like a maintenance burden that I just didn't want to deal with. Eventually, I stumbled upon a Reddit post that talked about OpenKruise and its ImagePullJob functionality.
You might be wondering: "But Aaron, didn't you want LESS maintenance burden?" I did, but seeing something shiny and new made up for it. OpenKruise has a bunch of other functionality that feels like a native way to extend and improve a Kubernetes cluster, but I was installing it purely to test out the ImagePullJob.
There was a catch with getting this set up, though. OpenKruise has no ability to pull images using the AWS STS token that's provided by the EKS control plane. This was pretty annoying because it meant I needed to manually create a Docker login token for it to use, and I needed to find a way to keep that token updated. That's when I came across this post about running a CronJob to update and save the token. I yoinked the entire implementation and set it up to work in my cluster, along with installing OpenKruise. The entire configuration ended up being about 160 lines of Terraform, which sounds like a lot but is really quite straightforward:
locals {
  kruise_ecr_token_updater_service_account = "kruise-ecr-token-updater"
  kruise_ecr_token_secret_name             = "kruise-ecr-token"
  kruise_ecr_token_updater_script          = <<EOF
ECR_TOKEN=`aws ecr get-login-password --region $${AWS_REGION}`
NAMESPACE_NAME=${kubernetes_namespace.kruise_system.metadata[0].name}
kubectl delete secret --ignore-not-found $DOCKER_SECRET_NAME -n $NAMESPACE_NAME
kubectl create secret docker-registry $DOCKER_SECRET_NAME \
  --docker-server=https://$${AWS_ACCOUNT}.dkr.ecr.$${AWS_REGION}.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$${ECR_TOKEN}" \
  --namespace=$NAMESPACE_NAME
echo "Secret was successfully updated at $(date)"
EOF
}

resource "kubernetes_namespace" "kruise_system" {
  metadata {
    name = "kruise-system"
  }
}

data "aws_iam_policy" "ecr_read_only" {
  arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}

module "kruise_ecr_token_updater_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  create_role = true
  role_name   = "kruise-ecr-token-updater-${local.cluster_name}"
  role_policy_arns = {
    ecr_read_only = data.aws_iam_policy.ecr_read_only.arn
  }

  oidc_providers = {
    irsa = {
      provider_arn               = module.eks-red.oidc_provider_arn
      namespace_service_accounts = ["${kubernetes_namespace.kruise_system.metadata[0].name}:${local.kruise_ecr_token_updater_service_account}"]
    }
  }
}

resource "kubernetes_service_account" "kruise_ecr_token_updater" {
  metadata {
    name      = local.kruise_ecr_token_updater_service_account
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
    annotations = {
      "eks.amazonaws.com/role-arn" = module.kruise_ecr_token_updater_irsa.iam_role_arn
    }
  }
}

resource "kubernetes_role" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  rule {
    api_groups     = [""]
    resources      = ["secrets"]
    resource_names = [local.kruise_ecr_token_secret_name]
    verbs          = ["delete"]
  }

  rule {
    api_groups = [""]
    resources  = ["secrets"]
    verbs      = ["create"]
  }
}

resource "kubernetes_role_binding" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.kruise_ecr_token_updater.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].name
    namespace = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].namespace
  }
}

resource "kubernetes_config_map" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  data = {
    AWS_ACCOUNT        = data.aws_caller_identity.current.account_id
    AWS_REGION         = data.aws_region.current.name
    DOCKER_SECRET_NAME = local.kruise_ecr_token_secret_name
  }
}

resource "kubernetes_cron_job_v1" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  spec {
    schedule = "0 */10 * * *"
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].name
            container {
              name  = "kruise-ecr-token-updater"
              image = "odaniait/aws-kubectl:latest"
              command = [
                "/bin/sh",
                "-c",
                local.kruise_ecr_token_updater_script
              ]
              env_from {
                config_map_ref {
                  name = kubernetes_config_map.kruise_ecr_token_updater.metadata[0].name
                }
              }
            }
          }
        }
      }
    }
  }
}

resource "helm_release" "openkruise" {
  name       = "kruise"
  namespace  = "kube-system"
  repository = "https://openkruise.github.io/charts/"
  chart      = "kruise"
  version    = "1.3.0"

  reset_values = true

  set {
    name  = "installation.namespace"
    value = kubernetes_namespace.kruise_system.metadata[0].name
  }

  set {
    name  = "installation.createNamespace"
    value = false
  }
}
So we create a ServiceAccount that can write Secrets to the EKS control plane, and then once every 10 hours we fetch a fresh Docker login token from ECR and store it. ECR tokens are valid for 12 hours, so refreshing every 10 keeps the stored one from ever expiring. That token will be available for OpenKruise to use, so then we install OpenKruise itself.
OpenKruise will run partially as a DaemonSet which is what I was already thinking of doing.
Now, in order to use the ImagePullJob, I simply added the following manifest to the kustomize manifests for each application:
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: sudokurace
  namespace: kruise-system
spec:
  image: '911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace:' # Tag set by kustomization.yaml
  pullSecrets:
    # Must match https://gist.github.com/abatilo/6b287265d541d06da567893c1522999f#file-imagepulljob-tf-L3
    - kruise-ecr-token
Every time the image tag changed, which it would on every deployment, a new ImagePullJob would get created and OpenKruise would download the specified image onto each node in the cluster. Now I wouldn't have to wait for the image to be downloaded whenever a page request landed on a node that didn't have it yet.
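For context on how that tag gets stamped in: kustomize's images transformer only knows about standard container fields out of the box, so a custom resource like ImagePullJob has to be pointed out to it. I won't claim this is exactly my setup, but the shape is roughly like this (file name and tag are illustrative):
# kustomization.yaml (sketch; abc123 is a hypothetical tag)
images:
  - name: 911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace
    newTag: abc123
configurations:
  - imagepulljob-transformer.yaml
---
# imagepulljob-transformer.yaml: teach the images transformer where the
# image field lives inside the ImagePullJob custom resource
images:
  - path: spec/image
    kind: ImagePullJob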
Step 3: Reduce the polling interval for pod readiness
As I outlined in the scale-to-0 post, I settled on using the kedacore/http-add-on for scaling from 0. This step is super boring because all I did was set the DeploymentCachePollIntervalMS to a lower number. By default, the add-on checks whether the deployment has new pods once every 250 milliseconds. I dropped that to 100 milliseconds so that we'd notice more quickly when there was a pod ready to take traffic.
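If you want to do the same, a sketch of the kind of patch involved is below. The env var and deployment names here are my assumptions based on the add-on's interceptor config around the version I was running, so double check them against yours:
# Strategic merge patch: lower the interceptor's deployment cache
# polling interval from the 250ms default to 100ms.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-add-ons-http-interceptor # name assumed from the Helm chart
  namespace: keda
spec:
  template:
    spec:
      containers:
        - name: interceptor
          env:
            - name: KEDA_HTTP_DEPLOYMENT_CACHE_POLLING_INTERVAL_MS # assumed var name
              value: "100"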
Wrapping up
I was really excited to get OpenKruise working and start running ImagePullJobs. Unfortunately, or maybe fortunately, my images are all extremely tiny. They're statically linked Go binaries, and even though I include the entirety of each site's static assets, the containers only end up being about 30-40 MB uncompressed. These take milliseconds for my EKS nodes to pull from ECR. They're so fast that even with the images pre-pulled, the containers basically didn't come up any faster at all, maybe 10s of milliseconds. The ImagePullJobs are also not retroactive: they execute once, when the image tag updates, but if new nodes come up afterwards, the image isn't pulled there. Since I run everything on EC2 spot instances, my nodes churn fairly often, which nullified the whole point of pre-pulling.
What really helped with scale-from-0 time ended up being just reducing that polling interval. The containers now get scheduled and respond in about 2.5 seconds, so we're shaving upwards of a full second off the page load. Does it really make a difference? No, probably not, but it was fun to try regardless. I think if I had larger images, like a Python or Node application, the ImagePullJob would still be pretty worth it. Alas, that's not the situation I'm in.
See you for the next one. Thanks for reading.