A few weeks ago, I went through how I started setting up some of my services to scale from 0 pods. It actually works super well overall, and I'm pretty happy with it except for one thing: it took about 3 to 3.5 seconds to serve the first request, and I was really wondering if I could do better. If I had to guess, a page loading in 3 seconds would seem totally normal to most people, especially since once the page is loaded, everything else operates very quickly. But to me, 3 seconds was a little too long.
That being said, my experiments here didn’t actually pan out very well, but I still wanted to share what I learned. Not every newsletter post has to be a success, right? Especially not when I try to remind myself that this newsletter is more like a public journal.
So what’s the plan? How do I make sure that I scale from 0 more quickly?
Step 1: Make sure that there’s some compute already available
Most of my applications are configured to request only 100 millicores of CPU, which is just 1/10th of a single core. Keeping a little headroom around helps guarantee that the existing pod allocations leave enough room on a node to quickly burst up a new pod. I use the kubernetes-sigs/descheduler to help consolidate and bin pack pods, and since I try to keep nodes as close to 90% utilization as possible, making sure we don't scale in too aggressively is big for minimizing scale-from-0 time. If there were no room at all, the new pod would have to wait for a new node, which would take a few minutes.
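As an aside, the bin packing piece comes from the descheduler's HighNodeUtilization strategy, which evicts pods from underutilized nodes so the scheduler can pack them onto fuller ones. This isn't my exact config, but a policy sketch looks roughly like this (v1alpha1 schema, with illustrative thresholds):
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  HighNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes using less than these thresholds are treated as
        # underutilized; their pods get evicted so they can be
        # rescheduled onto fuller nodes.
        thresholds:
          cpu: 50
          memory: 50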
So what do I do? I schedule some headroom. The concept is extremely simple: create a PriorityClass in your Kubernetes cluster with a low priority value, then schedule an empty pod that requests the amount of compute you want reserved. When someone makes a request to my scale-from-0 service, the new pod gets scheduled immediately and the headroom pod, with its lower priority, gets preempted and rescheduled at the cluster's earliest convenience. It's basically preallocating the CPU, and doing it in a way that makes sure it's accounted for in any bin packing logic. Here's the exact headroom configuration that I use:
apiVersion: scheduling.k8s.io/v1
description: Priority class used by headroom.
globalDefault: false
kind: PriorityClass
metadata:
  name: headroom
value: -1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: headroom
spec:
  replicas: 1
  selector:
    matchLabels:
      run: headroom
  template:
    metadata:
      labels:
        run: headroom
    spec:
      containers:
        - image: k8s.gcr.io/pause
          name: reserve-resources
          resources:
            requests:
              cpu: 1
              memory: 2Gi
      priorityClassName: headroom
This is extremely simple. The pause container is an application that does nothing except refuse to exit, so it holds the requested resources without doing any work. And since pods without an explicit priority class default to priority 0, which outranks the headroom class's -1, real workloads will always preempt it. Defining a priority class is extremely straightforward too.
Step 2: Pre-pull your images
One of the steps in scheduling a pod from 0 is pulling the new image to the node that's going to run it. Depending on your image size, this can take a while. So one thing I experimented with was pulling images BEFORE actually needing to schedule any pods, while making sure that the imagePullPolicy on my containers was set to IfNotPresent so the pre-pulled copy actually gets reused.
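For example, an illustrative container spec (the tag here is made up):
# With IfNotPresent, the kubelet reuses an image that's already on the
# node instead of pulling it again on every pod start.
containers:
  - name: sudokurace
    image: 911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace:abc123 # hypothetical tag
    imagePullPolicy: IfNotPresent
Worth noting: for images pinned to a specific tag other than latest, IfNotPresent is already the Kubernetes default, so this is mostly about being explicit.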
So I did some internet searching to find ways to pull images. I had the idea of running a Docker-in-Docker DaemonSet that would pull images. I'm pretty sure that would work, but it felt like a maintenance burden that I just didn't want to deal with. Eventually, I stumbled upon a Reddit post that talked about OpenKruise and its ImagePullJob functionality.
You might be wondering: "But Aaron, didn't you want LESS maintenance burden?" I did, but seeing something shiny and new made up for it. OpenKruise has a bunch of other functionality that feels like a native way to extend and improve a Kubernetes cluster, but I was installing it purely to test out the ImagePullJob.
There was a catch with getting this set up, though. OpenKruise has no ability to pull images using the AWS STS token that's provided by the EKS control plane. This was pretty annoying because it meant I needed to manually create a Docker login token for it to use, and I needed to find a way to keep that token updated. That's when I came across this post about running a CronJob to update and save the token. I yoinked the entire implementation and set it up to work in my cluster, along with installing OpenKruise. The entire configuration ended up being about 160 lines of Terraform, which sounds like a lot but is really quite straightforward:
locals {
  kruise_ecr_token_updater_service_account = "kruise-ecr-token-updater"
  kruise_ecr_token_secret_name             = "kruise-ecr-token"
  kruise_ecr_token_updater_script          = <<EOF
ECR_TOKEN=`aws ecr get-login-password --region $${AWS_REGION}`
NAMESPACE_NAME=${kubernetes_namespace.kruise_system.metadata[0].name}
kubectl delete secret --ignore-not-found $DOCKER_SECRET_NAME -n $NAMESPACE_NAME
kubectl create secret docker-registry $DOCKER_SECRET_NAME \
  --docker-server=https://$${AWS_ACCOUNT}.dkr.ecr.$${AWS_REGION}.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$${ECR_TOKEN}" \
  --namespace=$NAMESPACE_NAME
echo "Secret was successfully updated at $(date)"
EOF
}

resource "kubernetes_namespace" "kruise_system" {
  metadata {
    name = "kruise-system"
  }
}

data "aws_iam_policy" "ecr_read_only" {
  arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}

module "kruise_ecr_token_updater_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  create_role = true
  role_name   = "kruise-ecr-token-updater-${local.cluster_name}"
  role_policy_arns = {
    ecr_read_only = data.aws_iam_policy.ecr_read_only.arn
  }

  oidc_providers = {
    irsa = {
      provider_arn               = module.eks-red.oidc_provider_arn
      namespace_service_accounts = ["${kubernetes_namespace.kruise_system.metadata[0].name}:${local.kruise_ecr_token_updater_service_account}"]
    }
  }
}

resource "kubernetes_service_account" "kruise_ecr_token_updater" {
  metadata {
    name      = local.kruise_ecr_token_updater_service_account
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
    annotations = {
      "eks.amazonaws.com/role-arn" = module.kruise_ecr_token_updater_irsa.iam_role_arn
    }
  }
}

resource "kubernetes_role" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  rule {
    api_groups     = [""]
    resources      = ["secrets"]
    resource_names = [local.kruise_ecr_token_secret_name]
    verbs          = ["delete"]
  }

  rule {
    api_groups = [""]
    resources  = ["secrets"]
    verbs      = ["create"]
  }
}

resource "kubernetes_role_binding" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.kruise_ecr_token_updater.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].name
    namespace = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].namespace
  }
}

resource "kubernetes_config_map" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  data = {
    AWS_ACCOUNT        = data.aws_caller_identity.current.account_id
    AWS_REGION         = data.aws_region.current.name
    DOCKER_SECRET_NAME = local.kruise_ecr_token_secret_name
  }
}

resource "kubernetes_cron_job_v1" "kruise_ecr_token_updater" {
  metadata {
    name      = "kruise-ecr-token-updater"
    namespace = kubernetes_namespace.kruise_system.metadata[0].name
  }

  spec {
    schedule = "0 */10 * * *"
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = kubernetes_service_account.kruise_ecr_token_updater.metadata[0].name
            container {
              name  = "kruise-ecr-token-updater"
              image = "odaniait/aws-kubectl:latest"
              command = [
                "/bin/sh",
                "-c",
                local.kruise_ecr_token_updater_script
              ]
              env_from {
                config_map_ref {
                  name = kubernetes_config_map.kruise_ecr_token_updater.metadata[0].name
                }
              }
            }
          }
        }
      }
    }
  }
}

resource "helm_release" "openkruise" {
  name       = "kruise"
  namespace  = "kube-system"
  repository = "https://openkruise.github.io/charts/"
  chart      = "kruise"
  version    = "1.3.0"

  reset_values = true

  set {
    name  = "installation.namespace"
    value = kubernetes_namespace.kruise_system.metadata[0].name
  }

  set {
    name  = "installation.createNamespace"
    value = false
  }
}
So we create a ServiceAccount that can write Secrets to the EKS control plane, and then once every 10 hours we fetch a fresh Docker login token from ECR and store it. ECR tokens are valid for 12 hours, so refreshing every 10 keeps the stored one from ever expiring. That token will be available for OpenKruise to use, so then we install OpenKruise itself.
OpenKruise will run partially as a DaemonSet which is what I was already thinking of doing.
Now, in order to use the ImagePullJob, I simply added the following manifest to the kustomize manifests for each application:
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: sudokurace
  namespace: kruise-system
spec:
  image: '911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace:' # Tag set by kustomization.yaml
  pullSecrets:
    # Must match https://gist.github.com/abatilo/6b287265d541d06da567893c1522999f#file-imagepulljob-tf-L3
    - kruise-ecr-token
Every time the image tag changed, which it would on every deployment, a new ImagePullJob would get created and OpenKruise would download the specified image onto each node in the cluster. Now I wouldn't have to wait for the image to be downloaded whenever a page request landed on a node that didn't have it yet.
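For context on how that tag gets stamped in: kustomize's images transformer only knows about standard container fields out of the box, so a custom resource like ImagePullJob has to be pointed out to it. I won't claim this is exactly my setup, but the shape is roughly like this (file name and tag are illustrative):
# kustomization.yaml (sketch; abc123 is a hypothetical tag)
images:
  - name: 911907402684.dkr.ecr.us-west-2.amazonaws.com/sudokurace
    newTag: abc123
configurations:
  - imagepulljob-transformer.yaml
---
# imagepulljob-transformer.yaml: teach the images transformer where the
# image field lives inside the ImagePullJob custom resource
images:
  - path: spec/image
    kind: ImagePullJob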
Step 3: Reduce the polling interval for pod readiness
As I outlined in the scale-to-0 post, I settled on using the kedacore/http-add-on for scaling from 0. This step is super boring because all I did was set the DeploymentCachePollIntervalMS to a lower number. By default, the add-on checks whether the deployment has new pods once every 250 milliseconds. I dropped that to 100 milliseconds so that we'd notice more quickly when there was a pod ready to take traffic.
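If you want to do the same, a sketch of the kind of patch involved is below. The env var and deployment names here are my assumptions based on the add-on's interceptor config around the version I was running, so double check them against yours:
# Strategic merge patch: lower the interceptor's deployment cache
# polling interval from the 250ms default to 100ms.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-add-ons-http-interceptor # name assumed from the Helm chart
  namespace: keda
spec:
  template:
    spec:
      containers:
        - name: interceptor
          env:
            - name: KEDA_HTTP_DEPLOYMENT_CACHE_POLLING_INTERVAL_MS # assumed var name
              value: "100"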
Wrapping up
I was really excited to get OpenKruise working and start running ImagePullJobs. Unfortunately, or maybe fortunately, my images are all extremely tiny. They're statically linked Go binaries, and even though I include the entirety of each site's static assets, the containers only end up being about 30-40 MB uncompressed. These take milliseconds for my EKS nodes to pull from ECR. They're so fast that even with the images pre-pulled, the containers basically didn't come up any faster at all, maybe 10s of milliseconds. The ImagePullJobs are also not retroactive: they execute once, when the image tag updates, but if new nodes come up afterwards, the image isn't pulled there. Since I run everything on EC2 spot instances, my nodes churn fairly often, which nullified the whole point of pre-pulling.
What really helped with scale-from-0 time ended up being just reducing that polling interval. The containers now get scheduled and respond in about 2.5 seconds, so we're shaving upwards of a full second off the page load. Does it really make a difference? No, probably not, but it was fun to try regardless. I think if I had larger images, like a Python or Node application, the ImagePullJob would still be pretty worth it. Alas, that's not the situation I'm in.
See you for the next one. Thanks for reading.