Treating my EC2 instances like I'm Leonardo DiCaprio
Once they're too old, they gotta get out of here
I’ve talked plenty about how I use EKS in AWS for running all of my side projects. Using Kubernetes to manage all of my actual EC2 instances is great because the management of the instances is pretty much invisible to me. I’m a big fan of purposefully keeping very ephemeral instances, for a bunch of reasons. There are the security benefits, and there are the resiliency/chaos engineering benefits. I don’t think anyone’s actually trying to hack anything that I make, and I break my apps more than any kind of chaos engineering ever would, but just let me pretend that I know what I’m doing, okay?
Now, normally, I use Karpenter.sh as my cluster’s node management tool. I have a provisioner configured with a ttlSecondsUntilExpired that will delete instances if they’ve been around for more than 24 hours. The instances are already spot instances so they’re not as likely to even stick around for that long, but for low utilization instance types like the t3.mediums that I use, I’ve had instances stick around for months if I don’t do something about it.
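For reference, the relevant bit of the provisioner looks roughly like this. It’s a minimal sketch against the v1alpha5 Provisioner API rather than my exact config, and the provisioner and node template names here are placeholders:
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Expire (and replace) any node this provisioner launched after 24 hours
  ttlSecondsUntilExpired: 86400
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.medium"]
  # References an AWSNodeTemplate (subnets, security groups, etc.) defined elsewhere
  providerRef:
    name: default
EOF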
Okay, so, Aaron, you’ve already got something that deletes instances that are old, so why are you making me read any of this? Well, it was decided that, by default, Karpenter should not be schedulable on any nodes that it provisions. I think that kind of makes sense though. If for some reason there’s a problem with your provisioner, you don’t want Karpenter itself to be unable to run because there are no nodes. That means that while Karpenter can handle all the additional worker nodes needed to serve your application requirements, you need to have some workers join your EKS cluster via other means.
The terraform-aws-eks module that I use for provisioning my cluster comes with a handy-dandy built-in variable for configuring the official EKS managed node groups. An EKS managed node group is a regular old autoscaling group with some extra bells and whistles. Here’s a description of the differences written by one of my newest favorite tools, Perplexity.ai:
An AWS Autoscaling Group and an AWS EKS managed node group are both used for automatic scaling and management of EC2 instances in an EKS cluster. However, there are some differences between them. An Autoscaling Group is a collection of EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management. It lets you use Amazon EC2 Auto Scaling features such as health check replacements and scaling policies. On the other hand, an EKS managed node group automates the provisioning and lifecycle management of nodes for your Kubernetes clusters. It provides an abstraction to Amazon EC2 instances and Auto Scaling groups, enabling a set of simple one-step operations in EKS to provision and manage groups of cluster nodes. It is compatible with the Cluster Autoscaler and is backed by EC2 instances in your account which are managed by an Auto Scaling group.
But firstly, what’s wrong with the current setup? Why can’t some of these instances be a little bit more stable? WHAT PROBLEM ARE YOU TRYING TO SOLVE, AARON? Well, you see, what’s been happening is that I kept this base node group at a desired size of 2 instances, one for each replica of Karpenter itself. Then any additional nodes that existed for my actual applications were provisioned by Karpenter. My cluster almost always runs with 3 nodes right now, given how much compute I give each website, things like ArgoCD, etc. However, every 24 hours, the 1 node that’s managed by Karpenter would get terminated and replaced, and all of the pods that were scheduled to that node would end up moving to the existing nodes.
This would happen repeatedly, and after a few weeks of it happening every day, I’d get unlucky and end up with too many pods on a single node, triggering problems with the container network interface trying to assign more IP addresses to a node than are allowed. So I’d have to manually terminate those nodes to get new ones and rebalance the cluster. The last time I did this, I had a cascading failure: my scale-from-zero autoscaler had a bunch of requests queued up, which triggered a bunch of pods to be created, which made everything worse, and zero Karpenter pods could get scheduled, so no new nodes came up. This is basically the worst-case combination of things happening.
Let’s make these EC2 instances disappear after 24 hours also.
AWS Autoscaling Groups themselves have a configuration for MaxInstanceLifetime, which is exactly what I want for these extra nodes. There’s just one problem: the abstraction for the EKS managed node groups doesn’t expose this property at all. So I could go back to using raw autoscaling groups for my nodes, but then I lose all the bells and whistles of the EKS managed node groups… so what do I do?
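Just to show what I mean, on a plain autoscaling group this is a single CLI call (the group name here is made up). As far as I know, 86400 seconds is also the shortest non-zero lifetime the setting accepts, which conveniently is exactly the 24 hours I want:
# Tell a raw ASG to cycle out any instance older than 24 hours
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-raw-worker-asg \
  --max-instance-lifetime 86400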
Like all good infra people know, you glue together a solution with some bash.
#!/bin/bash
set -euo pipefail

# Find every autoscaling group owned by my EKS cluster (named "red")
asgs="$(aws autoscaling describe-auto-scaling-groups --filters Name=tag:kubernetes.io/cluster/red,Values=owned --query 'AutoScalingGroups[*].AutoScalingGroupName' | jq -r '.[]')"

# Kick off a rolling instance refresh on each one
for asg in $asgs; do
  aws autoscaling start-instance-refresh --auto-scaling-group-name "$asg" --preferences '{"InstanceWarmup":90,"MinHealthyPercentage":66}'
done
Voila! Let’s just grab all of the autoscaling groups that are owned by my cluster (named “red”) and trigger an instance refresh on each one. Put this script into a container, run it as a CronJob from within the EKS cluster itself, and bing bang boom.
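For completeness, the CronJob wrapping that script looks roughly like the sketch below. This isn’t my exact manifest: the image, script path, and service account name are placeholders, and whatever identity the pod runs as needs IAM permissions (IRSA or similar) for autoscaling:DescribeAutoScalingGroups and autoscaling:StartInstanceRefresh.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: instance-refresher
spec:
  # Once a day at midnight
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # Service account annotated with an IAM role that can describe ASGs
          # and start instance refreshes
          serviceAccountName: instance-refresher
          restartPolicy: OnFailure
          containers:
            - name: refresh
              # Placeholder image with the aws cli, jq, and the script baked in
              image: registry.example.com/instance-refresher:latest
              command: ["/bin/bash", "/refresh.sh"]
EOF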
As an homage to Leonardo himself, I was hoping to have the instance refresher run once every 25 hours, but alas, cron expressions can’t actually work that way: the hour field resets every day, so even though */25 is syntactically valid, it only ever matches hour 0 and you just end up with a daily run.
As an aside, by the time you’re reading this newsletter post, I’ve already publicly announced that I’ve left my position as an infra and tools engineer with Color. I’m ecstatic to share that I’ll be joining Cohere to do some infra work there. I’m very excited to be working full time in the large language models space, and you can expect that working in such a space will inspire some pretty interesting experiments and learnings for me to write about in the future!