TL;DR: I use the KEDA SQS scaler
As part of the project I mentioned in my last post, files are uploaded by the user and stored in an S3 bucket. Then I need to process them and do something with that information. The files that get uploaded will probably number in the hundreds or thousands per user, and the processing might take several seconds or more per file. Ideally, users can upload as many of them at once as they have; I don't want users worrying about throttling themselves when they start uploading. As part of my plans, I want a desktop-side client that just finds all the files in a folder and uploads them all.
One thing I could do is only process files in bulk in some batch processing workflow. That's lame though. I don't want users (aka me) to have to wait a day or a few hours to see the results of the analysis after uploading. Maybe one day I actually have to do that to handle the scale, but if that's the case, that's an amazing success for the project. For now, what I'm doing instead is queueing up the upload events and putting them into an SQS queue. In my terraform, that basically looks like this:
resource "aws_s3_bucket" "project" {
bucket_prefix = local.project_bucket_name
}
resource "aws_sqs_queue" "project" {
name_prefix = "project"
message_retention_seconds = 900
}
resource "aws_s3_bucket_notification" "project" {
bucket = aws_s3_bucket.project.id
queue {
events = ["s3:ObjectCreated:*"]
queue_arn = aws_sqs_queue.project.arn
}
}
# And some permissions stuff that I didn't include
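For reference, the omitted permissions piece is roughly a queue policy that lets S3 publish to the queue. This is a sketch rather than my exact config; the statement shape is the standard pattern, and the aws:SourceArn condition keeps other buckets from writing into the queue:

```terraform
# Sketch: allow S3 to deliver event notifications to the queue.
# Resource names assume the bucket and queue defined above.
resource "aws_sqs_queue_policy" "project" {
  queue_url = aws_sqs_queue.project.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.project.arn
      Condition = {
        ArnEquals = { "aws:SourceArn" = aws_s3_bucket.project.arn }
      }
    }]
  })
}
```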
I have my S3 bucket, then I create an SQS queue and attach an s3:ObjectCreated notification to it. That means every time an object is uploaded, a message gets put into the SQS queue with the full key of the new object.
Now, in my fever dream where I get thousands of users all trying to upload their backlog of files all at the same time, I can handle an unlimited number of upload messages. S3 shouldn’t have any problems keeping up with the direct uploads, and the SQS queue should be able to hold all the messages.
But Aaron, if you have unlimited messages in the queue, or just a ton of them, and it takes a few seconds per file to be processed, won't that mean your users have to wait hours or days anyway? Just like the batch processing you said you wanted to avoid.
You’re right, listener person, and frankly I’m getting kind of sick of you pointing out all my flaws. Alas, we go on. I continue my adversarial relationship made up of these hallucinated conflicts.
Scaling the consumers
In my post about scaling from 0, I set up the KEDA scaler, which has way more plugins than just queuing up HTTP requests. Right now, the AWS SQS Queue scaler is looking mighty tasty. Unfortunately, it only supports scaling by queue depth. I really wish it would scale based on the average age of messages in the queue; I think that's a much better metric for understanding the SLA/user experience of people who are waiting for their messages to be processed. We'll make do though. This is already more work than I'll probably ever need, but it's kind of the minimum needed to make me feel good about the system's design.
There's a second-order effect for why having this autoscaler is awesome. With SQS, one of the components of pricing is the number of API requests. Even if you make a request for messages and there aren't any, you pay for that request. The most common cause of a high number of SQS requests is empty receives. The KEDA scaler lets me scale these consumers down to 0, so if there isn't any work to do, then there are no workers sending long poll requests.
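Put together, the scaler config ends up looking something like this ScaledObject. This is a sketch, not my actual manifest: the deployment name, queue URL, region, and thresholds are all placeholders, and identityOwner: operator is the setting that makes the scaler use the KEDA controller's own permissions, which I get into below:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: file-processor          # placeholder name
spec:
  scaleTargetRef:
    name: file-processor        # the consumer deployment being scaled
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/project  # placeholder
        queueLength: "5"        # target messages per replica
        awsRegion: "us-east-1"
        identityOwner: operator # use the KEDA operator's own AWS identity
```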
Wrapping up
The last thing I want to talk about is the IAM permissioning of the KEDA controller. According to the configuration, there's a way to use the identity of the application being scaled out to read the SQS metrics that drive the scaling. In practice, this is a little inconvenient because you have to set up a trust policy so that the KEDA controller can assume the application's role. I got lazy, and instead of doing the extra step with role assumption, I just configured this scaler to use the permissions of the KEDA controller itself and gave the controller's identity the ability to read metrics about the relevant SQS queue. I don't like having the permissions on this controller, but... it's my personal environment that no one else uses, and it's a read-only set of permissions, so meh.
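If you take the lazy route like I did, the grant itself is tiny. A sketch of what that looks like; the role reference here is a placeholder for however your KEDA operator gets its AWS identity (e.g. IRSA), and the queue ARN assumes the queue defined earlier:

```terraform
# Sketch: read-only SQS access for the KEDA operator's role, scoped to one queue.
# aws_iam_role.keda_operator is a hypothetical reference to the operator's role.
resource "aws_iam_role_policy" "keda_sqs_read" {
  role = aws_iam_role.keda_operator.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sqs:GetQueueAttributes"]
      Resource = aws_sqs_queue.project.arn
    }]
  })
}
```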
That's it for this post. Thanks for reading, and see y'all in the next one.