Distributing stateless network traffic across multiple clouds
I got them multi-cloud index.html pages
I started a new job recently and for the first time in my career, I’ve had to start using Google Cloud. Me being the Kubernetes shill that I am, that meant spinning up a GKE cluster as my way of learning more about the Google Cloud ecosystem. As a long-time (~7 year) AWS person, honestly I kind of hate Google Cloud. They make it easier to get started than AWS, but they do so by picking a bunch of defaults for you and kind of hiding that away, and I disagree with almost all of the defaults that get set.
That being said, I can appreciate that if you just care to get something running, you can do that a lot more easily with GCP than with AWS. But if you just want to get something running, IMO you should be using a more managed platform. Anyways, that’s all probably for another day or another post.
Thanks to the terraform-google-kubernetes-engine module, I was able to get a spot node GKE cluster up and running pretty easily. On top of that, I use ArgoCD for deploying my side projects, and registering another cluster as a target was also really straightforward. I had to create a Kubernetes service account and pass that token to ArgoCD, and I hate static credentials, but that’s fine for now. I think one day I might change it around to do some OIDC based “impersonations” as GCP likes to call them.
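For context, here’s roughly what that looks like. This is a minimal sketch rather than my actual config: the project, network, and secondary range names are placeholders, and the spot flag assumes a reasonably recent version of the module:
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google"

  project_id = var.project_id # placeholder
  name       = "side-projects"
  region     = "us-central1"
  network    = "default"
  subnetwork = "default"

  # Secondary ranges for pods/services; the names here are placeholders.
  ip_range_pods     = "pods"
  ip_range_services = "services"

  node_pools = [{
    name         = "spot-pool"
    machine_type = "e2-medium"
    min_count    = 1
    max_count    = 3
    spot         = true # spot VMs instead of on-demand
  }]
}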
Okay, cool. I have two clusters up. Now what? Do I create a federated mesh between them? Is there something easier? Let me start by telling you what I actually did first, which worked super poorly.
The first thing that I did was to run the container that I use for serving https://mentallyanimated.com in both clusters. It’s just a single static HTML page, but you could replace that with any stateless service. I learned that by default, GCP will actually install an ingress controller for you that leverages a “classic” HTTP(S) load balancer. Okay, simple enough. I learned about things like the BackendConfig CRD. More notably, I learned about the ManagedCertificate CRD, which requests an SSL certificate for you. I pointed my DNS for https://mentallyanimated.com at the load balancer and an hour later my SSL certificate was provisioned.
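For the curious, the ManagedCertificate itself is tiny. Here’s a minimal sketch of one, written with the kubernetes provider’s kubernetes_manifest resource to keep everything in Terraform; the name, namespace, and domain are just examples. The GKE ingress controller then picks it up through a networking.gke.io/managed-certificates annotation on the Ingress:
resource "kubernetes_manifest" "managed_certificate" {
  manifest = {
    apiVersion = "networking.gke.io/v1"
    kind       = "ManagedCertificate"
    metadata = {
      name      = "mentallyanimated"
      namespace = "default"
    }
    spec = {
      # GCP provisions and renews a certificate for these domains.
      domains = ["www.mentallyanimated.com"]
    }
  }
}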
My SSL certificate was provisioned… wait, how? I couldn’t find any actual documentation about the details, but I’m assuming that the GCP load balancer performs an HTTP-01 ACME challenge and serves the validation request automatically for me. I couldn’t find any way to force the ManagedCertificate CRD to do DNS validation for the SSL certificate instead. Little did I know that this would bite me later, but whatever. In the moment, I was excited. I had my first application scheduled in GKE.
To go cross-cloud, I ended up doing the simplest thing I could think of: 50/50 weighted DNS. Essentially, this is what I did:
resource "aws_route53_record" "gcp_www" {
weighted_routing_policy {
weight = 10
}
# ...everything else
}
resource "aws_route53_record" "aws_www" {
weighted_routing_policy {
weight = 10
}
# ...everything else
}
And it actually worked beautifully. I was now sometimes serving my page from AWS and sometimes from GCP. I was multi-cloud. There was just one problem: I had forgotten to add a subject alternative name to my ManagedCertificate for the apex domain, https://mentallyanimated.com with no subdomain. And this is where I started to actually notice problems.
Firstly, it didn’t seem like GCP technically supported that. Even though I was pointing an A record from the domain to a static global IP address that I provisioned through GCP, the SSL certificate would just *never* provision. I waited days, and it never went through. I think that because I was serving everything from two clouds, the HTTP-01 challenge that I believe happens behind the scenes never resolved consistently, because of the DNS load balancing. So I gave up on that.
Cloudflare tunnels
Several years ago, an old mentor of mine told me about how he had successfully migrated to completely and utterly private Kubernetes clusters, including the control plane. For whatever reason, my brain held on to that conversation, and to the fact that he had done it using Cloudflare for ingress. Sure enough, there’s actually a Cloudflare ingress controller that leverages their tunnels to create a connection from your environment into theirs, all through an egress-only connection: the connection originates in your environment and is authenticated into Cloudflare’s. Unfortunately, the ingress controller is super outdated. I started looking at their docs and found that they actually have documentation about exposing applications in Kubernetes, exactly like I wanted.
I reached out to the same mentor and asked if he was using the ingress controller or if he went another route. He said that he runs the cloudflared tunnel service as its own Deployment in his clusters and forwards traffic from there to a gateway/reverse proxy. I had an extremely similar but different idea: what if I run cloudflared as a sidecar to my ingress controller? I tried it, and it worked phenomenally, and all it took was a few lines of Terraform. All I had to do was define a cloudflare_tunnel resource, a cloudflare_tunnel_config, and then add my additional sidecar container. It basically all just looks like this:
resource "cloudflare_tunnel" "mentallyanimated" {
account_id = local.cloudflare_account_id
name = "mentallyanimated"
secret = data.aws_ssm_parameter.cloudflare_mentallyanimated_tunnel_secret.value
}
resource "cloudflare_tunnel_config" "traefik" {
account_id = local.cloudflare_account_id
tunnel_id = cloudflare_tunnel.mentallyanimated.id
config {
ingress_rule {
service = "http://localhost:8000"
}
}
}
resource "helm_release" "traefik" {
# ...everything else
values = [
<<EOF
deployment
additionalContainers:
- name: cloudflared
image: cloudflare/cloudflared:latest
args:
- tunnel
- run
- --token
- ${cloudflare_tunnel.mentallyanimated.tunnel_token}
EOF
]
depends_on = [module.traefik_crds]
}
Since cloudflared and traefik are in the same pod, all I have to do is have cloudflared forward traffic to http://localhost:8000, where traefik is listening. Traefik is already my ingress controller and will route traffic to my other applications based on the Host header. Having cloudflared as a sidecar means that when traefik scales out, so will cloudflared.
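To sketch what that Host-based routing looks like (the resource names here are hypothetical, and I’m again using kubernetes_manifest to stay in Terraform), a plain Ingress with the traefik class is all an application needs:
resource "kubernetes_manifest" "site_ingress" {
  manifest = {
    apiVersion = "networking.k8s.io/v1"
    kind       = "Ingress"
    metadata = {
      name      = "mentallyanimated"
      namespace = "default"
    }
    spec = {
      ingressClassName = "traefik"
      rules = [{
        # traefik matches on the Host header and routes to the backend service.
        host = "mentallyanimated.com"
        http = {
          paths = [{
            path     = "/"
            pathType = "Prefix"
            backend = {
              service = {
                name = "mentallyanimated" # hypothetical service name
                port = { number = 80 }
              }
            }
          }]
        }
      }]
    }
  }
}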
The best part is that since I’m using traefik as my ingress controller, it’s already cloud agnostic. I don’t need to expose it through a cloud-specific load balancer anymore, so I can get rid of my NLB and disable the GKE default ingress controller. My traefik setup in both clusters can be exactly the same, and Cloudflare will do the round robin load balancing across the different clouds. It’s “that easy”.
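On the DNS side, that amounts to one proxied CNAME pointing the hostname at the tunnel. Here’s a sketch, assuming you run the same tunnel token in both clusters so each cluster’s cloudflared registers as a replica of the same tunnel, and Cloudflare spreads requests across whichever replicas are connected:
resource "cloudflare_record" "apex" {
  zone_id = local.cloudflare_zone_id # placeholder
  name    = "@" # the apex, mentallyanimated.com
  type    = "CNAME"
  value   = "${cloudflare_tunnel.mentallyanimated.id}.cfargotunnel.com"
  proxied = true # tunnels only work through Cloudflare's proxy
}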
This solves the problem for stateless applications. I’ll throw “how to do stateful applications cross-cloud” onto my to-do list. For now, this is perfect.