How I built a scalable real-time multiplayer sudoku board in 3 days
I built sudokurace.io in about 30 hours over a long weekend. It’s an entirely free sudoku board where multiple people can play on the same board at the same time. In this newsletter post I wanted to go over the architecture behind it and how I solved the various problems I ran into!
Side note: I shared it on Hacker News as well as on some of my social accounts and it got a little attention. Page analytics are entirely public and open to view.
The back story
I’m a casual fan of playing sudoku. I’ve always had cycles of interest where I’d play a lot for a while and then put it down for months or years. It’s been a nice airplane hobby: you can buy a little booklet, or use an app and download a bunch of puzzles to work on offline. I’ve always liked that you could stop working on a puzzle and pick it up whenever you wanted, and you generally didn’t have to worry about getting back into context or anything like that. I had a math teacher in high school who would always put up a new puzzle on one of her whiteboards and let anyone who wanted try to solve it. I would sometimes be in her room during my free period, and eventually I started working on the board just out of boredom and curiosity. A friend of mine would sometimes join me and it would always become a race. We never said it, we never acknowledged it, but it would become a race. We would each try to fill in new numbers before the other person, and it always felt competitive.
Fast forward several years, and any time I’ve been doing a sudoku in public, the people around me would always get into filling out squares with me. This could be me waiting for my next class in college, decompressing near the end of a work day, or just taking a breather at a family gathering. If I pulled out my phone to work on a sudoku puzzle, people would start telling me what numbers went where. Recently, I re-downloaded a sudoku app and would work on a puzzle before bed as a nighttime ritual. My wife would start helping me fill out the board, and I felt that same competitiveness and urgency to fill in other cells. That’s how the idea for sudokurace.io started.
When I worked at a previous smart home automation company, we used websockets to send real-time updates to our frontend and to send commands to our backend to set smart home device state. It worked really well overall, but once we started doing more with the websocket, we had to figure out how we were going to synchronize state between the client side and the backend. One of my peers had come across JSON Patch and we went all in on it. Over the websocket, we would embed JSON Patch operations and apply them to a copy of the state that we kept both in memory on the centralized hardware in each home and on the client side. This was awesome, but it also meant we needed a handshake process for new clients that seeded them with the latest full state. Years later, I’d find that it was very similar to how Slack works: an initial message would be sent, and that would trigger everything else that needed to happen for the clients to get the latest state. Whenever the state of a device in a home changed, clients would receive a new patch operation, apply it to their local state, and the UI would update accordingly.
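To make that mechanism concrete, here’s a toy sketch of the JSON Patch idea (the real spec is RFC 6902 and covers more operations; this applier handles only `replace`, and the device names below are made up for illustration):

```typescript
// Toy sketch of JSON Patch (RFC 6902): the server ships small operations
// over the websocket instead of full state snapshots. Handles only
// "replace" on object paths; the "devices/lamp" state is hypothetical.
type PatchOp = { op: "replace"; path: string; value: unknown };

function applyPatch<T extends Record<string, unknown>>(state: T, ops: PatchOp[]): T {
  const next = { ...state };
  for (const { path, value } of ops) {
    // JSON Pointer paths look like "/devices/lamp/on"
    const keys = path.split("/").filter(Boolean);
    let target: any = next;
    for (const key of keys.slice(0, -1)) {
      target[key] = { ...target[key] }; // copy-on-write down the path
      target = target[key];
    }
    target[keys[keys.length - 1]] = value;
  }
  return next;
}

// A device state change arrives as a patch, not a snapshot:
const state = { devices: { lamp: { on: false } } };
const updated = applyPatch(state, [{ op: "replace", path: "/devices/lamp/on", value: true }]);
```

Each client keeps applying these small operations to its local copy, which is why every new client first needs that handshake to seed the full state.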
I wanted to know if there was something easier and faster I could get away with for sudokurace.io. I could have built all the same mechanisms, but it felt like too much, especially given that I was striving to build the entire website and project in a weekend. That’s when I came across how Reddit built /r/place. In the original version of /r/place, the websocket was effectively only used for sending real-time notifications that there was new state; it didn’t necessarily send the state itself. Clients would receive the new-tile notification and could always send an HTTP request to get the full state. I loved this because it meant I didn’t have to think as much about how to serialize and transfer information over the websocket, and that’s how I decided I’d implement sudokurace.io.
The frontend
The frontend is a React application that I created with create-react-app and styled with TailwindUI. I had the intuition that most people would want to play sudokurace.io on mobile, which was fairly simple to support given that Tailwind classes and breakpoints are always mobile first. One of the tricky bits was figuring out how to do the sudoku board grid, though. I needed it to be responsive, but I also needed everything to always be in a square.
I’m absolutely not much of a frontend developer. I can get around, but it’s not very fluent or comfortable for me, which is why I’ve loved having Tailwind to use. Their documentation has always been great for me. My first intuition for building my square was to try flexbox. In my mind, I could make sure that every item in every row was equally spaced, do that for every row, and have a perfect square. I spent literally hours fumbling with this and it just never worked the way I wanted it to.
Eventually I came across `aspect-square` in Tailwind along with CSS grid and started messing with that, and it totally worked. I was able to create a container div with `aspect-square` and then use `grid-cols-9`. I made a grid of 81 `<p>` elements that would each house the number for that square, and I was off. I had my way of displaying the current board state.
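As a rough sketch of that structure (the real board is JSX; this hypothetical helper just renders the same Tailwind classes as an HTML string, and the cell classes are illustrative):

```typescript
// Hypothetical sketch: structurally, the board is one aspect-square
// container using grid-cols-9, with 81 <p> children holding the numbers.
function renderBoard(cells: string[]): string {
  const cellHtml = cells
    .map((value) => `<p class="flex items-center justify-center border">${value}</p>`)
    .join("");
  // aspect-square keeps the container a perfect square at any width;
  // grid-cols-9 lays the cells out in nine equal columns (nine rows
  // follow automatically from there being 81 children)
  return `<div class="grid grid-cols-9 aspect-square w-full">${cellHtml}</div>`;
}
```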
I connected to my backend via react-use-websocket and it worked perfectly. However, the real magic came from using Vercel’s `swr` library. One of the key features of `swr` for me was that the hook comes with a `mutate` function. You can simply call `mutate` with no other parameters and that triggers the library to re-fetch whatever data you bound the hook to. The initial request to the game API would look something like this:
```typescript
const { data: boardData, mutate: boardMutate } = useSWR(`/api/v1/game/${gameID}/board`, async (url) => {
  const boardResponse = await fetch(url);
  const { cells, score, gameOver } = await boardResponse.json();
  if (gameOver) {
    setShowGameOver(true);
  }
  return {
    cells,
    score,
  };
});
```
And thanks to that handy-dandy `mutate` function, and given my architecture of using the websocket just for notifications, my websocket hook instantiation looked like this:
```typescript
useWebSocket(`${window.location.protocol.startsWith("https") ? "wss" : "ws"}:${window.location.host}/notify`, {
  onMessage: (e: MessageEvent) => {
    boardMutate();
  },
  // ...Retry configuration
});
```
It was beautiful. It was simple. If the backend had new state because someone else made a move, the socket would send a message and the websocket hook on the client side would just call the `mutate` function. Originally, I had one `swr` hook for all the game data, but after implementing a few different endpoints, it became clear that re-fetching all the data over and over was too slow and laggy for gameplay. In the final version of this code, the websocket sends a keyword representing which data to update. For example, the socket might send the word `board` or maybe `score`, and I’d conditionally call one of several `mutate` functions to trigger `swr` to re-fetch just that data.
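That keyword routing is simple enough to sketch. This hypothetical helper stands in for the real `onMessage` handler, with the bound `swr` mutate functions passed in as plain callbacks:

```typescript
// Sketch of the final notification routing: the server publishes a keyword
// naming which slice of state went stale, and the client re-fetches only
// that slice. The callbacks stand in for bound swr mutate functions
// (boardMutate, scoreMutate, ...); names here are illustrative.
type Mutate = () => void;

function routeNotification(keyword: string, mutators: Record<string, Mutate>): void {
  const mutate = mutators[keyword];
  if (mutate) {
    mutate(); // e.g. "board" -> boardMutate(), "score" -> scoreMutate()
  }
  // unknown keywords are ignored rather than triggering a full re-fetch
}
```

In the websocket hook, `onMessage` would then call something along the lines of `routeNotification(e.data, { board: boardMutate, score: scoreMutate })`.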
I had another challenge to figure out, though. Could I make the backend fast enough, or would I need to make the UI optimistic? The `swr` library makes optimistic/local updates very easy, but I wanted to know if they were necessary at all, so let’s jump to how I implemented things on the backend.
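For reference, swr’s bound mutate supports this pattern through options along the lines of `optimisticData` and `rollbackOnError` (check the swr docs for the exact API). Stripped of the library, the mechanics look roughly like this standalone sketch, where every name is hypothetical:

```typescript
// Standalone sketch of optimistic-update mechanics: render the hoped-for
// value immediately, persist in the background, and roll back to the
// previous value if the server rejects the change.
async function optimisticUpdate<T>(
  current: T,
  optimistic: T,
  persist: () => Promise<T>,
  render: (value: T) => void
): Promise<T> {
  render(optimistic); // the UI updates before the network round-trip
  try {
    const confirmed = await persist();
    render(confirmed); // reconcile with what the server actually stored
    return confirmed;
  } catch {
    render(current); // roll back on failure
    return current;
  }
}
```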
The backend
I built the backend in Go with the gorilla/websocket library and the go-chi/chi router. For all database interactions, I used kyleconroy/sqlc with the jackc/pgx backend and golang-migrate/migrate for migrations, and the combo is spectacular. You write all of your SQL first, treating the SQL as the first-class citizen, and then let `sqlc` generate all of the Go code. Otherwise, almost everything else was just standard library web server stuff.
The backend became a little trickier for me. During my time at the smart home automation company, we hosted everything on Heroku. Managing the websocket was simple because it was invisible to us: we had a single dyno, so the fact that the server was stateful meant we could always rely on the necessary websocket connection being accessible in memory. My infrastructure for sudokurace sits on an EKS cluster where I build and host all of my projects. I’m an SRE type by full-time job, and I get a huge kick out of playing with the AWS and Kubernetes ecosystems. As an aside, maybe describing my infrastructure for rapid prototyping would be a good newsletter article too! But anyways, these pods may be re-shuffled, and all of my websocket connections and requests are load balanced across multiple replicas of my API.
I had to figure out how I was going to notify the right backend instance to update the right client. From the Slack talk I linked above, I knew that some people did this with a dedicated pool of servers that handled websocket connections. That wasn’t going to work for me though; it was too complex and overkill for what I wanted to do. At first I thought about writing updates to my database and having backends poll for them, but that would be needlessly complex too. Then I thought maybe I could use Redis/AWS ElastiCache, but honestly, I didn’t want to pay AWS for that at this point in time. It totally could have worked though: I’d write to a key and subscribe to it. Another idea was some kind of sticky sessions with my reverse proxy. I think this would actually work, but I didn’t like the implication that requests always had to go to the same instance, and I didn’t want to think about how that coordination would work whenever there was a new deployment or a pod was shuffled around. I did a bit more reading and wondered if maybe I should run something like Kafka or Apache Pulsar in my projects’ EKS cluster, but that seemed like way more overhead than I wanted too! I was at a standstill, until I remembered nats.io from listening to an old episode of SE Daily about it.
nats.io was perfect. It was super lightweight, and had a Helm chart that made it easy to set up an HA deployment in no time:
```hcl
resource "helm_release" "nats" {
  name       = "nats"
  namespace  = kubernetes_namespace.nats.metadata[0].name
  version    = "0.17.0"
  repository = "https://nats-io.github.io/k8s/helm/charts/"
  chart      = "nats"

  set {
    name  = "cluster.enabled"
    value = true
  }
}
```
Every time a new client joined a game, the Go code would have that instance subscribe to a NATS subject for that game’s ID. That code looked something like:
```go
func (s *Server) notify() http.HandlerFunc {
	var upgrader = websocket.Upgrader{} // use default options
	return func(w http.ResponseWriter, r *http.Request) {
		// ...logging, gameID extraction, and connection upgrade logic
		sub, err := s.nats.Subscribe(gameID, func(msg *nats.Msg) {
			log.Info().Msg("Received message from nats")
			err := ws.WriteMessage(websocket.TextMessage, msg.Data)
			if err != nil {
				log.Error().Err(err).Msg("Failed to write message to websocket")
			}
		})
		if err != nil {
			log.Error().Err(err).Msg("Failed to subscribe to nats")
			return
		}
		// drain the subscription when the handler exits so this instance
		// stops receiving messages for the game (note: the error check must
		// come first, or a failed Subscribe would leave sub nil here)
		defer sub.Drain()
		// ...more websocket boilerplate
	}
}
```
And in every endpoint that would modify game state, at the bottom of the HTTP endpoint handler logic I basically had something like:
```go
func (s *Server) addMove() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// ...logic for writing the move to the database
		err = s.nats.Publish(gameID, []byte(time.Now().Format(time.RFC3339)))
		if err != nil {
			s.log.Error().Err(err).Str("gameID", gameID).Msg("Failed to publish message to NATS")
			http.Error(w, "Failed to publish message to NATS", http.StatusInternalServerError)
			return
		}
		// ...write HTTP response
	}
}
```
This worked phenomenally for me. nats.io was incredibly simple to use and extremely quick. Now every HTTP request that modified game state could still be load balanced to any instance, and the right instance would always know when its clients needed to be updated.
The only other big decision I had to make on the backend was how to model game state in the database. I needed the frontend to be able to query for the latest score and the latest board state at any time. My first idea was a table where I’d update the board state and score on every move. That absolutely could have worked, but I had the idea that I might eventually want to track a player’s accuracy for placing cells. This doesn’t currently exist, and since there’s no concept of a logged-in user I couldn’t build it now without more work anyway, but I wanted to keep the idea available without needing too much re-work. It meant I’d need to keep track of every time a player was correct or incorrect. I could have done that by always updating some column on the player, if I had an account to update over time. What I ended up doing was basically the laziest thing I could think of: every time a client sends the request to update the state of the board, I write that move as its own row in the database, like so:
```sql
-- name: AddMove :one
INSERT INTO move (
  id,
  game_id,
  board_id,
  player_id,
  cell_index,
  value
)
VALUES ($1, $2, $3, $4, $5, $6) RETURNING *;
```
In the HTTP handler for getting the latest board state, I query for every single move and apply them, in order, on top of the original starting board state. The code looks something like:
```go
board := game.NewBoard(dbBoard.Cells, dbBoard.Solution)
for _, move := range moves {
	correct := board.SetCell(move.CellIndex, move.Value)
	if correct {
		score[playerIDToName[move.PlayerID]]++
	} else {
		// Reduce score by 1 unless it's already 0
		if score[playerIDToName[move.PlayerID]] > 0 {
			score[playerIDToName[move.PlayerID]]--
		}
	}
}
```
I do this work for every single request, and at first I was worried it would be way too much to do every time, but then I remembered just how fast computers are, and that the context of this work was a sudoku board. Chances are the number of moves will never go beyond 100, even if someone makes dozens and dozens of guesses. A sudoku board itself only has 81 squares, and many are already filled in as the starting point. It turns out I had made a good bet: even with all this extra work just to get the latest board state, the requests are handled in the tens of milliseconds. So I only ever store the starting state of the board and the moves to apply on top of it; I render the latest board state and score in memory and return that to the client.
Wrapping up
This was a great project for me. I had a lot of fun, and it was by far the least trivial project I had put together outside of work. In my mind I thought the game might blow up in popularity, so even though I didn’t go crazy with architecting it to be scalable, I’m happy with the trade-offs I made. The application tier is stateless, and the broker tier can be scaled out horizontally as well. Building something multiplayer and real-time was a lovely challenge in constraints and optimization. I knew that everything needed to be just fast enough; if things weren’t quick, the sluggishness would be frustrating and make the game feel less intense. Computers are fast, though. Implementing the simplest options ended up working beautifully and let me build this entire game incredibly quickly. By the end of the long weekend, my wife and I were playing each other in sudoku matches with no problems whatsoever!
A few things I did learn, though. Making a game feel fun is really hard. The very first working version I played with my wife was nice and all, but there was no feedback. It was hard to know when you were correct, incorrect, or just what was happening altogether. It didn’t feel like a game; it felt like a weirdly updating billboard. Many years ago I remember watching Juice it or lose it, an old but classic conference talk about polishing game UX. I didn’t go super crazy with any kind of polish, but I did realize that even a small amount of color animation could make the game feel so much better. No particle effects needed!
Lastly, it’s always a good reminder to focus on speed to delivery. I’ve already gotten a flattering and amazing amount of feedback and ideas for what sudokurace.io can become. While building it, I’d had a lot of the same ideas that were shared with me, and it almost felt embarrassing to share the original version with people. Even now, it’s so simple and bare that it’s kind of embarrassing, but all in all, I’m really happy that I put it out early. I built what I wanted to build: the ability for my wife and me to play sudoku on the same board!
Thanks.