TL;DR: I installed WSL on my gaming computer and then installed Tailscale to expose it.
Recently, a friend of mine started the Sillycon Valley newsletter, where he shares what he's learning as he gets into AI and machine learning. He's been playing with image generation and asked me to run the code on my computer. With some memory-optimization flags and by pushing more of the work onto the CPU instead of my GPUs, I was able to get images to generate, but generation on my tower took 2-3x as long as on his laptop! I knew there had to be a better way.
I have two computers. The first is an older tower that I built around 2015 to do some machine learning experiments at the time. It has two GTX 970s in SLI, each with 4GB of VRAM, which was pretty good back then; I could run a bunch of the open source models being published. Today, this is still my main computer and the one I do all my development on. I also have a pre-built gaming computer that I bought in 2021 with an RTX 3090 in it. I got that one to get back into PC gaming after a years-long hiatus. Hilariously, most of the time I'm only playing Super Smash Bros. Melee on an emulator, so I don't need such a powerful graphics card.
Recently, transformer models have been getting larger and larger, and the standard for a lot of these models is the NVIDIA A100, which has 40GB of VRAM. The larger models built by well-capitalized companies are trained on hundreds, thousands, maybe even tens of thousands of these. My 970s just can't keep up anymore, so I haven't done any ML experimenting in a while.
Since I have the 3090, I've thought about using that machine as my dev machine, but my brain really wants to keep it a Windows gaming PC. Because that's what I bought it for, that's what I want it to stay. You might tell me to just set it up with a dual boot, but I don't want to do that either, because I don't want to have to decide between the two options at every boot. For years I left it as is, and then one morning I realized that if I set up a Linux VM, I could access the 3090 the same way I'd access an EC2 instance.
I installed the Windows Subsystem for Linux with the default Ubuntu distribution, and sure enough:
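For anyone following along, the check is just this (assuming WSL2, which passes the Windows NVIDIA driver through to Linux automatically):

```
# In an elevated PowerShell on the Windows host: installs WSL2 with Ubuntu by default
wsl --install

# Then, inside the Ubuntu VM: the GPU is visible with no Linux-side driver install
nvidia-smi
```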
Without doing anything else, the WSL VM can already see the graphics card. Perfect. The next thing I did was configure an SSH server on the VM and then set up a port forward from the Windows host to the VM. By following this tutorial, I was able to SSH right from my dev computer to the VM, and thus to the graphics card.
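The tutorial covers the details, but the shape of it is an SSH server inside the VM plus a portproxy rule on Windows; the port numbers and the VM address below are my placeholders:

```
# Inside the Ubuntu VM
sudo apt install -y openssh-server
sudo service ssh start

# On the Windows host, in an elevated PowerShell: forward host port 2222 to the VM's port 22.
# The VM's IP changes across restarts, so this may need re-running (or scripting).
netsh interface portproxy add v4tov4 listenport=2222 listenaddress=0.0.0.0 connectport=22 connectaddress=<WSL VM IP>
```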
At this point, I went and yoinked the code from my good pal at Sillycon Valley and his tutorial on running Stable Diffusion for image generation. Then, to put my own flavor on the test images, I started generating some anime pictures. All told, I spent a few hours experimenting with different parameters and with memory and speed optimizations.
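His tutorial has the real code; a minimal sketch of the idea with the diffusers library, including the flavor of memory optimization I was playing with, looks like this (the model ID and prompt here are my stand-ins, not his):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load in half precision: plenty of headroom in the 3090's 24GB of VRAM
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compute attention in slices: a bit slower, but a much smaller peak memory footprint
pipe.enable_attention_slicing()

image = pipe("an anime-style portrait, studio lighting").images[0]
image.save("out.png")
```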
I also started experimenting with some of the smaller open source chatbot models, like Dolly from Databricks.
Serving the model
All it takes to serve one of these models is:

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline


class Req(BaseModel):
    text: str


app = FastAPI()

# Load Dolly once at startup; device_map="auto" puts the weights on the GPU
generate_text = pipeline(
    model="databricks/dolly-v2-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)


@app.post("/generate")
async def gen(req: Req):
    res = generate_text(req.text)
    return {"response": res[0]["generated_text"]}
```
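Assuming the file is saved as main.py (my name for it), serving it and hitting it from another machine looks like:

```
uvicorn main:app --host 0.0.0.0 --port 8000

# From the dev machine, through the SSH port forward or (later) the tailnet:
curl -X POST http://<vm-address>:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "Write a haiku about graphics cards."}'
```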
This is great. Now I can send requests from my dev machine and put my 3090 to work generating all kinds of stuff. But what if I want to expose things to the internet, or just to my projects? I don't want to expose my home network, and I don't want to do any kind of port forwarding from my router. That's where the magic of Tailscale comes in. Tailscale is a VPN service built on top of WireGuard. I installed the Tailscale client in the WSL VM, and now it's on my private tailnet.
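Installing it in the VM is the standard two-liner from Tailscale's docs; tailscale up prints a login URL to authorize the machine onto the tailnet:

```
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
```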
Next, I had to figure out how to get a Tailscale client into my Kubernetes applications so that my projects could reach the GPU privately and securely. Tailscale has documentation for exactly this. All I had to do was create an auth key, set it as a secret, and add the Tailscale image as a sidecar container.
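Here's a minimal sketch of that sidecar, based on Tailscale's Kubernetes example (the image and environment variable names come from their docs as I understood them; the app container and Secret names are mine):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: my-app:latest              # the workload that needs to reach the GPU box
    - name: ts-sidecar
      image: ghcr.io/tailscale/tailscale:latest
      env:
        - name: TS_KUBE_SECRET          # where tailscaled persists its state
          value: tailscale-state
        - name: TS_AUTHKEY              # the auth key, pulled from a Kubernetes Secret
          valueFrom:
            secretKeyRef:
              name: tailscale-auth
              key: TS_AUTHKEY
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]            # needed for the kernel TUN device
```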
It worked perfectly, without any problems at all. I could instantly send requests to my HTTP server from containers in my cluster. Perhaps the craziest part for me is that the latency from us-west-2 (Oregon) to my computer here in Colorado was only about 80 milliseconds. The internet is amazing.
Wrapping up
I did discover that auth keys in Tailscale have a maximum lifetime of 90 days, and there's no API for renewing them programmatically. I'd have to replace the key manually every 90 days unless I find a way to automate that through my browser or something.
Regardless, I now have image generation and text generation, with some limited capability, available to my experiments, all for a fraction of the price of paying a provider for every request. Since the open source, single-machine models aren't as capable, I might not change any of my projects over to them, but maybe I will. Who knows!
Thanks for reading. See y'all in the next one.