Building a personalized Ask Me Anything chat bot using HTMX and gpt-index
It actually kind of works?
TL;DR: Check out https://resumegpt.aaronbatilo.dev
Quick preamble. There are several levels of experiments here. Maybe most notably, the main contents of this post were actually recorded with the voice recorder app on my Google Pixel 7 Pro while I took my dog on a walk, and then I asked ChatGPT to help correct any punctuation and grammatical errors. I thought it would be interesting to see how well ChatGPT would do with the flaky speech-to-text. I only did some very light manual editing on the content to make up for any random things I said while on the walk!
It's Sunday, February 12th and I'm taking Chopper for a walk. Today, I was hoping to talk about my next newsletter post, as there's a lot to discuss. Recently, I came across a post in Lenny's Newsletter about training a chatbot to answer questions based on its content. I found it interesting and wondered how easy it would be to do the same. The Lenny's Newsletter chatbot used a library called GPT Index. Initially, I looked into fine-tuning my own GPT-3 model, but I didn't have enough content to train it. While some articles suggested that a couple dozen examples were enough, I didn't want to go that route. So, I thought of making my chatbot using GPT Index, but I needed content to feed into the model. I thought of writing about myself and my work history.
The first major idea that I thought of was to have a chatbot that was intended for the kinds of conversations that you have with a technical recruiter. With all of the layoffs that have been happening recently, it made me wonder if people would be interested in some tool that would help them skip that first step and try to save everyone some time.
Questions like 'Are you authorized to work in the United States?' could be answered by the chatbot. This might seem like little more than a frequently-asked-questions page, but most of my projects are experimental and don't necessarily solve a real problem, and I don't feel bad about that. So, I started writing a bunch of text about myself, my life, my history, and my work history.
Then, I installed GPT Index and ran almost exactly the same code that was used for the Lenny's Newsletter chatbot.
from gpt_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    SimpleDirectoryReader,
)
from langchain import OpenAI


def construct_index(directory_path):
    # Prompt and context sizing parameters
    max_input_size = 4096
    num_outputs = 1024
    max_chunk_overlap = 60
    chunk_size_limit = 800

    # define LLM
    llm_predictor = LLMPredictor(
        llm=OpenAI(
            temperature=0.2, model_name="text-davinci-003", max_tokens=num_outputs
        )
    )
    prompt_helper = PromptHelper(
        max_input_size,
        num_outputs,
        max_chunk_overlap,
        chunk_size_limit=chunk_size_limit,
    )

    # Load every file in the directory as a document and build the index
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(
        documents,
        llm_predictor=llm_predictor,
        prompt_helper=prompt_helper,
    )
    index.save_to_disk("index.json")


if __name__ == "__main__":
    construct_index("data")
It worked well without any additional information or configuration. That was really interesting, and I decided to understand more about what GPT Index does.
GPT Index is actually more like a search engine than anything else. It provides several ways to easily source data from your file system or Google Docs, for example, and then run that content through a language model's embeddings. So what are these embeddings? Simply put, they are numerical representations of text, and since they are numerical, we can perform mathematical operations on them. For every document you have, GPT Index will call the embeddings endpoint, pass in that document, and then store the resulting vector in a location of your choosing.
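To make that concrete, here is roughly what a raw embedding call looks like with the OpenAI Python client from that era. This is only a sketch of what GPT Index does under the hood; the specific model name here is my assumption, not something I pulled from the library's source.

import openai

# A rough sketch of the embedding call made for each document.
# Assumes OPENAI_API_KEY is set in the environment, and uses the
# pre-1.0 OpenAI client API; the model name is an assumption.
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="I have been writing Go professionally for several years.",
)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 1536 floats for this model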
I learned that GPT Index can store the vectors that represent your documents in various databases like Pinecone or Weaviate. But since I did not have a lot of documents, doing everything in memory was completely reasonable. Once I had all of these embeddings, I started to dig deeper into what happens at query time.
At query time, the index uses a similarity metric to search for the most similar content across your documents. GPT Index then has several prompts available that can be used to combine or rewrite the results found by the similarity search. For example:
DEFAULT_REFINE_PROMPT_TMPL = (
    "The original question is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)
The similarity search is why specialized vector databases are used: they are very good at parallelizing the computations needed for large-scale comparisons, because every document in your entire library needs to be compared against the question that was passed in.
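Conceptually, the brute-force version of that search is simple. Here is a back-of-the-envelope sketch with numpy, not GPT Index's actual implementation, just to show the shape of the computation:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; 0.0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec: np.ndarray, doc_vecs: list) -> int:
    # Compare the question's embedding against every document embedding
    # and return the position of the best match
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return int(np.argmax(scores))

A vector database does essentially this, but over millions of vectors, using approximate nearest-neighbor indexes instead of a linear scan.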
Now I had a working chatbot that could answer questions. The next step was to expose it through a web app so that I could share the link with people.
This brings us to experiment part two. For this project, I used a library called htmx to build the interactivity of the web page. Normally, I build websites in React and TypeScript, but a co-worker recently told me about htmx and I was curious. Since all I needed was a text box and a button, it seemed like a good opportunity to try htmx. Before we get too far into it, I have to say I think I am actually a fan. So what is htmx? htmx is a way to add interactivity to your website without writing any JavaScript for the common behaviors of a modern web app. Instead, you add a set of predefined attributes to your HTML elements.
Those attributes can do things like send HTTP requests, and your backend responds with HTML fragments rather than JSON. I still have a lot to learn about htmx, but I was pretty easily able to write my search/chatbot interface to send questions to my backend. With just a few HTML elements, I was ready to go.
<form hx-post="/query" hx-target="#results">
  <label>
    <span>Ask me your own question</span>
    <textarea name="query"></textarea>
    <button type="submit" hx-indicator="#indicator">
      Search
    </button>
  </label>
</form>
<div id="results"></div>
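The only contract htmx imposes is that whatever handles /query responds with an HTML fragment, which htmx then swaps into the #results div. My actual handler lives in my Go app, but here is a sketch of the same idea in FastAPI; the "query" field name and the canned answer are placeholders for illustration:

import html

from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse

app = FastAPI()

# Parsing Form fields requires the python-multipart package
@app.post("/query", response_class=HTMLResponse)
async def query(query: str = Form("")):
    # In my real setup, this is where the question gets forwarded to the
    # Python service that holds the index; a canned answer stands in here.
    answer = "Yes, I am authorized to work in the United States."
    # Return an HTML fragment, not JSON; htmx swaps it into #results
    return f"<p>{html.escape(answer)}</p>"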
Now, one thing to get into is how I deployed all of this. I have a set of templates and patterns that I use in a monorepo, which I laid out in a previous newsletter post. But the whole template pretty much relies on using Go.
However, GPT Index is a Python library. One thing I thought of doing was taking the implementation of GPT Index, or at least the bare minimum of what I was using, and rewriting it in Go. Maybe I will do that one day.
Edit: As of February 18, 2023, I did rewrite just the part of GPT Index that I was using in Go. Requests are on the order of 70% faster, and I've reduced the CPU and memory footprint by about 90%. Startup time is 2-3 seconds, down from 40-50 seconds, which means I can scale this service to zero, which I've talked about in the past.
But for now, at the time of writing, I just wrapped the single GPT Index library call that I used in a FastAPI application and put all of that into a Docker container. Then I could go back to using a Go-based backend to serve my HTML file and send requests from the Go application to the Python application.
from fastapi import FastAPI
from gpt_index import GPTSimpleVectorIndex

app = FastAPI()

# Load the prebuilt index once at startup
index = GPTSimpleVectorIndex.load_from_disk("index.json")


@app.get("/")
async def root(query: str = ""):
    response = index.query(query)
    if response.response is None:
        return {"error": "no response"}
    return {"response": response.response.strip()}


@app.get("/healthz")
async def healthz():
    return "ok"
Eventually, I got that all wired up and deployed onto my Kubernetes cluster that I use for hosting all of my applications. You can visit https://resumegpt.aaronbatilo.dev and ask me questions, or rather, ask a question about me.
At this point, I have a functioning MVP where users can ask questions and get answers. The next thing I decided to focus on was experimenting with ways to format my data and documents to best work with the way that GPT Index fetches answers.
Currently, I am using the SimpleDirectoryReader, where each file in a folder corresponds to a single document. The default similarity behavior is to take the single most similar document in its entirety and rewrite its contents into the response. For simple questions, that means breaking up my documents into several files that are only three to five sentences long. But there is a trade-off: the more documents I have, the more similarity comparisons need to be made to find the most similar one.
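The splitting itself is trivial. A hypothetical helper like this is all it takes to shard one big write-up into the small files that SimpleDirectoryReader picks up:

from pathlib import Path

def split_into_documents(text: str, out_dir: str, sentences_per_doc: int = 4) -> None:
    # Naive sentence splitting; good enough for hand-written paragraphs
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i in range(0, len(sentences), sentences_per_doc):
        chunk = ". ".join(sentences[i : i + sentences_per_doc])
        Path(out_dir, f"doc_{i // sentences_per_doc}.txt").write_text(chunk)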
On top of that, when GPT Index is synthesizing its response, sometimes having the additional context that surrounds the specific sentence or paragraph that answers the question can be useful in crafting a more thorough answer. So, I have to take that into consideration.
Then there is the top N for similarity, i.e., how many documents to pull out of the index when crafting the response.
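GPT Index exposes that as a query-time parameter. If I remember the API correctly, bumping it up looks like this:

# Retrieve the three most similar documents instead of only the top one
# before synthesizing the final answer
response = index.query(
    "What projects have you done with Python?",
    similarity_top_k=3,
)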
There is another trade-off to consider when splitting information across multiple documents. Some questions can only be answered by combining context that lives in separate documents, which isn't possible unless you increase the number of documents retrieved. Sharding data like work history can also be a challenge, especially when answering questions about experience with specific technologies. For instance, if a recruiter asks about projects done with Python and the similarity top N is left at its default of one, the bot would only retrieve a project description from a single job, even though Python was used in multiple projects across multiple jobs.
Now, I need to invert the way that I split my data across documents to get a thorough response. Another possible idea is to duplicate the information into multiple documents and hope that the similarity search picks the document where I've repeated information. For example, I can have documents for each job and also have documents that are specific to describing experience with a single technology, such as a text file with descriptions of projects I've done in Python. I don’t think I found the exact right balance in my time frame, but I think there’s a lot of potential.
I'm still very curious about what kinds of results I could get if I fine-tuned the general GPT-3 model. One benefit of this search-style approach is that it minimizes the chance of hallucinations, i.e., the model answering questions with information that is not true or is not in the index. This chatbot only surfaces preexisting information, so I shouldn't be too worried about it making ridiculous responses.
I shared this URL with a couple of friends, and every single one of them, including my wife, came up with questions to ask it that I did not prewrite. That isn't all that unexpected, but it was still a fun project nonetheless.
What do you think, dear reader? Do you think personalized FAQ-style chatbots are worth it? If you're searching for a job, do you think you would ever send a recruiter a link to your own resume chatbot? And if you are a recruiter, how would you feel about being given a link like this to ask all of your questions? I would love to know what you think.