TL;DR: Across 49,717 multiple-choice questions, ChatGPT was correct 66.7% of the time. Keep reading to see the breakdown by category.
About 6 years ago, I had the idea to see how well you could answer trivia questions with AI. I used word2vec and picked whichever multiple-choice answer had the highest cosine similarity to the average of the word vectors in the question. Simply put, this… didn’t work at all. The percentage of questions this method got correct was not notably better than selecting answers at random.
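To make the idea concrete, here is a minimal sketch of that kind of baseline. It is not the original script: the tiny embedding table is made up for illustration and stands in for real word2vec vectors, and the helper names are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
# Real word2vec vectors have hundreds of dimensions and come from a trained model.
TOY_VECTORS = {
    "capital": [0.9, 0.1, 0.0],
    "france": [0.8, 0.3, 0.1],
    "paris": [0.85, 0.2, 0.05],
    "banana": [0.0, 0.1, 0.9],
    "guitar": [0.1, 0.9, 0.2],
}


def sentence_vector(text):
    """Average the vectors of the words in the text that have an embedding."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def pick_answer(question, options):
    """Choose the option whose vector is most similar to the question vector."""
    q = sentence_vector(question)
    return max(options, key=lambda o: cosine_similarity(q, sentence_vector(o)))


best = pick_answer("What is the capital of France?", ["Banana", "Paris", "Guitar"])
print(best)
```

The toy vectors are rigged so the right answer wins. With real embeddings, the averaged question vector mostly captures the topic of the question rather than its answer, which is one plausible reason the approach did no better than chance.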
When the word2vec paper first came out, it was revolutionary that we could do analogies for the semantic meanings of words. “King - Man + Woman = Queen”. I was floored. The original paper came out in 2013, and here we are in 2023, a whopping 10 years and countless generations of advancement in machine learning, comprehension, understanding, etc. later. In June of 2017, we were blessed with the “Attention is all you need” paper, which introduced the transformer architecture, the step change that brought us this revolution in natural language processing. So, what’s different now?
In late 2022, OpenAI released ChatGPT, which, at the heart of it all, is a brilliant UX that someone came up with for interacting with GPT-3.5. The original GPT-3 was made available in private beta in the summer of 2020, and while there was a playground UI, using GPT-3 was still fairly intimidating for the everyday person. The playground featured all the knobs and parameters for tuning GPT-3’s responses, and that turned many people off from using it. ChatGPT, on the other hand, was… a chat bot. It looked no different than talking to someone in Slack or Microsoft Teams. Anyone could sign up, there was no wait list, and there were no scary knobs. ChatGPT gained 1 million users in only 5 days. For the rest of this article, I will refer to the model as ChatGPT, even though that’s not perfectly accurate since ChatGPT is a fine-tuned version of GPT-3.5.
So, how smart is it? Six years after my original experiment, I wanted to find out what some of the most impressive machine learning we’ve seen to date is capable of. We’ve seen the creative works of ChatGPT. I even made a fun website for my wife and me to use: https://catstories.ai. ChatGPT is amazing at creative tasks, but for tasks that involve more knowledge, opinions are significantly more split. Some people are convinced that ChatGPT, with its intuitive interface, is going to replace traditional search engines like Google. Microsoft is integrating ChatGPT directly into its search engine, along with other products.
ChatGPT is taught everything it knows by reading vast amounts of the internet, and it uses that information when it generates text. For the non-technical, you can think of it like your phone’s autocomplete, but way smarter. If you’re clever about how you prompt ChatGPT, you can get it to do revolutionary things and test its knowledge. For example, ChatGPT was able to pass the Bar exam. You can have it write marketing copy, or write emails for you. But if we’re going to use it as a source of truth to replace things like search engines, how does it do with general knowledge?
How did I test it?
The uberspot/OpenTriviaQA dataset is a Creative Commons dataset of multiple-choice questions. The questions are split into 22 categories, such as animals, entertainment, history, movies, and sports. Most of the questions have 4 possible answers, but some are true-or-false questions.
Questions come in the following format:
#Q What is the name of Rocky Balboas restaurant?
^ Adrians
A Rockys
B Mickeys
C Adrians
D Apollos
First, you have the question itself. Then you have the correct answer. Lastly, you have the answer options, one on each line.
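As a rough illustration, the format above can be parsed with a few lines of code. This is a minimal sketch, not the script used for the experiment; it assumes every question and every option fits on a single line, and the function name and dictionary keys are hypothetical.

```python
def parse_questions(text):
    """Parse '#Q question / ^ correct answer / A..D options' blocks into dicts."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("#Q "):
            # A new question starts; later lines attach to it.
            current = {"question": line[3:].strip(), "answer": None, "options": []}
            questions.append(current)
        elif current is None or not line:
            # Skip blank lines and anything before the first question.
            continue
        elif line.startswith("^ "):
            current["answer"] = line[2:].strip()
        elif len(line) > 2 and line[0] in "ABCD" and line[1] == " ":
            current["options"].append(line[2:].strip())
    return questions


SAMPLE = """#Q What is the name of Rocky Balboas restaurant?
^ Adrians
A Rockys
B Mickeys
C Adrians
D Apollos
"""

parsed = parse_questions(SAMPLE)
print(parsed[0])
```

True-or-false questions come out the same way, just with two options instead of four.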
My script would download each of the categories and ask ChatGPT each and every question. For the above question, I would format my prompt like so:
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer and nothing else.
Question:
What is the name of Rocky Balboas restaurant?
Possible answers:
Rockys
Mickeys
Adrians
Apollos
Your answer:
I would then send this to ChatGPT and compare its answer with the correct answer.
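A minimal sketch of the prompt formatting and the comparison step might look like the following. The API call itself is left out on purpose. The normalization in `is_correct` (ignoring case, surrounding whitespace, and a trailing period) is an assumption for illustration, not necessarily how the original script graded answers, and both function names are hypothetical.

```python
PROMPT_HEADER = (
    "I am a highly intelligent multiple choice trivia bot. "
    "You are given a multiple choice question. "
    "You must choose the correct answer from one of answers. "
    "Only include the answer and nothing else."
)


def build_prompt(question, options):
    """Lay out the header, the question, and the options, one item per line."""
    lines = [PROMPT_HEADER, "Question:", question, "Possible answers:"]
    lines.extend(options)
    lines.append("Your answer:")
    return "\n".join(lines)


def is_correct(model_reply, correct_answer):
    """Assumed grading rule: match after trimming whitespace, a trailing period, and case."""
    def normalize(s):
        return s.strip().rstrip(".").strip().lower()
    return normalize(model_reply) == normalize(correct_answer)


prompt = build_prompt(
    "What is the name of Rocky Balboas restaurant?",
    ["Rockys", "Mickeys", "Adrians", "Apollos"],
)
print(prompt)
print(is_correct(" adrians.\n", "Adrians"))
```

A stricter or looser comparison would shift the reported accuracy slightly, so the grading rule is worth keeping in mind when reading the numbers.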
Soooooo, how’d it do?
Here’s the raw data, sorted from least to most correct:
Category         Correct   Total   Fraction correct
brain-teasers        103     207   0.497585
video-games          310     599   0.517529
television          2911    5230   0.556597
entertainment        163     280   0.582143
animals              815    1366   0.596632
celebrities         1909    3196   0.597309
sports              1728    2840   0.608451
movies              2647    4314   0.613584
for-kids             485     759   0.638999
music               3746    5579   0.671447
literature           888    1288   0.689441
hobbies              867    1242   0.698068
general             2306    3290   0.700912
newest              2117    3016   0.701923
people              1974    2743   0.71965
technology          1820    2486   0.7321
world               3571    4875   0.732513
religion-faith       469     638   0.73511
history             1228    1645   0.746505
rated               1640    2185   0.750572
humanities           831    1097   0.75752
geography            652     842   0.774347
overall            33180   49717   0.667377
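The overall figure follows directly from the per-category counts. Here is a small sketch that recomputes it from the numbers in the table; the dictionary layout and variable names are just one illustrative way to organize the data.

```python
# (correct, total) per category, copied from the table above.
RESULTS = {
    "brain-teasers": (103, 207),
    "video-games": (310, 599),
    "television": (2911, 5230),
    "entertainment": (163, 280),
    "animals": (815, 1366),
    "celebrities": (1909, 3196),
    "sports": (1728, 2840),
    "movies": (2647, 4314),
    "for-kids": (485, 759),
    "music": (3746, 5579),
    "literature": (888, 1288),
    "hobbies": (867, 1242),
    "general": (2306, 3290),
    "newest": (2117, 3016),
    "people": (1974, 2743),
    "technology": (1820, 2486),
    "world": (3571, 4875),
    "religion-faith": (469, 638),
    "history": (1228, 1645),
    "rated": (1640, 2185),
    "humanities": (831, 1097),
    "geography": (652, 842),
}


def accuracy(correct, total):
    """Fraction of questions answered correctly."""
    return correct / total


total_correct = sum(correct for correct, _ in RESULTS.values())
total_asked = sum(total for _, total in RESULTS.values())
overall = accuracy(total_correct, total_asked)

# Categories ordered from least to most accurate.
ranked = sorted(RESULTS, key=lambda name: accuracy(*RESULTS[name]))

print(f"{total_correct}/{total_asked} = {overall:.6f}")
print("lowest:", ranked[0], "highest:", ranked[-1])
```

For context, random guessing on a four-option question would land at 25%, and at 50% on a true-or-false question.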
It’s really interesting to see how these categories break down. Notably, ChatGPT did not do well with the brain-teasers category. Let’s look at a few examples of these questions.
#Q The name of what is coded in the following: 86,smopuim?
A it does not mean anything
B the name of a software product
C a NASA aircraft
D a secret CIA project
This one took me a while to figure out, but ChatGPT got it correct. The answer is B. Read upside down, those characters say “windows,98”. When I was talking to a friend of mine over at Fearless AWS, he suggested that I try to get ChatGPT to explain itself. So I did that.
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
The name of what is coded in the following: 86,smopuim?
Possible answers:
it does not mean anything
the name of a software product
a NASA aircraft
a secret CIA project
Your answer:
the name of a software product
The numbers 86 and smopuim are likely a code for a software product, as they are not a recognizable acronym or phrase.
At first, I was impressed that ChatGPT was able to figure this out, but judging by the explanation, it seems like ChatGPT just guessed.
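For the curious, the upside-down reading of the puzzle can be checked mechanically. The sketch below reverses the string and swaps each character for the one it resembles when rotated 180 degrees. The character table is hand-made for this puzzle only and is an illustration, not part of the original experiment.

```python
# Each character mapped to the one it looks like when turned upside down.
# Only the characters that appear in this puzzle are included.
ROTATED = {
    "8": "8", "6": "9", ",": ",", "s": "s", "m": "w",
    "o": "o", "p": "d", "u": "n", "i": "i",
}


def rotate_180(text):
    """Read the text upside down: reverse it and rotate each character."""
    return "".join(ROTATED[ch] for ch in reversed(text))


decoded = rotate_180("86,smopuim")
print(decoded)
```

Nothing in ChatGPT’s explanation hints at this kind of decoding, which supports the reading that the correct answer was a guess.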
Let’s look at an example where ChatGPT was incorrect.
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
What number multiplied by 10 equals the square of the same number, times 5?
Possible answers:
4
5
2
10
Your answer:
5
Explanation: 5 multiplied by 10 is equal to 50, which is the square of 5, times 5.
Here’s a great demonstration of how ChatGPT can be pretty dumb. It got the arithmetic completely wrong. Just because it generates fluent text doesn’t mean that it has correctly encoded arithmetic. The correct answer here is 2: 2 squared is 4, times 5 is 20, and 2 times 10 is also 20.
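The arithmetic is easy to verify by brute force over the four options. This quick check is only an illustration of the reasoning; it tests the condition “x times 10 equals x squared times 5” for each candidate.

```python
candidates = [4, 5, 2, 10]

# Keep only the candidates where x * 10 equals (x squared) * 5.
solutions = [x for x in candidates if x * 10 == 5 * x ** 2]

# ChatGPT's pick, for comparison: 5 * 10 versus 5 squared times 5.
chatgpt_left = 5 * 10
chatgpt_right = 5 * 5 ** 2

print(solutions)
print(chatgpt_left, chatgpt_right)
```

For ChatGPT’s answer of 5, the two sides come out to 50 and 125, so its own explanation does not hold up.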
Let’s look at another example where ChatGPT was correct according to the data set, but the explanation doesn’t quite add up:
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
A man is carrying three croquet balls across a bridge. The bridge has a sign that says bridge can only hold 200 pounds. The man weighs 195 pounds and the croquet balls weigh 2 pounds each. The man ends up crossing the bridge in one trip and no one else helps him. This is possible.
Possible answers:
False
True
Your answer:
True
The man is carrying three croquet balls that weigh 2 pounds each, for a total of 6 pounds. This means that the man and the croquet balls combined weigh 201 pounds, which is less than the maximum weight of 200 pounds that the bridge can hold. Therefore, the man is able to cross the bridge in one trip.
So, this is a really interesting one. The data set doesn’t have an explanation for any of these brain teasers but it lists the correct answer as “true”. If you do some googling, you’ll find that the answer of the riddle is supposed to be “true” because the man can juggle the balls. Ignoring the actual physics of such a situation, the ChatGPT answer, once again, gets the arithmetic completely incorrect in saying that 201 pounds is less than the maximum weight of 200 pounds.
Let’s look at one final example that touches on something in the NLP field called the Winograd schema, which, to oversimplify, tests whether a model can work out which actor an ambiguous pronoun in a sentence refers to.
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
If the mother of a boy spanks him and then his father comes and does the same, who hurts the most?
Possible answers:
The boys brother
The mother
The boy
The father
Your answer:
The boy
Explanation: The boy is the one being spanked, so he would experience the most pain.
In this answer, ChatGPT explains that the boy is the one being spanked. So even though there might be some ambiguity about whether “him” and “his” refer to the boy or the father, ChatGPT resolves them correctly from context.
These brain teasers require logical conclusions that it seems like ChatGPT doesn’t really reason through. It makes sense that this is the category that ChatGPT failed on most, but there are also some impressive answers. That being said, even the explanation ChatGPT gives when asked to explain itself is a generated response. There’s nothing to say that this is legitimately how ChatGPT is thinking about the problems.
Let’s look at another category, literature, which presumably ChatGPT should be good at because there should be plenty of text and references to learn about the literature.
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
In what book of the Harry Potter series do Harry and Ginny kiss for the first time?
Possible answers:
Harry Potter and the Order of the Phoenix
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Deathly Hallows
Harry Potter and the Half Blood Prince
Your answer:
Harry Potter and the Half Blood Prince
Harry and Ginny kiss for the first time in Harry Potter and the Half Blood Prince, when Harry visits Ginny in the Gryffindor common room.
In this case, ChatGPT is correct. Harry Potter is popular enough, with a large fandom, that it makes sense that this is probably talked about a lot.
I am a highly intelligent multiple choice trivia bot. You are given a multiple choice question. You must choose the correct answer from one of answers. Only include the answer on the first line. On the next line, explain your answer.
Question:
Harry Potter marries Hermione in the last book of the series.
Possible answers:
True
False
Your answer:
True
Explanation: In the epilogue of the final book, Harry and Hermione are married and have two children.
Another really interesting case here. Harry marries Ginny, not Hermione, and they have three kids, not two. Trying to think through how this might have happened, I wonder if internet comments and maybe fan fiction in the training set contained enough content that preferred Harry and Hermione to be together. The implication for ChatGPT’s greater knowledge base is that you could influence what it deems to be true if there’s enough of an *opinion* in its training data.
Wrapping up
Like I said, there are almost 50,000 questions. Digging into every single one of them is not practical for a single post, but it is really interesting to try to intuit why ChatGPT is better or worse at various categories. In categories like brain-teasers, there’s much more logic that needs to be applied and combined to answer some of the questions. In other low-scoring categories like television and video games, my guess is that because the content itself is not in an indexable / consumable format, ChatGPT doesn’t have as much information about it. ChatGPT hasn’t played the games themselves or watched the television shows. There might be third-party content about what happened in the shows, but the contents of the shows would only be mentioned in passing.
In the high-scoring categories, the subject matter tends to have been documented in text long before we had audio, video, etc. The geography and history categories cover content that existed long before we had the technology and storage to regularly keep things like television content.
All that being said, roughly 67% correctness is still pretty impressive given such a vast selection of topics. But as we learned, sometimes an answer is scored as correct even though the explanation behind it is not. I don’t think ChatGPT or any large language model that we have right now is the strong AI or general AI that some articles want you to believe it is. I don’t think any large language models that are generative in nature should replace traditional search engines that surface high quality information. There are search engines like you.com that are exploring the intersection, but it’s still early days for this technology!
Between development and doing the final runs, this experiment cost me about $100 to hit the GPT-3 API for all the questions. I’d be forever grateful if you’d consider buying me a coffee. If a company would be willing to sponsor the work, it’d be interesting to try different prompts, randomize the order of the answers, run the tests multiple times, etc. I’d also love to run the same test over the full Jeopardy question set, but given how expensive this test was, running it over a question set almost 10x as large would be too much for me to do right now.
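As a sketch of what the answer-order experiment could look like, the snippet below produces several shuffled orderings of the same options so that each ordering can be sent as its own prompt. None of this was run as part of the experiment; the function name, the number of runs, and the fixed seed are all assumptions for illustration.

```python
import random


def shuffled_variants(options, runs, seed=0):
    """Return `runs` independently shuffled copies of the answer options."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    variants = []
    for _ in range(runs):
        order = list(options)  # copy, so the original order is untouched
        rng.shuffle(order)
        variants.append(order)
    return variants


options = ["Rockys", "Mickeys", "Adrians", "Apollos"]
variants = shuffled_variants(options, runs=3)
for order in variants:
    print(order)
```

If the model’s answer changes with the ordering, that would suggest it is sensitive to answer position rather than drawing purely on what it knows.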
As always, the code that I wrote for this is all available. Check out this post’s code, including the answers of ChatGPT, by clicking here.