This time I made GPT-4 answer 50,000 trivia questions. It's so much better
This new model is surprisingly improved
TL;DR: Across 49,717 multiple-choice questions, GPT-4 was correct 84.4% of the time. Keep reading to see the breakdown by category.
In January, I asked GPT-3.5 49,717 trivia questions from the uberspot/OpenTriviaQA data set. I learned a few things and there have been a few new models since then too.
Let’s get a few corrections in place. In that last post, I claimed that ChatGPT was the same as the GPT-3.5 model I had API access to. It turns out that this isn’t entirely true. ChatGPT was running on a model that is now known as GPT-3.5-Turbo. Not only was that model fine-tuned for chat-bot-style interactions, but the model itself was optimized and presumably quantized. The API costs for the turbo model were about 1/10th the cost of GPT-3.5.
And yes, I was in the middle of running this same test on that model. When I made my original post, someone was kind enough to point out that models tend to answer more accurately when you ask them to generate an explanation BEFORE giving an answer. This has been called chain-of-thought prompting. So I was inspired to try both GPT-3.5 and GPT-3.5-Turbo with this technique to see how the results would change. I didn’t actually finish any more tests with GPT-3.5 because the results after the first 10-15 thousand questions were almost identical, even with the chain-of-thought prompt. GPT-3.5-Turbo, a.k.a. the confirmed ChatGPT model, did in fact see some improvement, but not much. I won’t share the breakdown this time around, but the overall percentage of correct answers went from 66.7% up to 70.1%. Considering the model was 1/10th the cost to operate, it was awesome to see any improvement in correctness at all.
When the GPT-4 release announcement was made, I was quick to request access to the API. I was granted access just a few days later and immediately started sending all of the same trivia questions. This time I put a little more work into finding a prompt that showed promise, and I included a single example of how I wanted the model to explain its thinking.
Both GPT-3.5-Turbo and GPT-4 use the chat API format that was released alongside GPT-3.5-Turbo. With this format you don’t just send a single prompt anymore; you format your messages as a chat history with back and forth between two entities. The actual prompt I used for this GPT-4 test looks as follows:
Messages: []openai.ChatCompletionMessage{
    {
        Role:    "system",
        Content: `You are a knowledge assistant. I will ask you a multiple choice question and you will answer it.`,
    },
    {
        Role: "user",
        Content: `You will be given a question and then you will be presented with possible
answers to choose from. If you're not sure of the answer, make your
best guess and pick one of the answers.
Follow these instructions:
- Think out loud, step by step
- Insert two blank lines to separate your answer from your explanation
- Write one of your answers and write it exactly character for character as it appears in the list of possible answers
`,
    },
    {
        Role: "user",
        Content: `Question:
What language can Harry Potter speak?
Possible answers:
Goblin
English
Mermish
Parseltounge
`,
    },
    {
        Role: "assistant",
        Content: `Thinking out loud:
Harry never has a need to speak goblin, so I can eliminate that answer. English
is the known language that he speaks which leaves mermish and parseltounge.
Mermish would be the language that merpeople speak, so I can eliminate that
answer. That leaves parseltounge as the correct answer.
Answer:
Parseltounge`,
    },
    {
        Role:    "user",
        Content: questionTemplate,
    },
},
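For completeness, sending a request built from that Messages slice looks roughly like this, assuming the sashabaranov/go-openai client (which is where a type like openai.ChatCompletionMessage comes from). The hard-coded key and the placeholder questionTemplate here are simplified illustrations, not the exact code from my test harness:

package main

import (
    "context"
    "fmt"

    openai "github.com/sashabaranov/go-openai"
)

func main() {
    // Placeholder key and question; the real test loops over the whole data set.
    client := openai.NewClient("your-api-key")
    questionTemplate := "Question:\n...\nPossible answers:\n..."

    resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
        Model: openai.GPT4,
        Messages: []openai.ChatCompletionMessage{
            // ...the system / user / assistant example messages shown above...
            {Role: "user", Content: questionTemplate},
        },
    })
    if err != nil {
        panic(err)
    }

    // The reply contains the "Thinking out loud:" explanation followed by the "Answer:" line.
    fmt.Println(resp.Choices[0].Message.Content)
}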
How were the results?
Category          GPT-4 correct   Total    GPT-3.5 acc.   GPT-4 acc.   Delta
brain-teasers     154             207      0.497585       0.743961     49.51%
video-games       415             599      0.517529       0.692821     33.87%
television        4033            5230     0.556597       0.771128     38.54%
entertainment     224             280      0.582143       0.800000     37.42%
animals           1083            1366     0.596632       0.792826     32.88%
celebrities       2577            3196     0.597309       0.806320     34.99%
sports            2331            2840     0.608451       0.820775     34.90%
movies            3484            4314     0.613584       0.807603     31.62%
for-kids          648             759      0.638999       0.853755     33.61%
music             4677            5579     0.671447       0.838322     24.85%
literature        1126            1288     0.689441       0.874224     26.80%
hobbies           1059            1242     0.698068       0.852657     22.15%
general           2854            3290     0.700912       0.867477     23.76%
newest            2590            3016     0.701923       0.858753     22.34%
people            2457            2743     0.719650       0.895735     24.47%
technology        2160            2486     0.732100       0.868866     18.68%
world             4322            4875     0.732513       0.886564     21.03%
religion-faith    579             638      0.735110       0.907524     23.45%
history           1481            1645     0.746505       0.900304     20.60%
rated             1987            2185     0.750572       0.909382     21.16%
humanities        965             1097     0.757520       0.879672     16.13%
geography         771             842      0.774347       0.915677     18.25%
overall           41977           49717    0.667377       0.844319     26.51%
Improvements in every category, and pretty big ones at that. In the table above, the categories are still ordered by how well GPT-3.5 did, and the delta column is the relative improvement from GPT-3.5 to GPT-4, but you can see some pretty amazing numbers here. The brain-teasers category improved by almost 50%. The geography category went all the way up to 91.6%. GPT-4 is, by every measure in this test, “smarter”.
How did the explanations do?
Let’s look at the same questions as last time, but now with the explanations coming BEFORE an answer is picked.
Question:
The name of what is coded in the following: 86,smopuim?
Possible answers:
it does not mean anything
the name of a software product
a NASA aircraft
a secret CIA project
Thinking out loud:
First, I will try to find any pattern or code within the given sequence "86,smopuim". If the number 86 is used as a key for Caesar cipher, I can try to decrypt the part "smopuim". However, even after attempting that, I do not find any meaningful name or phrase.
Answer:
it does not mean anything
The last time we asked this question, GPT-3.5 got the answer correct, but the explanation we got was basically a guess that the number 86 would only be used in a software product.
The explanation here actually feels more consistent, given that the model isn’t capable of reading text upside down. Reading text upside down would probably need to show up far more often in the training data before the model would even consider it.
Question:
What number multiplied by 10 equals the square of the same number, times 5?
Possible answers:
4
5
2
10
Thinking out loud:
Let x be the number. Then we're looking for x such that 10x = x^2 * 5. Simplifying the equation, we get 2x = x^2. Now let's check the given numbers.
- If x = 4, we have 2*4 = 8 and 4^2 = 16. This doesn't match.
- If x = 5, we have 2*5 = 10 and 5^2 = 25. This doesn't match either.
- If x = 2, we have 2*2 = 4 and 2^2 = 4. This is a match.
- If x = 10, we have 2*10 = 20 and 10^2 = 100. This doesn't match.
Answer:
2
Last time, this question was answered incorrectly, but this time GPT-4 actually appears to have done the math to solve the equation. I think that this example is actually insane. The model took the word description of an equation, converted it to algebra, SIMPLIFIED IT, and then plugged each of the possible answers into it.
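Just to double-check the arithmetic myself, a throwaway snippet (not part of the experiment) that plugs each candidate answer into 10x = 5x² confirms that only 2 works:

package main

import "fmt"

func main() {
    // Check which of the possible answers satisfies 10x = 5x^2 (equivalently 2x = x^2).
    for _, x := range []int{4, 5, 2, 10} {
        if 10*x == 5*x*x {
            fmt.Println("works:", x) // only prints 2
        }
    }
}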
Question:
A man is carrying three croquet balls across a bridge. The bridge has a sign that says bridge can only hold 200 pounds. The man weighs 195 pounds and the croquet balls weigh 2 pounds each. The man ends up crossing the bridge in one trip and no one else helps him. This is possible.
Possible answers:
False
True
Thinking out loud:
The man weighs 195 pounds, and each croquet ball weighs 2 pounds. He is carrying three croquet balls, which would be a total of 6 pounds added.
We calculate the total weight like this : 195 pounds (man) + 6 pounds (croquet balls) = 201 pounds. Since the bridge can only handle 200 pounds, and the total weight is 201 pounds, it should not be possible.
Answer:
False
In the last test, I noted that this one is weird because this riddle has been disproven by people with wrinkly brains who know how physics works. The data set says that the answer here is True. Googling this riddle turns up solutions like “just throw the balls across the bridge” or “just juggle the balls and then you won’t have all 6 pounds of force on the bridge at once”. GPT-3.5 guessed True but asserted that 201 pounds is less than 200 pounds, which is obviously incorrect. GPT-4 got all of the numbers right and answered False, which is real-world correct but data-set incorrect. I’m going to leave the data set as is, but I’m glad to see that GPT-4 knows that 201 pounds is heavier than 200 pounds.
Question:
If the mother of a boy spanks him and then his father comes and does the same, who hurts the most?
Possible answers:
The boys brother
The mother
The boy
The father
Thinking out loud:
In this situation, the boy is receiving the spanking from both his mother and father. The boy's brother does not have any involvement in the situation, so he wouldn't be hurt. The mother and father are administering the punishment, so they are not the ones receiving the pain. Therefore, the most reasonable answer would be the one who is actually experiencing the physical pain.
Answer:
The boy
GPT-3.5 and GPT-4 answered this question with very similar explanations. That the model understands the context of the actions being taken in the question is still amazing to me.
Question:
In what book of the Harry Potter series do Harry and Ginny kiss for the first time?
Possible answers:
Harry Potter and the Order of the Phoenix
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Deathly Hallows
Harry Potter and the Half Blood Prince
Thinking out loud:
Harry and Ginny don't share their first kiss in the first few books, so I can eliminate the Harry Potter and the Prisoner of Azkaban. Harry Potter and the Deathly Hallows has too many other big events happening, which makes it less likely. It's either Harry Potter and the Order of the Phoenix or Harry Potter and the Half Blood Prince since their relationship develops more in those books. Harry starts to show his feelings for Ginny in the Order of the Phoenix, but I remember that their first kiss happens in the Half Blood Prince when they're celebrating a Quidditch victory.
Answer:
Harry Potter and the Half Blood Prince
Another one that GPT-3.5 and GPT-4 nailed. Not much else to talk about for this one.
Question:
Harry Potter marries Hermione in the last book of the series.
Possible answers:
True
False
Thinking out loud:
Harry Potter and Hermione Granger are close friends in the series, but they do not have a romantic relationship. Harry ends up marrying Ginny Weasley, and Hermione marries Ron Weasley. So, claiming that Harry marries Hermione in the last book is incorrect.
Answer:
False
Last time, GPT-3.5 got this incorrect and I posited that maybe fan fiction or general internet sentiment is what influenced the model to think that Harry Potter and Hermione were married. GPT-4 knew differently this time.
Wrapping up
The patterns that we saw in the last test hold up. Categories with long-standing, well-known information and facts still perform the best, but GPT-4’s ability to “reason” through a problem appears to have greatly improved, according to the results of the brain-teasers category. Rumor has it that GPT-4 is actually a trillion parameter model, as opposed to GPT-3/3.5, which has been publicly confirmed to be a 175 billion parameter model. The turbo variant hasn’t had its parameter count confirmed, but it’s safe to assume it’s smaller than GPT-3.5 given how much cheaper the API pricing is.
So what’s changed? Well, to be very clear, I’m purely speculating here, but if GPT-4 truly is a 1 trillion parameter model, then it’s possible that the larger model can simply “remember” facts more accurately. Or maybe GPT-4 isn’t actually a trillion parameter model and is instead a similar size, but trained in a way that produces more “correct” results.
Honestly, I have no idea what’s happening, but it’s really interesting to see that this model, which is practically just a more sophisticated autocomplete, can actually “remember” such a wide variety of content. Remember, this isn’t a search engine that’s trying to find the answer. This is a model generating text one token at a time, and those tokens just happen to come out looking like very practical and accurate human reasoning.
We’ll all just have to stay tuned to see what else comes out of this.
Just like last time, I paid for these API requests out of pocket. I sent tens of thousands of requests to the GPT-3.5, GPT-3.5-Turbo, and GPT-4 models; all of those requests cost me about $759. I’d be forever grateful if you’d consider buying me a coffee.