TL;DR: https://huggingface.co/datasets/abatilo/myanimelist-embeddings
I’m writing this while still in between jobs. I wanted to rest during the break, but I also wanted something to keep me from getting too restless. So, I embedded some anime text!
So, what is a text embedding? Simply put, machine learning doesn’t do that well with raw text. Machine learning and AI techniques usually involve multiplying matrices and whatnot. Math stuff. A text embedding represents a piece of text as a large array of numbers, so that we can do math stuff to it.
At the end of my last post, I did “announce” (kek, to my ~100 readers) that I was going to be starting a job at Cohere AI. To be very explicit again: when I did this work, I had not started the job yet. This was my own work.
Anyways, Cohere has an embeddings endpoint. And a few weeks ago, Cohere released a big data set of embeddings of all of Wikipedia. While on my break, I’ve been watching a lot of anime to pass the time, and I wondered what kind of experiments I could do revolving around that. And so I scraped the synopsis details of every anime I could get my hands on, and then pushed those through the embeddings endpoint on the Cohere API.
Scraping MAL
I use MyAnimeList.net to track the anime that I watch, and they also have synopses written up for all the anime they keep track of. I was stoked to find out that they even have an official API. I thought this was going to be a breeze, until I realized that they don’t have a batch or bulk list endpoint. You have to query each anime ID one at a time. Not only that, they don’t expose a way for you to know how many anime there are. On top of that, they ask you to wait 500ms to 1000ms between requests, and they don’t have any formal way to actually rate limit you, so it’s more of an honor system.
The API feels a lot like a GraphQL-style API, but it isn’t one. If you make a request for a single anime, you get a few fields by default, and you have to ask for additional fields with a query parameter. At first, I tried requesting every single field, with a for loop going from 1 to 100,000 and a delay of 500ms between requests. I hit run, and a minute or two later I got a message from a friend who knew I was doing this project, pointing out that MAL had stopped loading. I checked my script and saw dozens of timeouts. Sorry, MyAnimeList! I don’t know this for sure, but I’m going to guess that requesting every field has some kind of multiplicative effect on database queries. So I cut the requested fields down to just the synopsis and alternative titles, and slowed down to 1 request per second.
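// IDs start at 1; plenty of IDs in this range simply return a 404.
for i := 1; i <= 100000; i++ {
	// Honor-system rate limit: one request per second.
	time.Sleep(1 * time.Second)
	// Only ask for the fields that are actually needed.
	anime, resp, err := c.Anime.Details(ctx, i,
		mal.Fields{
			"alternative_titles",
			"synopsis",
		},
	)
	// ...the rest of it...
}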
The very last ID that I got back was 55,254 but there were only 20,051 actual valid entries. Many of the IDs ended up returning a 404.
Making the embeddings
The Cohere API docs mention that, for best performance, you should use text snippets of no more than 512 tokens. I measured, and something like 20 synopses had more than 512 tokens. You can choose to truncate those at the API level, but I decided to just go for it. The other nice thing about the API is that you can send embed requests in batches. So that’s what I did. I sent 20,051 synopses, 96 at a time. A few minutes and $20 later, I had the embeddings for all of the anime synopses!
What do I do with the embeddings? Originally, I had the idea of trying some kind of recommendation or natural language search engine over the data set, but I “ran out of time” in the sense that I ran out of attention span. Maybe I’ll come back to it. However, I did yoink the simplified nearest neighbor code from the Cohere blog post that inspired this and tried a few examples.
What do you want to see?: a pokemon trainer wants to be the very best
Pokemon
Pokémon are peculiar creatures with a vast array of different abilities and appearances; many people, known as Pokémon trainers, capture and train them, often with the intent of battling others. Young Satoshi has not only dreamed of becoming a Pokémon trainer but also a "Pokémon Master," and on the arrival of his 10th birthday, he finally has a chance to make that dream a reality. Unfortunately for him, all three Pokémon available to beginning trainers have already been claimed and only Pikachu, a rebellious Electric-type Pokémon, remains. However, this chance encounter would mark the start of a lifelong friendship and an epic adventure!
Setting off on a journey to become the very best, Satoshi and Pikachu travel across beautiful, sprawling regions with their friends Kasumi, a Water-type trainer, and Takeshi, a Rock-type trainer. But danger lurks around every corner. The infamous Team Rocket is always nearby, seeking to steal powerful Pokémon through nefarious schemes. It'll be up to Satoshi and his friends to thwart their efforts as he also strives to earn the eight Pokémon Gym Badges he'll need to challenge the Pokémon League, and eventually claim the title of Pokémon Master.
[Written by MAL Rewrite]
Pokemon Best Wishes!
As with both the Advanced Generation and Diamond & Pearl series before it, the Best Wishes! series begins with only Satoshi, headed off to the Isshu region, located far away from Kanto, Johto, Houen, and Sinnoh, with his Pikachu. After he meets up with the new trainer and rival Shooty and the region's Professor Araragi, he gains traveling companions in Iris, a girl from a town known for its Dragon Pokémon, and Dent, Pokémon Connoisseur and the Grass Pokémon specialist of the three Sanyou City Gym Leaders.
Pokemon Sun & Moon
After his mother wins a free trip to the islands, Pokémon trainer Satoshi and his partner Pikachu head for Melemele Island of the beautiful Alola region, which is filled with lots of new Pokémon and even variations of familiar faces. Eager to explore the island, Satoshi and Pikachu run wild with excitement, quickly losing their way while chasing after a Pokémon. The pair eventually stumbles upon the Pokémon School, an institution where students come to learn more about these fascinating creatures.
At the school, when he and one of the students—the no-nonsense Kaki—have a run-in with the nefarious thugs of Team Skull, Satoshi discovers the overwhelming might of the Z-Moves, powerful attacks originating from the Alola region that require the trainer and Pokémon to be in sync. Later that night, he and Pikachu have an encounter with the guardian deity Pokémon of Melemele Island, the mysterious Kapu Kokeko. The Pokémon of legend bestows upon them a Z-Ring, a necessary tool in using the Z-Moves. Dazzled by their earlier battle and now in possession of a Z-Ring, Satoshi and Pikachu decide to stay behind in the Alola Region to learn and master the strength of these powerful new attacks.
Enrolling in the Pokémon School, Satoshi is joined by classmates such as Lillie, who loves Pokémon but cannot bring herself to touch them, Kaki, and many others. Between attending classes, fending off the pesky Team Rocket—who themselves have arrived in Alola to pave the way for their organization's future plans—and taking on the Island Challenge that is necessary to master the Z-Moves, Satoshi and Pikachu are in for an exciting new adventure.
[Written by MAL Rewrite]
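Under the hood, a search like the one above is a brute-force nearest neighbor lookup: score every synopsis embedding against the query embedding and keep the best few. The sketch below is my own toy version in Go, not the code from the Cohere post. It assumes dot product scoring, and the 2-dimensional vectors are made up. In a real run the query would be embedded with the same model as the synopses.

```go
package main

import (
	"fmt"
	"sort"
)

// doc pairs a title with its embedding (hypothetical layout).
type doc struct {
	Title     string
	Embedding []float64
}

// match is a search result with its similarity score.
type match struct {
	Title string
	Score float64
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// topK scores every doc against the query and returns the k best matches,
// highest score first.
func topK(query []float64, docs []doc, k int) []match {
	matches := make([]match, 0, len(docs))
	for _, d := range docs {
		matches = append(matches, match{Title: d.Title, Score: dot(query, d.Embedding)})
	}
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].Score > matches[j].Score
	})
	if k < 0 {
		k = 0
	}
	if k > len(matches) {
		k = len(matches)
	}
	return matches[:k]
}

func main() {
	// Made-up vectors, only to show the ranking mechanics.
	docs := []doc{
		{Title: "Pokemon", Embedding: []float64{0.9, 0.1}},
		{Title: "Pokemon Sun & Moon", Embedding: []float64{0.8, 0.3}},
		{Title: "Some cooking show", Embedding: []float64{0.1, 0.9}},
	}
	query := []float64{1.0, 0.0} // pretend embedding of the search text

	for _, m := range topK(query, docs, 2) {
		fmt.Printf("%-20s %.2f\n", m.Title, m.Score)
	}
}
```

Scanning roughly 20,000 vectors per query is cheap enough that there is no need for an approximate index at this size.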
I did also think about trying the new Cohere rerank API, but my lizard brain said this project was nearing an end. I had but one more thing to do.
Uploading to HuggingFace
I’ve downloaded things through the HuggingFace SDKs before, but I’d never uploaded anything. It was easy though. You need to set up Git with Large File Storage (Git LFS), and then you basically just treat everything like a git repo. ezclap.
I uploaded just a simple JSON Lines file, instead of doing anything more complicated or compressed, and HuggingFace was able to parse it and show a preview on the README.
Wrapping up
Cool. Thanks for reading. Check out the dataset and code to get started: https://huggingface.co/datasets/abatilo/myanimelist-embeddings
See y’all for the next one.