PetoriaGPT: Training GPT-J on trash

A perfect example of why I shouldn't have money

posted 2023-03-13

Me and my friends reside in a comfy Discord server called The Republic of Petoria. After the leak of LLaMA, we had an idea: what if we took all of our conversations and finetuned an AI with them?

Well… it’s possible. With $10 and a bit of hackery, we can shove our conversations into GPT-J. The idea is to feed it the last X messages from a conversation and output the responses through a bot.

Me and Luna worked together to make this happen. Shitpost engineering is great!

Step one: get the chats out of Discord

Because I’m not asking the 80+ people in this server to request a data dump and DM me some of the files, we need to use a bot to scrape the messages.

There exist multiple tools to extract conversations from Discord (most of which can be run from a bot token), but luckily, my work was already done - Luna’s bot, José, already has a j!archive command to export a zip file for me.

This zip file contains a bunch of NDJSON files, one file per channel, which we can parse and turn into a text file.
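For the curious, reading an archive like that could look roughly like the sketch below. The `.ndjson` extension and the `read_archive` helper are assumptions for illustration; the actual layout of José's export isn't shown here.

```python
import io
import json
import zipfile


def read_archive(zip_bytes: bytes) -> dict[str, list[dict]]:
    """Return {file name: [message dicts]} for every NDJSON file in the zip."""
    channels = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".ndjson"):  # assumed extension
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8")
                # NDJSON: one JSON object per line; blank lines are skipped.
                channels[name] = [json.loads(line) for line in text if line.strip()]
    return channels
```

Each channel ends up as a plain list of dicts, which is easy to filter and flatten into a text file afterwards.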

The format we came up with was <id>-<message>, where <id> corresponds to the author of the message. We decided to map user IDs to an index starting from zero, because we feared that tossing raw snowflakes in there would make the model generate random, nonexistent users. This has the added benefit that we can take the ID modulo the number of users in case the model produces a number that is too large.

We also added the following processing:

  • Remove messages older than a year
  • Don’t include my messages, because they made up a significant chunk of the data and inflated the file size
  • Remove URLs
    • Mentions and emotes are kept - I didn’t care enough to remove them

An example line looks like this: 0-"good morning everyone". The message content is escaped with JSON.stringify (I wish I was joking). The quotes aren’t really needed there, I just forgot to remove them. :P

We also store a mapping file, which looks like 164469624463818752=0 (mapping Discord ID to training set ID). This is so the bot knows what to set the usernames to when generating conversations.
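Here is a minimal sketch of how the lines and the mapping file could be produced. It uses Python's `json.dumps` as a stand-in for `JSON.stringify`, and the function names are made up for illustration. One upside of the JSON escaping: a multi-line message still ends up on a single line.

```python
import json


def build_dataset(messages: list[tuple[str, str]]) -> tuple[list[str], dict[str, int]]:
    """Turn (discord_id, content) pairs into training lines plus the ID mapping."""
    mapping: dict[str, int] = {}
    lines = []
    for discord_id, content in messages:
        # First-seen order: each new author gets the next index, starting at 0.
        index = mapping.setdefault(discord_id, len(mapping))
        # ensure_ascii=False leaves non-ASCII text unescaped, like JSON.stringify.
        lines.append(f"{index}-{json.dumps(content, ensure_ascii=False)}")
    return lines, mapping


def dump_mapping(mapping: dict[str, int]) -> str:
    """Serialize the mapping as one `<discord id>=<index>` entry per line."""
    return "\n".join(f"{discord_id}={index}" for discord_id, index in mapping.items())
```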

Step two: figure out where to do this

I could run it on my own hardware, but I have an AMD GPU, and my Arch Linux partition with ROCm broke earlier this year. Luna’s PC might’ve been able to do it, but we were unsure about VRAM constraints.

We considered three hosts to rent a server from: Google Colab, Paperspace, and RunPod. We ended up choosing RunPod as they seemed to be the cheapest for what we needed. I chucked $10 into a fresh account and purchased a machine with an A6000 in it, set up for PyTorch.

Step three: figure out how to do this

After reading a lot of blog posts, we found this GitHub repository, which seemed to do the magic for us.

We split the dataset in half and then edited line 226 (dataset =), using one half for training and one half for testing. We also had to modify line 187 and fix an error by adding a .clone() (I don’t remember where :P).
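The split itself is about as simple as it sounds. A sketch, assuming the dataset is a list of lines and the halves are contiguous (which keeps each conversation's message order intact); how the split was actually done isn't recorded here.

```python
def split_in_half(lines: list[str]) -> tuple[list[str], list[str]]:
    """Split lines into two contiguous halves; the first half gets any extra line."""
    middle = (len(lines) + 1) // 2
    return lines[:middle], lines[middle:]
```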

It ran in a tmux session for about five or six hours, costing me roughly $5. While it ran, I wrote a Discord bot using the provided API script.

Step four: Run it on our own hardware

I wasn’t going to pump all my money into RunPod, so we wanted to try and run it on our own hardware. First up, me!

After having to fix my Python install and almost bricking my command prompt, a realization hit me: I am a Windows user with an AMD card.

DirectML will allow me to use Torch with an AMD card, but the repository we were using depends on the bitsandbytes library, which uses CUDA APIs. With it removed, the script would quickly eat all 32 GB of my RAM, so it was clearly important. I have no idea how to replace it (or really what it does :P), so I asked a friend to let me Tailscale SSH into his laptop, which has an RTX 2080 Mobile with 8 GB of VRAM.
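A rough back-of-envelope on why that library probably mattered: bitsandbytes is commonly used to hold model weights (or optimizer state) in 8 bits. Assuming that's what the repository used it for, which is a guess, the weight memory for GPT-J's roughly 6 billion parameters works out like this. It counts weights only, not activations or any other overhead.

```python
GPT_J_PARAMS = 6_050_000_000  # roughly 6 billion parameters


def weight_gib(params: int, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return params * bytes_per_param / 2**30


for label, size in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {weight_gib(GPT_J_PARAMS, size):.1f} GiB")
```

At one byte per parameter the weights come in under 6 GiB, which is at least consistent with the model squeezing onto an 8 GB card.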

His Python install was, funnily enough, also broken - but after fixing it, the bot worked!

Step five: Enjoy the chaos

The Discord bot would take the last 5 messages of the conversation, feed them to the model, and ask it to continue the conversation. After a lot of spamming, we put a five-minute cooldown on it, but here’s what it said before that:
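The bot-side glue can be sketched as follows. None of this is the actual bot code: `build_prompt`, `parse_line`, and `Cooldown` are hypothetical helpers, and the Discord and model calls are left out entirely. The sketch covers formatting the last 5 messages in the training format, mapping generated IDs back to usernames (with the modulo trick for out-of-range IDs), and the five-minute cooldown.

```python
import json
import time
from typing import Optional


def build_prompt(history: list[tuple[int, str]], context: int = 5) -> str:
    """Format the last `context` (index, content) messages as training-style lines."""
    recent = history[-context:]
    return "\n".join(f"{i}-{json.dumps(c, ensure_ascii=False)}" for i, c in recent) + "\n"


def parse_line(line: str, names: list[str]) -> Optional[tuple[str, str]]:
    """Turn one generated `<id>-<message>` line into (username, text), or None."""
    head, sep, body = line.partition("-")
    if not sep or not names:
        return None
    try:
        index = int(head)
    except ValueError:
        return None
    # Wrap out-of-range IDs back onto a real user.
    name = names[index % len(names)]
    try:
        text = json.loads(body)
    except json.JSONDecodeError:
        text = body  # truncated or malformed output: fall back to the raw text
    return name, str(text)


class Cooldown:
    """Allow one use every `seconds` seconds (300 = five minutes)."""

    def __init__(self, seconds: float = 300.0):
        self.seconds = seconds
        self.last_used: Optional[float] = None

    def ready(self, now: Optional[float] = None) -> bool:
        """Return True (and start the cooldown) if enough time has passed."""
        now = time.monotonic() if now is None else now
        if self.last_used is not None and now - self.last_used < self.seconds:
            return False
        self.last_used = now
        return True
```

The fallback in `parse_line` matters because generation can stop mid-message, leaving a line with an unterminated quote.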

<Dissy> "good game"
<Dissy> "good"
<jay!> "lol"
<jay!> "lol"
<jay!> "lol"
<jay!> "lol"
<jane> "i want the pain"
<Dissy> "the way chatbots do"
<jane> "
<jane> "this shit is hilarious and weird"
<Dissy> ""i wish youd fucking die"
<Dissy> ""
<Andokai> "An AI"
<Andokai> "Can an AI scream"
<Andokai> "an AI can scream"
<Andokai> "an"
<Andokai> "an ai scream"
<beef stroganoff> "hello"
<beef stroganoff> "blood to the blood god"
<beef stroganoff> "blood god"
<beef stroganoff> "bloodgod"
<beef stroganoff> "bloodgod"
<beef stroganoff> "

Final thoughts

This sucks. Read Luna’s article too.