About Lexman Artificial
tl;dr GPT-3 is used to generate transcripts that are then read by TorToiSe.
Transcript generation
The transcripts are generated by a series of prompt completions using OpenAI’s GPT-3. First, 2-6 nouns are drawn from a large wordlist, then a guest is drawn from a set of ~200 guests I have voices for (more on that below). Then, using those keywords and the guest's name, a summary of the podcast is generated.
Lexman Artificial Podcast
Episode 123
Guest: Elon Musk
Keywords: lightsaber, horse, bodysnatcher
Summary:
Elon Musk joins Lexman for a wide-ranging discussion about his life and work, from
his early days as a disruptive entrepreneur to his vision for the future of
technology and transportation. They also talk about his love of horses, his passion
for video games, and his experience being "bodysnatched" by aliens.
The summary is then used to generate the title.
Lexman Artificial Podcast
Guest: Elon Musk
Keywords: lightsaber, horse, bodysnatcher
Summary: Elon Musk joins Lexman for a wide-ranging discussion about his life and work, from
his early days as a disruptive entrepreneur to his vision for the future of technology and
transportation. They also talk about his love of horses, his passion for video games, and
his experience being "bodysnatched" by aliens.
Title: Elon Musk: A Ride with The Master of Disruption
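To make the chaining concrete, here's roughly what those two completions look like with the (pre-1.0) openai Python client; the wordlist, guest list, and exact prompt wording here are placeholders, not the real ones:

```python
import random
import openai  # classic pre-1.0 openai client; assumes OPENAI_API_KEY is set in the environment

# Placeholder inputs: the real wordlist and guest list are much larger.
WORDLIST = ["lightsaber", "horse", "bodysnatcher", "submarine", "jazz", "taxidermy", "volcano", "chess"]
GUESTS = ["Elon Musk", "Joe Rogan", "Grimes"]

def complete(prompt, max_tokens=256):
    # "curie" is the engine used for the podcast; davinci works too but costs ~10x more.
    resp = openai.Completion.create(
        engine="curie", prompt=prompt, max_tokens=max_tokens, temperature=0.9
    )
    return resp["choices"][0]["text"].strip()

keywords = random.sample(WORDLIST, k=random.randint(2, 6))
guest = random.choice(GUESTS)

# Step 1: generate the summary from the guest and keywords.
summary_prompt = (
    "Lexman Artificial Podcast\n"
    f"Guest: {guest}\n"
    f"Keywords: {', '.join(keywords)}\n"
    "Summary:"
)
summary = complete(summary_prompt)

# Step 2: feed the summary back in to get a title.
title = complete(summary_prompt + " " + summary + "\nTitle:", max_tokens=32)
```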
A similar process is used to generate the introduction and the sponsor message. The actual conversation is where it gets a bit tricky. It's a balancing act between keeping GPT-3 generating and not going off the rails. Left to its own devices it will often close out the show early and start to read from uninitialized memory, so to speak.
I've found that generating the transcript line-by-line and using a linebreak as the stop token is a very reliable way to generate transcripts. However, this quickly becomes very expensive since you pay for total tokens, prompt included: the last few lines will cost as much to generate as the entire transcript up to that point.
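For reference, the reliable-but-expensive variant looks roughly like this (a sketch, not the exact code); note how every request re-sends the whole transcript so far as the prompt, which is where the cost comes from:

```python
import openai

def generate_line_by_line(prompt, max_lines=60):
    """Reliable but costly: every request re-sends the whole transcript so far."""
    transcript = prompt
    for _ in range(max_lines):
        resp = openai.Completion.create(
            engine="curie",
            prompt=transcript,
            max_tokens=80,
            stop=["\n"],      # one spoken line per request
            temperature=0.9,
        )
        line = resp["choices"][0]["text"].strip()
        if not line:          # GPT-3 has nothing more to say
            break
        transcript += "\n" + line
    return transcript
```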
What I've ended up with is a compromise that uses a bunch of heuristics (read: regexps) that try to detect what's going on and adjust the prompt and parameters to keep it generating until some sort of conclusion. This allows me to create the dialogue in 2-3 requests. It also removes garbage output, like when GPT-3 decides it's time to start on the second episode or brings in a third guest, etc.
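The heuristics themselves are nothing exotic. The sketch below gives the flavour, with illustrative regexps standing in for the real ones and a hypothetical guest-name check:

```python
import re

# Stand-in heuristics; the real set of regexps is more involved.
END_OF_SHOW = re.compile(r"thanks for (coming on|listening)|end of transcript", re.I)
NEW_EPISODE = re.compile(r"^episode\s+\d+", re.I | re.M)

def clean_chunk(chunk: str, guest: str) -> str:
    """Cut a generated chunk short as soon as it goes off the rails."""
    # A speaker tag that is neither the host nor the booked guest.
    stranger = re.compile(rf"^(?!Lexman:|{re.escape(guest)}:)[A-Z][\w .]*:", re.M)
    cut = len(chunk)
    for pattern in (NEW_EPISODE, stranger):
        m = pattern.search(chunk)
        if m:
            cut = min(cut, m.start())
    return chunk[:cut]

def reached_conclusion(chunk: str) -> bool:
    """Did GPT-3 wrap up the show on its own?"""
    return bool(END_OF_SHOW.search(chunk))
```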
As you can hear if you listen to a couple of episodes, this works fairly well, but there are occasional mix-ups where the host and guest say each other's lines or they say something like "Thanks for coming on the show! End of transcript".
The GPT-3 engine I'm using is "curie", the less capable predecessor of the "davinci" engine. This is mostly because it's an order of magnitude cheaper (one transcript costs ~$0.04 with curie and ~$0.4 with davinci) but also because I find the curie transcripts to be more entertaining. It tends to veer off in more bizarre directions while davinci keeps it more realistic and dry.
All that said, I think there is a lot that could be done to improve the transcript prompts. But what I really want to do is train my own GPT model on podcast transcripts. I'm currently in the process of building up a large dataset of transcripts I can use to train it.
Text-To-Speech
To turn the transcripts into speech I use James Betker’s TorToiSe. It's the real magic behind the podcast. It takes short text snippets and some guiding voice samples to create really convincing speech.
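If you haven't used it, driving TorToiSe from Python looks roughly like this (following the public tortoise-tts API; the voice name here is a placeholder and exact arguments can vary between versions):

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# A "voice" is just a directory of short reference clips that TorToiSe conditions on.
voice_samples, conditioning_latents = load_voice("lexman")  # placeholder voice name

gen = tts.tts_with_preset(
    "Here's my conversation with Elon Musk, enjoy!",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="standard",  # quality/speed trade-off
)
torchaudio.save("line_001.wav", gen.squeeze(0).cpu(), 24000)
```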
TorToiSe is aptly named: it creates amazing results, but it is very slow. On my 6900 XT, one 3-minute episode takes about 30 minutes to render. It also has some interesting quirks. One quirk I had to fight with to make this podcast is its tendency to switch voice mid-sentence. To illustrate, here's Joe Rogan taking over from Lexman mid-sentence (headphone warning):
This isn't a common occurrence, and when generating things manually it's not a huge problem to just re-generate that particular clip, but for my purposes it was a showstopper. Initially I tried to work around it by training a classifier to detect when this happens and automatically reject the clip. This wasn't very successful: it would catch obvious clips like the one above, but not when the voice changed to what to me is clearly a different person with the same intonation and mode of speech.
What I ended up doing instead was fine-tuning the TorToiSe models on several hundred hours of clips from the Lex Fridman podcast. I picked Lex's podcast because it's one of my favourites and he frequently talks about the concepts I want to explore with this project. His podcast was initially called "the Artificial Intelligence Podcast" which I also find very fitting.
I won't go into the details of how I fine-tuned the model since there are some concerns with this being used maliciously. If you're interested in fine-tuning your own model, to save you some time I will say that a critical component needed to do it has since been removed from the public release, so you won't be able to do it without re-training some of the required models from scratch.
My fine-tuned model developed its own quirks. One quirk that you will hear a lot if you listen to the podcast is that it likes to repeat words at the end of a sentence. For example, it will often read "Here's my conversation with X, enjoy!" as "Here's my conversation with X, enjoy. Enjoy! ENJOY!". It also developed a speech impediment that comes out occasionally.
An example:
My model is also limited to voices represented in the training set, i.e. people who have been on the Lex Fridman podcast. Trying to use guiding samples that are not part of my fine-tuning data gives very poor results, so I have only ~200 guests for Lexman to speak with. But overall it's a huge improvement in how the episodes sound and, more importantly, it never generates something that's totally unusable, so I don't need to worry about having to re-generate anything.
Now all that is needed is a pipeline that uses these two components to fully automate the podcast. But before I get into that, I'd like to give a big shoutout to the author of TorToiSe, James Betker, not only for releasing his amazing work to the world for free but also for being very generous with his time, answering questions on GitHub and helping everyone, myself included.
If you're interested in this, and you probably are since you made it this far into my ramblings, you should definitely read his blog: Non_Interactive.
Automate all the things
The generation pipeline consists of a central Redis server that holds state for a bunch of different worker types. I picked Redis for this job because it gives me a key-value store, a job queue, and distributed locks.
There are 7 different worker types handling the different stages: transcript generation, TTS, stitching the audio, generating the artwork, rendering the video and finally releasing, which involves uploading the files to Cloudflare's R2 (their cheaper and faster AWS S3 competitor), uploading the video to Vimeo and finally posting a tweet (also composed by GPT-3). The last worker has a manager role and is unique (ensured with a Redis lock); it's tasked with keeping track of how many episodes need to be generated, etc. Each episode also has its own Redis lock that ensures only one worker at a time can interact with it, which helps a lot with managing the worker queues since I don't need to worry about scheduling that much.
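To give an idea of the shape of a worker, here's a rough sketch of what one worker's main loop could look like with redis-py; the key names and the render_tts helper are illustrative, not the actual code:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def tts_worker():
    """One of the worker types: pull an episode off this stage's queue,
    take the per-episode lock, do the work, then hand off to the next stage."""
    while True:
        # Block until there's a job on this stage's queue (key names are illustrative).
        _, raw_id = r.blpop("queue:tts")
        episode_id = raw_id.decode()

        # Per-episode lock so no other worker touches it at the same time.
        with r.lock(f"lock:episode:{episode_id}", timeout=60 * 60):
            state = json.loads(r.get(f"episode:{episode_id}"))
            state["audio_chunks"] = render_tts(state["transcript"])  # placeholder for the TorToiSe step
            state["stage"] = "stitch"
            r.set(f"episode:{episode_id}", json.dumps(state))

        # Queue it up for the next worker type.
        r.rpush("queue:stitch", episode_id)
```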
This design was chosen because I wanted to be able to use cheap GPU workers from Google Colab and vast.ai.
Initially I thought synchronizing everything via a Google Drive folder would be a convenient way to do it, since that's easily accessible from Python notebooks. So I built it with a filesystem database, with one directory in the drive representing each episode. This was a horrible mistake: I spent so much time duct-taping everything and trying to figure out why files went missing that I finally broke down and spent a full day rewriting everything using Redis.
The episode artwork is generated using Latent Diffusion. Currently the worker just picks one image from a big folder of "lexman" images I pre-generated, but it would be cool to have it generate custom artwork for each episode based on the show description or title.
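That part of the worker is about as simple as it sounds; roughly something like this (paths are illustrative):

```python
import random
import shutil
from pathlib import Path

ARTWORK_DIR = Path("artwork/lexman")  # folder of pre-generated Latent Diffusion images

def pick_artwork(episode_dir: Path) -> Path:
    """Grab one of the pre-generated images to use as this episode's cover."""
    chosen = random.choice(list(ARTWORK_DIR.glob("*.png")))
    target = episode_dir / "cover.png"
    shutil.copy(chosen, target)
    return target
```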
I have a notebook version of the TTS worker that I can just load up in Colab or any iPython instance to start generating chunks. This system allows me to generate episodes as fast as I'm willing to throw compute at it.
I'm making video versions of each episode because I couldn't find a good place (= low cost and with an API) to host the episodes in a way that would allow them to be played inline in a tweet. The only option I found was SoundCloud, but their API client for Python didn't work, so I chose Vimeo since I already have experience with their APIs and know them to work well.
Long-term I will probably decrease the release frequency to once or twice per day and run the TTS worker on a cheap rental somewhere; in my tests, a Ryzen 3600 can eke out an episode every 12 hours or so without a GPU.
That's all there is to say about this project from a technical perspective, I think. If you have any questions, feel free to drop me a line at hello at johan dash nordberg dot com or contact me on Twitter. And if you find yourself asking "but why?", you can read my artist statement.