A Personal Training Corpus

Oct 12, 2022

Thomas Kinkade, the self-proclaimed "Painter of Light," produced over 1,000 paintings in his life. His company claimed that his paintings hung in "one out of every twenty American homes." Nevertheless, art critics generally wrote Kinkade off, describing his artwork as kitsch, naively imitative, gratuitous, or lacking in deeper thought.

But Kinkade might survive them all. Even though Kinkade passed away in 2012, current AI models often replicate his style and generate thousands more Kinkade-like paintings every day. That's because his work is the most represented in the Stable Diffusion training set, with 9,268 images (the second is Vincent Van Gogh at 8,378).

I listened to a podcast today that was an autogenerated conversation between Joe Rogan and Steve Jobs (by podcast.ai). While it's a fictional interview, it captures the essence of both Steve Jobs and Joe Rogan. How? Most likely by drawing on the thousands of hours of podcast audio from Joe Rogan's show and the thousands of hours of video and audio of Steve Jobs.

Our personal training data corpus might be the most important thing we produce. Content that helps machines capture our unique style, tone, and essence. Those who create the most will have the best data set to train a model in their likeness.

Latin: corpus, corporis [n.] – body, substance, person, individual