Open-sourced GPT-J

Nov 9, 2021

Initially a skeptic, I've become a frequent user of GitHub Copilot, the AI code-suggestion extension for VS Code. After using it for a while, I wanted to know more about how it works.

GitHub Copilot uses the OpenAI Codex model, which is itself based on OpenAI's GPT-3. Neither Codex nor GPT-3 is open-source. But a group of amateur researchers has been trying to recreate the model in the open, publishing under a group they call EleutherAI.

First, they collected a dataset they call The Pile. It's made up of 22 smaller datasets – Stack Exchange Q&A, Wikipedia, GitHub code, and even some dumps from private ebook torrent sites (OpenAI might be using the same greyhat book dataset).
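You can poke at The Pile yourself: it's mirrored on the Hugging Face Hub and can be streamed without downloading all ~800 GB of it. Here's a minimal sketch; the dataset ID and record fields are assumptions based on the public mirrors, so check the Hub before relying on them.

```python
# Minimal sketch: stream a few documents from The Pile with the Hugging Face
# `datasets` library. The dataset ID below is an assumption (mirrors have moved
# around); the `text`/`meta` fields follow the official jsonl dumps.
from datasets import load_dataset

pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, doc in enumerate(pile):
    # `meta` records which of the 22 component datasets the document came from.
    print(doc["meta"], doc["text"][:200])
    if i >= 2:
        break
```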

Next, they were able to get TPU (Google's Tensor Processing Unit) credits from Google Cloud to train the model. These credits were allegedly granted in exchange for something along the lines of (1) writing the code to use TPU features and (2) including attribution in their research papers and blog posts.

You can play with a demo of GPT-J here. It's clearly not as good as GPT-3, but the code, weights, and dataset are all open-sourced – so maybe it will improve at a faster rate. Maybe someone will come along and fine-tune the model for a specific domain where it happens to work extremely well.
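The weights also load through the Hugging Face transformers library as the EleutherAI/gpt-j-6B checkpoint, so you can generate text locally instead of using the demo. A rough sketch, assuming a machine with enough memory for a 6-billion-parameter model (roughly 12 GB in half precision):

```python
# Rough sketch: load the open GPT-J-6B weights and sample a completion.
# Assumes enough RAM/VRAM for the 6B-parameter checkpoint (~12 GB in fp16).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "def fizzbuzz(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Fine-tuning for a narrow domain would start from this same checkpoint, just trained further on domain-specific text.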