LAION, The Pile, and more datasets

Dec 14, 2022

What's actually used to train these large models? A brief look at some of the datasets involved.

LAION-5B

Stable Diffusion was trained on subsets of LAION-5B, a dataset released by LAION ("Large-scale Artificial Intelligence Open Network") that comprises 5.85 billion image-text pairs crawled from the internet. The underlying crawled pages come from Common Crawl.
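
LAION distributes only metadata (image URLs plus their alt-text captions); you fetch the images yourself, typically with a tool like img2dataset. Here's a minimal sketch of peeking at one metadata shard with pandas — the filename and the URL/TEXT column names are assumptions about how the parquet files are laid out:

    import pandas as pd

    # Hypothetical local shard of LAION metadata; the real dataset is split
    # across many parquet files.
    SHARD_PATH = "laion-metadata-shard.parquet"

    df = pd.read_parquet(SHARD_PATH)

    # Assumed column names: "URL" (image link) and "TEXT" (alt-text caption).
    pairs = df[["URL", "TEXT"]].dropna()

    # Print a handful of image-text pairs.
    for _, row in pairs.head(5).iterrows():
        print(row["URL"], "->", row["TEXT"][:80])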

Common Crawl

Common Crawl contains about 3.15 billion pages, totaling roughly 380 TB. OpenAI's GPT-3 was trained, in part, on data from Common Crawl. Common Crawl is a non-profit founded by Gil Elbaz in 2007 (Elbaz previously founded Applied Semantics, which Google acquired in 2003 for $102mm and which later became AdSense).
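
Each crawl is published as a set of WARC archives. A rough sketch of walking through one archive with the warcio library — the filename is a placeholder for whatever you grab from the crawl's file listings:

    from warcio.archiveiterator import ArchiveIterator

    # Placeholder for a locally downloaded archive from one of the crawls.
    WARC_PATH = "sample.warc.gz"

    with open(WARC_PATH, "rb") as stream:
        for record in ArchiveIterator(stream):
            # "response" records hold the fetched pages; other record types
            # carry request and metadata information.
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")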

The Pile

The Pile is a set of 22 smaller datasets, combined into a roughly 825 GB corpus that was used to train GPT-J; a sketch for reading one of its released shards follows the list.

  • A filtered subset of Common Crawl
  • PubMed Central
  • "Books3" a collection of ebooks downloaded from Bibliotik
  • OpenWebText2 – pages scraped from URLs submitted to Reddit with a score of 3 or higher
  • ArXiv
  • GitHub
  • FreeLaw
  • Stack Exchange
  • USPTO Backgrounds
  • PubMed Abstracts
  • Gutenberg
  • OpenSubtitles
  • Wikipedia
  • DM Mathematics
  • Ubuntu IRC
  • BookCorpus2 – a set of ~18k books; an expanded version of the original BookCorpus (books by as-yet-unpublished authors, originally scraped from Smashwords)
  • EuroParl
  • Hacker News
  • YouTube Subtitles
  • PhilPapers
  • NIH ExPorter
  • Enron Emails
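
The Pile was released as zstandard-compressed JSONL shards, each line a document with a "text" field and a "meta" field naming which of the 22 subsets it came from. A quick sketch for peeking at one shard — the filename is a placeholder, and the exact meta layout is assumed from the release format:

    import io
    import json
    import zstandard as zstd

    SHARD_PATH = "00.jsonl.zst"  # placeholder shard name

    with open(SHARD_PATH, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            # Assumed meta layout: {"pile_set_name": "<subset>"}
            subset = doc.get("meta", {}).get("pile_set_name", "unknown")
            print(subset, "|", doc["text"][:60].replace("\n", " "))
            break  # just look at the first document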

GPT-3 dataset

The book corpora used aren't specified in the GPT-3 paper, most likely because they come from gray-area sources like Bibliotik.

  • Common Crawl
  • OpenWebText2
  • Books1 (most likely Project Gutenberg)
  • Books2 (most likely BookCorpus?)
  • Wikipedia