Modern Samizdat Libraries

Jul 28, 2023

Samizdat (“self-publishing” in Russian) was the practice of illegally copying and distributing books, manuscripts, and other materials to evade Soviet censorship.

Samizdat started with Russian literature and expanded to politically focused materials, and it was later reimagined for hacker culture. When Bell Labs made UNIX source code illegal to distribute, John Lions's book A Commentary on the Sixth Edition UNIX Operating System (which contained an annotated version of the source code) was withdrawn. Hackers referred to the illegal copying and distribution of the book as samizdat.

In the 90s, the Russian samizdat culture moved online to RuNet (the Russian Internet). Many of the efforts were focused on book digitization. Eventually, these efforts were unified under a single archive called Library Genesis.

Library Genesis has over 2.4 million non-fiction books, 80 million scientific journal articles, 2.2 million fiction books, and 2 million comics. The files are served from mirror sites, and the full archive can also be downloaded via torrents.

Library Genesis is obviously illegal around the world. Its existence poses a philosophical trade-off between democratic access to information and the rights and incentives of copyright holders. As recently as 2014, Sci-Hub (a similar archive for scholarly articles) was hosted on Library Genesis. Sci-Hub also has its roots in authoritarian states (its founder, Alexandra Elbakyan, is from Kazakhstan and studied in Russia).

There’s renewed interest in the archive with the advent of large language models. Many have speculated that the “books1” and “books2” datasets used to train GPT-3 are e-book dumps from Library Genesis. Sci-Hub has undoubtedly found its way into the training data of many LLMs.