In Defense of the Jupyter Notebook

Jul 19, 2021

Programming paradigms are changing – and most software developers hate it. Why?

Jupyter notebooks are an open-source tool that data scientists use for everything from cleaning or visualizing data to training machine learning models. Notebooks usually run languages like Python but also support languages like R, Julia, and Scala.

Developers don't like notebooks because it makes their job harder. But, I argue that developers can learn a lot from the new paradigms that data scientists discover and use some of those tools to their advantage.

Interactive programming. Notebooks let data scientists run lines of code in any order. Program state is persistent across each cell run. If I write some code that declares a variable x, run the cell, then delete that line of code, the variable x is still stored. This pollution of the global namespace causes hard-to-identify bugs and unintended consequences.

Defense: Iteration speed is more critical for data scientists than correctness, especially in the exploration stage. The entire notebook can be re-run (like a typical program) to get more reproducible behavior.

No Separation of Concerns. Good developers know how to create good boundaries in their programs. For example, presentation logic should be separated from business logic and the underlying infrastructure logic. Notebooks put everything in one place. Documentation, presentations like graphs and tables, data cleaning and retrieval, and maybe even SQL all live in the same file.

Defense: The bundling of concerns directly responds to the difficulty in scaffolding out a new program. Separating concerns is costly - APIs need to be defined, and service boundaries need to be carved out. Often, these layers are so interlinked that the cheapest option is to handle them all simultaneously.

Difficult to version. Since Jupyter notebooks combine raw text, markup, graphs, and code, the file format needs to be verbose. This mashup of types makes it difficult to understand any differences in any of the components (e.g., code, markup, text).

Defense: This is a fundamental but solvable problem. We need a richer semantic diff (see my Twitter thread) or a more imaginative configuration format for the notebook.

Difficult to reuse code. Developers use abstractions like functions or classes to reuse code. While you can use these in notebooks, they are lengthy and awkward to put into notebook cells.

Defense: Nothing stops developers from using these abstractions in notebooks. I imagine a hybrid approach can solve the verbosity issue - notebooks and Python code together.

So, what notebook innovations can developers use?

  1. Early on, go fast. Then, go slow. Sacrificing correctness early in the development process can significantly speed up overall iteration time as the reproducibility is course-corrected.
  2. Cloud development. Notebooks take the language runtime and separate it from the development environment, making them prime candidates for cloud development (i.e., run the notebook in your browser or run the notebook locally and Python kernel in the cloud).