Project Structure in Data Science
Introduction: Why organize your work?
We’ve all heard it at one point or another in our lives: To get credit, you need to show your work. In data science, there’s a similar idea with organizing your work. Because creating an analysis usually requires writing code, it’s easy enough to point to it as the “work” that is “shown,” but anyone who’s spent time digging into someone else’s code knows that when it comes to understanding what it’s doing, there are orders of magnitude of difference between a well-organized repository and a series of Jupyter notebooks held together with the coding equivalent of duct tape. When working in teams, organizational norms at a bare minimum prevent a codebase from collapsing into an intractable mess, but ideally they allow large numbers of contributors to work together efficiently and with minimal interference.

However, organizing data science projects isn’t just about helping other people (whether it’s you in six months, your immediate teammates, or a broader community) understand your code. Organizational norms are also powerful mental frameworks. They force you to break down hard problems into smaller, more manageable tasks. This is in part why teachers are so fixated on having their students show their work. Writing the knowns and unknowns in a physics question or each step of the long division algorithm makes you slow down, take stock of the problem, and execute each step properly. Project structures in data science work similarly. Rather than trying to clean, model, and interpret your data all at once, they allow you to focus on smaller, more manageable tasks where the outputs of one step flow smoothly into the inputs of another.
This post is a bit of a manifesto for my approach to project structure. I should note that my perspective comes from an academic flavor of data science where the work is done independently or in small teams and the data sets are generally static, but I’ve sought to make my approach general. When I started writing code for data analysis, I originally followed William Noble’s guide, but over the years I’ve modified his suggestions into a more modern structure. Many of my ideas are also similar to the recommendations from the Cookiecutter project. It’s an excellent reference for data science project structure that has refined my own thinking, so I highly recommend checking it out for an alternate take on this topic.
The rest of this post is divided into two parts. In the first, I’ll give an overview of my recommended project structure, and in the second, I’ll discuss some broader principles that inform how that structure is put into practice.
Directory structure
The following structure is designed for primarily computational projects using the “pipeline” model, i.e., those that begin with a limited number of distinct and static data sets and transform them over a series of operations into a set of outputs. It’s intended for ad hoc analyses of specific data sets rather than for the development of standalone, installable packages. However, its structure does encourage good software design practices like modularity via the separation of concerns, so a project developed with this framework could be converted into a re-usable pipeline without many modifications. Additionally, though it contains some Python-specific elements for demonstration purposes, this structure is compatible with any scripting language commonly used for data science, or even a mix of them!
For the experimentalists out there, this structure will likely require some modifications to accommodate projects with a wet lab component. If there aren’t many experiments, protocols and other documentation can be stored alongside each data set in a dedicated subdirectory. Projects that are primarily wet lab with shorter and more contained analyses may require a more “experiment-centric” structure, however. In other words, I don’t necessarily recommend this structure as a lab notebook in its current form, but its overall ideas of separating data, code, and outputs would still apply.
project_root/
├── data/ <- All raw data
│ ├── data_set_1/ <- Put different data sets in different subdirectories
│ ⋮
│ └── data_set_n/
├── references/ <- Data dictionaries, manuals, log entries, and all other explanatory materials
├── docs/ <- Formal documentation systems, e.g. Sphinx; not necessary for most projects
├── bin/ <- "Binaries," i.e., external programs used in this project
├── code/ <- All code written specifically for this project
│ ├── src/ <- Re-usable functions for common tasks across project
│ │ ├── __init__.py <- Necessary to make src/ a Python package
│ │ ├── module_1.py <- Organize functions into related modules
│ │ ⋮
│ │ └── module_n.py
│ ├── scripts/ <- Code to execute individual pipeline steps
│ │ ├── clean_data.py <- Give files readable but brief names
│ │ ├── fit_models.py
│ │ └── make_plots.py
│ └── notebooks/ <- Notebook-style code for exploratory analyses
├── outputs/ <- All outputs produced by the pipeline
│ ├── clean_data/ <- Subdirectories should match the name of the script that produced them
│ ├── fit_models/
│ └── make_plots/
├── reports/ <- Formal reports compiled from individual outputs; not necessary for many projects
├── logs/ <- Log files produced by workflow managers and other programs
├── workflow.nf <- Workflow file; a Nextflow file here, but many options are available
├── README.md <- Give a high-level overview of the purpose and structure of the project
├── LICENSE.txt <- Include a license to be explicit about how others can use your code!
├── requirements.txt <- Requirements file for reproducing the project's computing environment
└── .gitignore <- Files and folders ignored by Git (or some other VCS)
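To give a feel for how the pieces fit together, here's a sketch of what a script like code/scripts/clean_data.py might look like under this structure. The data set, file names, and cleaning step are hypothetical, and pandas stands in for whatever libraries the analysis actually uses:

```python
"""Clean the raw data and write the result to outputs/clean_data/."""

from pathlib import Path

import pandas as pd

# All paths are relative to the project root, where every script is assumed to run.
DATA_PATH = Path("data/data_set_1/measurements.csv")  # hypothetical raw file
OUTPUT_DIR = Path("outputs/clean_data")                # matches this script's name

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # Stand-in for real cleaning logic: drop rows with missing values.
    df = pd.read_csv(DATA_PATH)
    df = df.dropna()

    # Never overwrite the raw data; cleaned output goes to its own directory.
    df.to_csv(OUTPUT_DIR / "measurements_clean.csv", index=False)
```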
Principles
This next section dives deep into some principles that inform how the previous structure is used in practice. Some are directly related to explaining the purpose of individual directories whereas others are broader ideas that outline best practices for data science. I also wanted to credit the Cookiecutter project again since their post on this topic has influenced several of these principles.
Data analysis is a DAG
Data analysis is a series of operations where the outputs of one operation become the inputs of another. It’s convenient to visualize this as a flow chart or, more mathematically, a directed acyclic graph (DAG). In this model, operations are nodes and dependencies between them are edges. (The graph is directed to reflect the one-way dependence of the operations, and the graph is acyclic to prevent circular dependencies.) While it’s technically possible to describe a pipeline down to individual lines of code using this framework, the most natural units are “tasks,” which bundle a set of related operations into standalone, executable units. I say tasks here rather than processes because in the context of the parallel computing techniques that are increasingly necessary in an age of big data, an individual task may launch thousands of processes distributed across hundreds of machines. In practice, the division of a pipeline’s operations into tasks is highly context-dependent, and part of the art of data science is choosing when to merge or split related operations into tasks based on the needs of any stakeholders and the available computing resources, among other factors.
To make these concepts more concrete, let’s quickly examine a toy example. This “pipeline,” whose components are reflected in the previous project structure, consists of some number of input data sets and three tasks that create a set of outputs. More specifically, these tasks are Python scripts that clean the data, fit models, and plot the results. However, it’s more intuitive to visualize this pipeline as the following flow chart:
A simple, linear workflow.
From this perspective, it’s clear the pipeline is a directed acyclic graph. It’s directed because, for example, fit_models.py depends on clean_data.py but not the other way around, and it’s acyclic since there are no “loops.” This latter property is essential: if, say, cleaning the data depended on a model fit to the cleaned data, reproducing the analysis from scratch would be impossible.
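To make the graph representation concrete, here's a minimal Python sketch that encodes the toy pipeline as a dictionary mapping each task to its dependencies and derives a valid execution order with a topological sort. The task names follow the toy example; in a real project, a workflow manager handles this bookkeeping for you:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Each task maps to the set of tasks it depends on (i.e., its incoming edges).
pipeline = {
    "clean_data": set(),           # depends only on the raw data
    "fit_models": {"clean_data"},  # needs the cleaned data
    "make_plots": {"fit_models"},  # needs the fitted models
}

# A topological sort yields an execution order that respects every dependency
# and raises CycleError if the graph accidentally contains a circular dependency.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['clean_data', 'fit_models', 'make_plots']
```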
Because this example is extremely simple, it would be straightforward to hard code the paths between the inputs and outputs of the scripts and manually run them in sequence. However, this quickly becomes unmanageable for larger and more complex pipelines. For example, it’s easy to imagine that this pipeline could grow to encompass multiple cleaning, fitting, or plotting scripts with non-linear dependencies, like the following:
A complex, non-linear workflow.
While it’s still possible to run this pipeline manually, it’s significantly more difficult to remember and execute the scripts in the proper order, diverting valuable mental resources from high-level analysis to tedious bookkeeping. Instead, workflow managers are the proper “glue” for binding together the component tasks of a complex pipeline. There are many options available, each with its own strengths and weaknesses, but Snakemake and Nextflow are popular in the scientific computing community. All of them focus on automating the execution of pipelines, typically via workflow languages that describe the relationships between the inputs and outputs of tasks rather than the details of their execution. This high-level approach has the added benefit of discouraging the use of hard-coded paths, which greatly enhances the robustness and portability of a pipeline. Many workflow managers also include quality-of-life features such as integration with environment managers and cloud computing platforms, as well as the ability to execute only the tasks whose inputs or code have changed, minimizing duplicated work.
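Real workflow managers do far more, but the core of what they automate can be sketched in plain Python: run each task in dependency order and skip any task whose outputs are already newer than its inputs. The file names below are hypothetical and follow the toy pipeline above; this is an illustration of the idea, not a substitute for Snakemake or Nextflow.

```python
import subprocess
from graphlib import TopologicalSorter
from pathlib import Path

# Hypothetical tasks with their dependencies, inputs, and outputs. A workflow
# manager infers this structure from declared inputs/outputs instead of a dict.
tasks = {
    "clean_data": {"deps": set(), "inputs": [Path("data/data_set_1/raw.csv")],
                   "outputs": [Path("outputs/clean_data/clean.csv")]},
    "fit_models": {"deps": {"clean_data"}, "inputs": [Path("outputs/clean_data/clean.csv")],
                   "outputs": [Path("outputs/fit_models/model.pkl")]},
    "make_plots": {"deps": {"fit_models"}, "inputs": [Path("outputs/fit_models/model.pkl")],
                   "outputs": [Path("outputs/make_plots/fit.png")]},
}

def is_stale(name: str) -> bool:
    """A task is stale if any output is missing or older than any input."""
    ins, outs = tasks[name]["inputs"], tasks[name]["outputs"]
    if any(not out.exists() for out in outs):
        return True
    newest_input = max(p.stat().st_mtime for p in ins) if ins else 0
    oldest_output = min(p.stat().st_mtime for p in outs)
    return newest_input > oldest_output

# Execute the scripts from the project root in an order that respects dependencies.
graph = {name: spec["deps"] for name, spec in tasks.items()}
for name in TopologicalSorter(graph).static_order():
    if is_stale(name):
        subprocess.run(["python", f"code/scripts/{name}.py"], check=True)
```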
Projects are isolated from the file system
When writing code for data analysis, it can be tempting for many beginners to use absolute rather than relative file paths. After all, the absolute path from the root of the file system is the address of a file, which ensures any code using an absolute path will always unambiguously refer to a specific file, no matter where it is run. However, this approach immediately breaks down when more than one machine is involved. Often these other machines are your teammates’ computers. Just as you’ll likely store the project root somewhere under your home directory on your local machine, so will your teammates, and unless you all have the same name and file system structure, the code will quickly grind to a halt. Even when working individually or on a common file system in the cloud, it is sometimes necessary to run certain steps on more powerful computing infrastructure, which creates the same problems.
Instead, all paths should be relative rather than absolute. This raises the question of relative to what, exactly, since relative paths mean completely different things depending on where the code is run. My convention is to assume all code is run from the project root, because this allows relative paths to be written without all those dreaded double dots (..) that refer to parent directories. Jupyter notebooks usually run from the directory that contains them, which requires either using double dots or some additional configuration to change the default working directory. That said, I don’t recommend Jupyter notebooks for creating portable analyses anyway (see the section on the three types of code), so following a strict convention there is less important.
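For notebooks that do need root-relative paths, one workaround (among several) is to change the working directory at the top of the notebook. The snippet below assumes the layout shown earlier, using the data/ directory as a landmark for the project root:

```python
# First cell of a notebook in code/notebooks/: move to the project root so the
# rest of the notebook can use the same root-relative paths as the scripts.
import os
from pathlib import Path

# Walk upward until a directory containing data/ is found (raises StopIteration
# if no such directory exists, which usually means the notebook was misplaced).
root = next(p for p in [Path.cwd(), *Path.cwd().parents] if (p / "data").is_dir())
os.chdir(root)
```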
Sometimes it’s not possible to directly access all of a project’s resources under its root, for example, if a large data set is stored on an external or network drive. However, this issue is easily solved by mounting those drives and adding a link to the required files or folders in the data/ directory. Another example is linking to external programs in the bin/ directory if for whatever reason they are not on the system PATH. In both cases, the project’s documentation should give explicit directions on how to configure the computing environment properly (see the section on requirements for more details).
Data is immutable
The input data for a pipeline is sacred. If it’s not stored in a public database, it should be backed up in multiple locations, preferably far away from your working directories where a stray command or click may delete it forever. Even the “working” copy of a data set should never be edited directly, not by any code and especially not by hand in a text editor. Any manipulations or transformations should be saved as separate files. This preserves the state of the original data and any downstream intermediates, creating a record of work that can be examined and regenerated from any point in the pipeline.
Personally, I prefer to keep raw data in a dedicated data/ directory and any outputs in the outputs/ directory, as this further cements the distinction between the pipeline’s inputs and outputs. The structure of the data/ directory can be arbitrary, but I generally keep each data set in its own subdirectory. The outputs/ directory, on the other hand, should contain one subdirectory per script, named after that script, to ensure a clear correspondence between the two. (Like the data/ directory, though, the structure of each outputs/ subdirectory can be arbitrary, depending on the complexity and organization of those outputs.) In some cases, it may be appropriate to group related outputs from different scripts together. For example, some plots of a single data set may require extensive computation to generate and are better refactored into separate scripts, even though their outputs belong with the rest of the plots. At the same time, you should be mindful of separating computation from visualization. There are no hard rules here beyond ensuring your choices are clear through either formal documentation or intuitive naming.
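One lightweight way to keep the script-to-subdirectory correspondence from drifting is to derive the output directory from the script's own file name. This is just a convenience pattern, not a requirement of the structure:

```python
from pathlib import Path

# Inside a script such as code/scripts/fit_models.py, this resolves to
# outputs/fit_models/, so renaming the script automatically renames its outputs.
OUTPUT_DIR = Path("outputs") / Path(__file__).stem
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

(OUTPUT_DIR / "summary.txt").write_text("placeholder output\n")
```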
Version control code (and only key intermediates)
Once a project grows beyond a few scripts, it becomes convenient and eventually necessary to manage changes with a version control system. There are few things more frustrating than unintentionally breaking something and not being able to return the codebase to the way it was before, but I won’t harp on the benefits of version control here since there are plenty of other sources that make that argument. I do want to call attention to a unique challenge that data scientists face when versioning their projects, though. The files in a data science project can be largely divided into two buckets: code and data. Code, which encompasses literal code as well as documentation and configuration files, is small, whereas data can be large, almost arbitrarily so. Version control systems aren’t designed for files the size of gigabytes, much less terabytes, so storing them in these systems can grind their operations to a halt. Even smaller data sets can quickly balloon the size of a repository if every exploratory result or intermediate calculation is put into version control, so it’s better to keep them out entirely.
Depending on the needs of the project, there are two main categories of solutions. The first is to not version data at all. For smaller projects, whether in terms of the size of the data sets and codebase or the number of contributors, this isn’t as reckless as it may at first sound. A pipeline is essentially a recipe for generating some outputs from a set of inputs, so it should be possible to create any output on the fly to reflect changes in the code. For independent work with small data sets, this is often sufficient since there’s a single source of truth for any output, and if there’s any doubt about whether an output is up to date, it’s cheap to regenerate it. For teams, however, tracking the relationship between the state of a codebase and its outputs quickly becomes more complex. Using a shared file system to store outputs can alleviate some of these issues, though it can also introduce more configuration overhead for contributors or transfer charges on cloud services.
The second approach is to use a data version control system. To my knowledge, there are currently two main options: Git LFS and DVC. Both use a similar architecture where large files are tracked in repositories via lightweight pointers. Because the actual files are stored in and retrieved from an external database or drive, the Git repositories stay small. Comparing the two, Git LFS is more tightly integrated with Git, but its more limited scope makes it a better fit for versioning assets when developing traditional software. DVC, on the other hand, is explicitly designed for data science, including features for handling pipelines and integrating with cloud storage providers. Regardless of the exact solution used, versioning data at scale inevitably involves trade-offs between storage and compute costs. Tracking every exploratory analysis or intermediate calculation can easily snowball the storage burden of a project, especially when the underlying data sets are large. Consequently, not versioning outputs may be the better use of resources when they can be quickly regenerated or are unlikely to be used again. Furthermore, for projects whose purpose is pipeline development rather than the analysis of a specific data set, a more efficient approach from both a development and a storage standpoint is to use small test data sets. Ultimately, these decisions depend on the available resources and goals, so teams and organizations should implement data storage policies early to prevent the accumulation of data with unknown provenance and importance.
There are three types of code you write
Code for data science is generally more ad hoc and less general than code for systems or applications. This isn’t necessarily a bad thing, because data science is often more about analyzing a specific data set than developing standalone software. (Though the incorporation of software engineering practices like testing and documentation into data science workflows, especially for pipeline development, can blur this boundary somewhat.) However, even in the looser world of data science, I still find it convenient to divide code into three levels of increasing polish, which are also reflected in the previous project structure:
- Notebooks
- Scripts
- Source
The first of these are notebooks, which are great for initial explorations but less suited for formalizing code into reproducible analyses. Part of this stems from the design of notebooks as an inherently interactive and narrative way of coding and analyzing data. So while there are ways of running notebooks from the command line, their format encourages a non-linear style of coding that accumulates a lot of dead ends and scratch work. However, the real kicker is that notebooks don’t play nicely with version control systems. Because notebook code is stored alongside other metadata in a structured format (a JSON file in the case of Jupyter), the diffs between commits are often challenging to interpret. Furthermore, notebook formats also save outputs like plots and tables, which inflate the size of a repository and bog down the version control system (see the section on version control for more details). In the past I’ve used Git pre-commit hooks to automatically clear the output of Jupyter notebooks before committing them to a repository, but even so, the structured format of notebook files is an awkward fit for version control.
Instead, I recommend using notebooks for individual explorations or highly polished communications and refactoring the good parts into standalone scripts. Ideally these can then be executed from the command line, so a user or, better, a workflow manager can run them in sequence to regenerate the outputs of the entire pipeline. The goal is to make these scripts flexible enough to accommodate data sets of varying sizes but with the same overall format, so the pipeline will run smoothly if more (or different) data is dropped in. However, scripts are ultimately intended to produce specific outputs from specific inputs, so the final level of “source code,” conventionally named src/ in the project structure, is reserved for the most general-purpose code. In practice, this is usually a collection of reusable functions for common tasks across the project. I always write detailed documentation for any code in src/, and some modules I wrote for specific projects even reached a level of maturity where I was able to spin them out into standalone packages! As a final Python-specific note, to access code in src/, your computing environment must be configured so Python searches the code/ directory for a matching src package when executing import statements. There are a few ways of doing this, but the easiest is adding the code/ directory to the PYTHONPATH environment variable (see the section on requirements for more details).
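To sketch the distinction, a reusable, fully documented helper might live in code/src/, while the decision of which data to feed it lives in a script. The module, function, and file names below are hypothetical:

```python
# code/src/stats.py -- a hypothetical reusable module
import numpy as np

def bootstrap_mean(values, n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Return the mean of `values` and its bootstrapped standard error.

    General-purpose, documented helpers like this belong in src/; deciding which
    column of which data set to feed them belongs in a script.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    boots = [rng.choice(values, size=len(values), replace=True).mean() for _ in range(n_boot)]
    return float(values.mean()), float(np.std(boots))
```

```python
# e.g., inside code/scripts/fit_models.py, assuming code/ is on PYTHONPATH,
# which can be arranged in a shell with: export PYTHONPATH="$PYTHONPATH:$(pwd)/code"
import pandas as pd

from src.stats import bootstrap_mean

df = pd.read_csv("outputs/clean_data/measurements_clean.csv")  # hypothetical cleaned data
mean, se = bootstrap_mean(df["value"])
print(f"mean = {mean:.3f} ± {se:.3f}")
```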
Requirements are explicit
All code requires some kind of computing environment, one that at least runs your programming language of choice. However, projects in scientific computing and data science typically depend on a suite of other software as well. Often these are general-purpose libraries that provide plotting functionality, multidimensional arrays, or advanced mathematical operations, but sometimes they include packages that implement domain-specific analyses. Because these dependencies are not always apparent until the code actually runs (and even then the error messages can be cryptic), it’s essential to explicitly state a project’s requirements in its documentation, for the sake of anyone (including yourself) who may want to reproduce your analyses on a different machine down the road. Additionally, while the names of the dependencies are a good starting point, it’s better to state the specific version of each to ensure greater reproducibility and to avoid compatibility issues.
Managing the dependencies of a project can be a complex task, especially when those dependencies have other dependencies in turn. Fortunately, package managers automate the process of resolving these chains of dependencies and installing compatible versions. Many package managers can also install lists of dependencies exported from other environments, allowing their re-creation on other machines. A distinct but related tool is an environment manager. Frequently different projects require mutually incompatible versions of the same libraries, so environment managers facilitate the process of maintaining these distinct computing environments and switching between them. Sometimes package and environment managers are different tools. For example, Python comes pre-installed with a package manager called pip and a separate lightweight environment manager called venv. However, because pip can only install Python packages, many data scientists use Conda, which is a combined package and environment manager that supports both Python and R packages as well as a variety of other software commonly used in scientific computing. Finally, some dependencies don’t fit neatly into a standard package manager, for example when a project requires software that isn’t maintained in a package manager’s database. Other examples are “dependencies” in the form of configurations, whether those are system-wide environment variables or the specific settings for a team’s linter. Often well-written documentation can fill in these gaps, but for cases where speed and reproducibility are essential, tools like Docker or Singularity can capture entire computing environments as lightweight containers.
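For a small Python-based project of this kind, the requirements.txt in the structure above might look something like the following; the packages and version numbers are purely illustrative, and a conda environment file or a container image would serve the same purpose.

```text
# requirements.txt -- pinned versions for reproducibility (illustrative only)
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.4
scikit-learn==1.4.2

# Recreate the environment with, e.g.:
#   python -m venv .venv && source .venv/bin/activate
#   pip install -r requirements.txt
```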
Separate computation from visualization
Another convenient division in data science is between code that computes outputs and code that visualizes them. From an abstract standpoint, this falls under the broader design principle of separating concerns in a program. For example, visualization code only needs to know the structure of the data, not how it was computed, so separating the two increases the modularity of a pipeline, especially when a common set of visualizations is applied to different data sets. More concretely, though, code for computation and visualization is developed differently. Since computational steps are often resource-intensive, once the underlying computations in an analysis are finalized and applied to a data set, they are infrequently updated. Visualization steps, however, are highly dynamic. The inherently exploratory and creative nature of visualizing data contributes to this workflow, as the variables of interest and the way they are displayed may change dramatically as one’s understanding of a data set evolves. The other, less exciting, reason is that visualization is cheap but finicky. For example, it’s often fast and easy to generate a plot of a complete data set, but it’s much less straightforward to know in advance whether an axis label will fit in its allotted space or whether the automatic placement of a legend will obscure a key data point. There’s nothing worse than having to re-run a script for minutes or hours just to fix a spacing issue or typo, so it’s better to separate visualizations from resource-intensive computations. As a final note, some visualization algorithms, like t-SNE, may require significant computation when applied to large data sets, and in these cases, it may be necessary to refactor those visualizations into standalone scripts.
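As a minimal illustration of this split, a computation script can save its results under outputs/, and a separate plotting script can load them, so tweaking a label never triggers a re-fit. The file and column names below are hypothetical:

```python
# code/scripts/fit_models.py -- expensive computation, run infrequently
from pathlib import Path

import pandas as pd

OUTPUT_DIR = Path("outputs/fit_models")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_csv("outputs/clean_data/measurements_clean.csv")       # hypothetical cleaned data
group_means = df.groupby("group", as_index=False)["value"].mean()   # stand-in for real model fitting
group_means.to_csv(OUTPUT_DIR / "group_means.csv", index=False)
```

```python
# code/scripts/make_plots.py -- cheap visualization, tweak and re-run freely
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

OUTPUT_DIR = Path("outputs/make_plots")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

group_means = pd.read_csv("outputs/fit_models/group_means.csv")
ax = group_means.plot.bar(x="group", y="value", legend=False)
ax.set_ylabel("Mean value")  # label and layout tweaks never touch the fit
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "group_means.png", dpi=300)
```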