Organizing Digital Research Projects

Jonas Kreutzer
@pjkreutzer

2023-09-29

What to expect!

  1. repeatable folder structure
  2. sane filenames
  3. data handling
  4. tool agnostic

Why bother?

01. Use Default Folder Structures

.
├── analysis
├── data
│   ├── modified
│   └── raw
├── docs
└── results
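
A minimal sketch of how this structure could be created in one step, so that every project starts from the same layout. The function name `create_project` and the constant `DEFAULT_FOLDERS` are illustrative and not part of the talk.

```python
from pathlib import Path

# Folder names taken from the tree above.
DEFAULT_FOLDERS = ["analysis", "data/modified", "data/raw", "docs", "results"]


def create_project(root):
    """Create the default folder structure under `root` and return the created paths."""
    root = Path(root)
    created = []
    for folder in DEFAULT_FOLDERS:
        path = root / folder
        # parents=True creates `data` before `data/raw`; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project("my_project")` twice is harmless, because existing folders are left untouched.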

02. Use Good File Names

👩‍💻 Human readable: use descriptive names

🤖 Machine readable: use slug naming with regex in mind

📥 Play well with default ordering

Start numbering at 01 and write dates as YYYYMMDD
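
One way these three rules could be turned into helpers, as a sketch only. The function names and the exact slug rule (lowercase, underscores inside the description, a hyphen after the prefix) are assumptions chosen to resemble the examples in this talk.

```python
import re
from datetime import date


def slugify(text):
    """Lowercase `text` and replace runs of non-alphanumeric characters with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def numbered_name(index, description, extension):
    """Build a name such as 01-working_paper.docx; zero-padding keeps 02 before 10."""
    return f"{index:02d}-{slugify(description)}.{extension}"


def dated_name(day, description, extension):
    """Build a name such as 20230929-raw_data.csv; YYYYMMDD sorts chronologically."""
    return f"{day:%Y%m%d}-{slugify(description)}.{extension}"
```

Because the prefixes have a fixed width, a plain alphabetical sort in any file browser matches the intended order.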

File Names Examples

😭

old data.csv

data_v2thisisthemostimportantworkingversinofourdataneverdelete _update second author.xslx

figure.png

Paper version1_copy_new_reviewd_comments.docx

🥰

old_data.csv

20230929-raw_data.xlsx

Fig01-scatterplot_happy.png

01-working_paper.docx
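
"Regex in mind" pays off once file names have to be read back by a script. The sketch below assumes a stricter convention than the slide shows: a two-digit counter or a YYYYMMDD date, a hyphen, a lowercase slug, and an extension. The pattern and the name `parse_name` are illustrative.

```python
import re

# "<prefix>-<slug>.<ext>", where prefix is a date (YYYYMMDD) or a counter (01, 02, ...).
NAME_PATTERN = re.compile(
    r"^(?P<prefix>\d{8}|\d{2})-(?P<slug>[a-z0-9_]+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_name(filename):
    """Return the parts of a well-formed file name as a dict, or None if it does not match."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None
```

Names with spaces, such as `old data.csv`, return `None`, which makes it easy to list every file that breaks the convention.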

03. Treat Your Data Like the Treasure it Is

Raw data is READ ONLY
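
Read-only can be enforced by the file system rather than by discipline alone. A sketch, assuming the raw files live in one folder such as `data/raw`. The helper name `make_read_only` is hypothetical. On Windows, `chmod` only toggles the read-only flag.

```python
import stat
from pathlib import Path

# Write permission for owner, group, and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(raw_dir):
    """Remove write permission from every file below `raw_dir`; return the files changed."""
    changed = []
    for path in sorted(Path(raw_dir).rglob("*")):
        if path.is_file():
            mode = stat.S_IMODE(path.stat().st_mode)
            # Clear the write bits and keep all other permission bits as they are.
            path.chmod(mode & ~WRITE_BITS)
            changed.append(path)
    return changed
```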

Treat any output as disposable
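
If outputs are disposable, deleting them should be a routine step that never touches raw data. A sketch, assuming that everything under `data/modified` and `results` can be regenerated by the analysis scripts. The names `DISPOSABLE` and `clean_outputs` are illustrative.

```python
import shutil
from pathlib import Path

# Folders whose content can be regenerated from data/raw by rerunning the analysis.
DISPOSABLE = ["data/modified", "results"]


def clean_outputs(root):
    """Delete and recreate every disposable folder under `root`; data/raw is never touched."""
    for folder in DISPOSABLE:
        path = Path(root) / folder
        shutil.rmtree(path, ignore_errors=True)
        path.mkdir(parents=True, exist_ok=True)
```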

Separate functionality from execution
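
One common reading of this rule in Python: functions only compute and return values, and a small `main` decides what to run and with which inputs. The example data and the function `mean_by_group` are made up for illustration.

```python
import statistics


def mean_by_group(rows):
    """Functionality: return the mean value per group for (group, value) pairs."""
    groups = {}
    for group, value in rows:
        groups.setdefault(group, []).append(value)
    return {group: statistics.mean(values) for group, values in groups.items()}


def main():
    """Execution: choose the inputs, call the function, report the result."""
    rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]  # placeholder data
    print(mean_by_group(rows))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Other scripts and tests can then import `mean_by_group` without triggering the run.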

Analysis as a DAG
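
A DAG (directed acyclic graph) here means that each analysis step names the steps it depends on, and nothing depends on itself. A sketch using the standard library module `graphlib` (Python 3.9+). The step names and the dependencies between them are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can run.
DEPENDENCIES = {
    "01-clean_data": set(),
    "02-aggregate_data": {"01-clean_data"},
    "03-run_regressions": {"02-aggregate_data"},
    "04-create_figures": {"02-aggregate_data"},
}


def run_order(dependencies):
    """Return the steps in an order that respects all dependencies.

    Raises graphlib.CycleError if the steps depend on each other in a circle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

A build tool such as Make expresses the same idea with targets and prerequisites.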

Document what you do

Recap

Does not spark joy

../project_name
├── 9 copied from elsewhere
│   ├── data
│   ├── old data.csv
│   └── unusedbackup.txt
├── Figure.jpg
├── Paper version1_copy_new_reviewd_comments.docx
├── cryptic folder name
│   ├── data_v2thisisthemostimportantworkingversinofourdataneverdelete _update second author.xslx
│   └── weird sub Folder
│       └── Nested fun
│           └── Some data and Analysis(copy).xlsx
├── figure.png
├── figure_new_colors.jpg
└── paper_version_Figure.png

Hopefully sparks a bit of joy

.
├── 01-working_paper.docx
├── Makefile
├── README.md
├── analysis
│   ├── 01-clean_data.py
│   ├── 02-aggregate_data.py
│   ├── 03-run_regressions.do
│   └── 04-create_figures.R
├── data
│   ├── modified
│   └── raw
│       ├── 20230929-raw_data.xlsx
│       └── old_data.csv
├── project_docs
└── results
    └── figures
        └── Fig01-scatterplot_happy.png

Nice, but I do not want to remember all this / do this every time

Use cookiecutters

cookiecutter gh:pjkreutzer/cookiecutter-templates --directory="simple_research"

Resources

Much of the content discussed in this talk can be found in great detail at The Turing Way – Guide for Project Design.

readme.so for quickly creating README.md.

Patrick Mineault’s Good Research Code Handbook is a great resource for more computationally focused workflows, as is Nice R Code.

Cookiecutter is a Python tool for quickly setting up consistent project structures. Cookiecutter Data Science is a great template to adapt to the needs of economic history research.

Get this presentation

bit.ly/sehm-digital_projects

Help, I am overwhelmed! This is too much. My collaborators will never...

  1. Start small (I suggest with project structure)

  2. Improve one file name at a time

  3. Set and communicate expectations clearly

  4. Fix small mistakes early

👋 @pjkreutzer