One of the most useful things I’ve learned during my PhD is how to use ducttape, a research workflow management system. Like many good software tools, the mindset behind ducttape is more powerful than the code itself. In this post, I’ll try to motivate the research workflow management mindset and then give you a sense of how ducttape solves the problems I present.

A Simple Experiment

Suppose we’re training a new machine learning model, and we’re given the following utilities:

  • - downloads training data from the internet.
  •, - two filtering programs based on different criteria.
  •,, - three data augmentation programs.
  • - trains a model, given some training data.

Our task is to determine which combination of data filtering and augmentation leads to the best model performance. For simplicity of exposition, we’ll assume that we can only choose one filtering program and one augmentation script to use. (We can’t use multiple augmentation scripts at the same time.)

The Most Naive Approach

The most naive approach is to manually try all the possible combinations of filters and augmentation scripts. After all, there’s only \(2 \times 3 = 6\) possible combinations. Here’s what our bash history might look like if we did this…

bash-3.2$ python data.txt
bash-3.2$ python data.txt filtered_data1.txt
bash-3.2$ python data.txt filtered_data2.txt
bash-3.2$ python filtered_data1.txt data_1_1.txt
bash-3.2$ python filtered_data1.txt data_1_2.txt
bash-3.2$ python filtered_data1.txt data_1_3.txt
bash-3.2$ python filtered_data2.txt data_2_1.txt
bash-3.2$ python filtered_data2.txt data_2_2.txt
bash-3.2$ python filtered_data2.txt data_2_3.txt
bash-3.2$ python data_1_1.txt model_1_1
bash-3.2$ python data_1_2.txt model_1_2
bash-3.2$ python data_1_3.txt model_1_3
bash-3.2$ python data_2_1.txt model_2_1
bash-3.2$ python data_2_2.txt model_2_2
bash-3.2$ python data_2_3.txt model_2_3

Obviously this is going to be tedious and error-prone. Personally, it took me three tries to type all these commands without making a typo and mixing up the numbers. And now we have a bunch of files lying around in our working directory that we probably won’t remember when we come back to this project next week.

This approach might work for a small, manageable number of experiments, but we can see how this approach could quickly become impractical, especially if we try to add more steps.

The Less Naive Approach

But the above approach is a strawman. Any decent programmer would probably do something smarter, like write a couple nested bash for-loops:

python data.txt
for X in 1 2; do
    python filter_$ data.txt filtered_data$X.txt
    for Y in 1 2 3; do
        python aug_$ filtered_data$X.txt data_$X_$Y.txt
        python data_$X_$Y.txt model_$X_$Y

Still, there’s a couple issues with this approach. First, it doesn’t give us very fine-grained control over which experiments get executed and when they get executed. Real-world workflows can have many steps each with their own configurable options. In a simple 6-step workflow with 3 options each, there’s \(3^6 = 729\) total experimental configurations. We probably don’t want to run all of those. Even if we do, we probably don’t want to come up with names for that many intermediate files or have them lying around unorganized on our filesystem.

Second, this approach is not very extensible. Let’s say I find out each data filtering program has an extra option, called “strictness,” which is set to “high” by default. If I want to run some experiments with strictness set to low, it’s non-obvious how to add that to the above bash script without wrecking my current results. You can do it, sure, but it will be painful. And every time someone wants to add another dimension to the experiments, the pain increases.

In summary, our ideal workflow looks something like the above, but with a few extra features:

  • Fine-grained control over which experiments run and the hardware used to execute each task.
  • Sensible automatic naming and organization of intermediate files.
  • Re-use of previous work when possible, and parallelization of work where possible
  • Easily extensible with new experiment dimensions/tasks without breaking results.
  • Easy to summarize results in tabular format .

Ducttape has all these nice features, which we’ll demonstrate in the next section.

Enter Ducttape

In ducttape, we organize our workflow as a directed acyclic graph. Each node in the graph is a bash script (called a task) which accepts filenames from previously completed tasks and optionally outputs files that can be consumed downstream. These input/output relationships form the edges of our graph. For example, here is a task which downloads the training data:

task download_data
> data {
  python $data

This task takes no input and outputs a single file, $data. Notice that $data is a bash variable, not an absolute path. This is because ducttape will automatically assign $data to a sensible location on the filesystem for us, depending on the current settings of the experiment dimensions.

Now that we have a task that provides the data, we can consume it in the filtering task.

task filter
< data=$data@download_data
> filtered_data 
:: filter_type=(FilterType: 1 2) {
  python filter_${filter_type}.py $data $filtered_data

This task displays all three possible argument types. Left angle brackets < specify input files from previous tasks, while right angle brackets > specify output files. (Similar to bash file pipes.) Double colons :: specify parameters, which are just bash string variables.

Our parameter $filter_type is assigned to be an experiment dimension, which is called a “branch” in ducttape. The syntax :: filter_type=(FilterType: 1 2) means that the bash variable $filter_type may be assigned a value of 1 or 2 at run-time depending on which experiment we’ve asked ducttape to run. Notice that even though this task has multiple experimental configurations, it always writes its output to the location specified by $filtered_data, which is set by ducttape to a sensible filename based on the current experiment configuration.

The task which augments the data is similar to the above:

task augment
< filtered_data=$filtered_data@filter
> augmented_data
:: aug_type=(AugType: 1 2 3) {
  python aug_${aug_type}.py $filtered_data $augmented_data

And our last task is to train our model:

task train_model
< augmented_data=$augmented_data@augment
> model {
  python $augmented_data $model

Finally, to run a particular set of experiments, we make a “plan” of execution:

plan main {
  reach train_model via (FilterType: 1) * (AugType: *)

This plan trains three models: all use the first filtering option and each uses a different augmentation option. We can easily extend this plan to target different tasks, or to take different paths through our workflow graph. If a branch is not specified, ducttape uses the “baseline” branch, which is the first option provided in the branch definition.

This is a fairly linear workflow, but many real-world tasks will have tasks which take input from many upstream tasks and provide files to many downstream tasks. You’ll notice that our workflow implementation is more verbose than our original bash script; however, all this boilerplate gives us the nice features we mentioned above, including automatic parallelization, assigning different tasks to different machines, and more.

Extending the Workflow

Supposing we ran the above experiments, we can go on to extend our workflow with new experiments without breaking our existing results.

Adding A New Augmentation Script

If we wrote a new augmentation script, this can easily be added to our workflow with one character change:

task augment
< filtered_data=$filtered_data@filter
> augmented_data
:: aug_type=(AugType: 1 2 3 4) {
  python aug_${aug_type}.py $filtered_data $augmented_data

Adding a New Filter Option

Similarly, if we wanted to add a new branch to specify the “strictness” of the filter, we could update the task like this:

task filter
< data=$data@download_data
> filtered_data 
:: filter_strictness=(FilterStrict: high low)
:: filter_type=(FilterType: 1 2) {
  python filter_${filter_type}.py $data $filtered_data --strictness=$filter_strictness

and then update our execution plan:

plan main {
  reach train_model via (FilterType: 1) * (AugType: *) * (FilterStrict: *)

This would not break any of our existing results. When we run this new plan, ducttape will assume that previous executions of the filter task were run with the strictness set to “high,” which is the baseline value for the branch.

Adding Evaluation

We can also add new tasks to our workflow which will re-use previous results:

task evaluate
< model=$model@train_model
> score {
  python $model $score


Ducttape is crufty: the latest commit was in 2015, and there are still some rough edges. That being said, it gets the job done. And since most experiments are short-lived, I’m not super worried about the tech-debt I might incur by using it. There are other workflow management frameworks out there, like Airflow and Luigi, but I’ve found those don’t have as good of a story for managing experimentation branches.

One other thing I don’t like is that it’s too easy to do the wrong thing with ducttape. For mildly complex complex workflows, it’s not immediately obvious what the right task/branch setup should be. This requires “ducttape zen,” which is discovered with time. In general, I think best practice is to implement more branches than you need and then trim down your execution space using lots of execution plans. I might talk about that in a later post.

There are also other features that I haven’t covered, such as package management, hardware configurations for each task, and summaries of results. If you’d like to learn more, feel free to read the tutorial. However, I believe this brief overview covers about 90% of my ducttape usage, and I hope it gives you a sense of the usefulness of research workflow management.