The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

The tidy tools manifesto

Hadley Wickham

2023-02-21

This document lays out the consistent principles that unify the packages in the tidyverse. The goal of these principles is to provide a uniform interface so that tidyverse packages work together naturally, and once you’ve mastered one, you have a head start on mastering the others.

This is my first attempt at writing down these principles. That means that this manifesto is both aspirational and likely to change heavily in the future. Currently no packages precisely meet the design goals, and while the underlying ideas are stable, I expect their expression in prose will change substantially as I struggle to make explicit my process and thinking.

There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn’t make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.

There are four basic principles to a tidy API:

  1. Reuse existing data structures.

  2. Compose simple functions with the pipe.

  3. Embrace functional programming.

  4. Design for humans.

Reuse existing data structures

Where possible, re-use existing data structures, rather than creating custom data structures for your own package. Generally, I think it’s better to prefer common existing data structures over custom data structures, even if slightly ill-fitting.

Many R packages (e.g. ggplot2, dplyr, tidyr) work with rectangular datasets made up of observations and variables. If this is true for your package, work with data in either a data frame or tibble. Assume the data is tidy, with variables in the columns, and observations in the rows (see Tidy Data for more details).

Some packages work at a lower level, focussing on a single type of variable. For example, stringr for strings, lubridate for date/times, and forcats for factors. Generally prefer existing base R vector types, but when this is not possible, create your own using an S3 class built on top of an atomic vector or list.

If you need “non-standard scoping”, where you refer to variables inside a data frame as if they are in the global environment, prefer tidy evaluation over non-standard evaluation. See the Data frame functons section in R4DS and the metaprogramming chapters in Advanced R for more details.

Compose simple functions with the pipe

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.

— Hal Abelson

A powerful strategy for solving complex problems is to combine many simple pieces. Each piece should be easily understood in isolation, and have a standard way to combine with other pieces. In R, this strategy plays out by composing single functions with the pipe. The pipe, %>%, is a common composition tool that works across all packages.

Some things to bear in mind when writing functions:

An advantage of a pipeable API is that it is not compulsory: if you do not like using the pipe, you can compose functions in whatever way you prepare. Compare this to an operator-overloading approach (such as the + in ggplot2), or an object-composition approach (such as the ... approach in httr).

Embrace functional programming

R is a functional programming language; embrace it, don’t fight it. If you’re familiar with an object-oriented language like Python or C#, this is going to take some adjustment. But in the long run you will be much better off working with the language, rather than fighting it.

Generally, this means you should favour:

Design for humans

Programs must be written for people to read, and only incidentally for machines to execute.

— Hal Abelson

Design your API primarily so that it is easy to use by humans. Computer efficiency is a secondary concern because the bottleneck in most data analysis is thinking time, not computing time.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.