data_frame() is a nice way to create data frames. It encapsulates best practices for data frames:
It never changes the type of its inputs (i.e. no more stringsAsFactors = FALSE!)
data.frame(x = letters) %>% sapply(class)
#> x
#> "factor"
data_frame(x = letters) %>% sapply(class)
#> x
#> "character"
This makes it easier to use with list-columns:
data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
#> Source: local data frame [3 x 2]
#>
#> x y
#> 1 1 <int[5]>
#> 2 2 <int[10]>
#> 3 3 <int[20]>
List-columns are most commonly created by do(), but they can be useful to create by hand.
It never adjusts the names of variables:
data.frame(`crazy name` = 1) %>% names()
#> [1] "crazy.name"
data_frame(`crazy name` = 1) %>% names()
#> [1] "crazy name"
It evaluates its arguments lazily and in order:
data_frame(x = 1:5, y = x ^ 2)
#> Source: local data frame [5 x 2]
#>
#> x y
#> 1 1 1
#> 2 2 4
#> 3 3 9
#> 4 4 16
#> .. . ..
It adds the tbl_df class to its output so that if you accidentally print a large data frame you only get the first few rows.
data_frame(x = 1:5) %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
It never uses row.names(), because the whole point of tidy data is to store variables in a consistent way, so we shouldn’t put one variable in a special attribute.
It only recycles vectors of length 1. Recycling vectors of other lengths is a frequent source of bugs.
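To see why recycling longer vectors is risky, here is a small base R sketch (it deliberately uses data.frame() rather than data_frame(), and the values are made up for illustration). A length-2 vector is silently stretched to fill four rows, which is rarely what was intended:

```r
# Base R recycles a length-2 vector to fill 4 rows without any warning.
df <- data.frame(x = 1:4, y = c("a", "b"))
print(df)

# Lengths that do not divide evenly are rejected, so only some mistakes
# are caught. We capture the error instead of stopping the script.
res <- tryCatch(
  data.frame(x = 1:4, y = 1:3),
  error = function(e) "error"
)
print(res)
```

Restricting recycling to length-1 vectors turns the first case into an error as well.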
To complement data_frame(), dplyr provides as_data_frame() for coercing lists into data frames. It does two things:
Checks that the input list is valid for a data frame, i.e. that each element is named, is a 1d atomic vector or list, and all elements have the same length.
Sets the class and attributes of the list to make it behave like a data frame. This modification does not require a deep copy of the input list, so is very fast.
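The two steps above can be sketched in plain R. This is only an illustration under assumptions: list_to_df() is a hypothetical name, it is not dplyr's actual implementation, and it sets the class to plain "data.frame" rather than tbl_df so that it runs without dplyr loaded.

```r
# Hypothetical helper for illustration; not dplyr's actual implementation.
list_to_df <- function(x) {
  # Step 1: check that the list is valid for a data frame.
  if (!is.list(x) || length(x) == 0) {
    stop("`x` must be a non-empty list", call. = FALSE)
  }
  nms <- names(x)
  if (is.null(nms) || anyNA(nms) || any(nms == "")) {
    stop("every element must be named", call. = FALSE)
  }
  is_1d <- vapply(
    x,
    function(col) is.null(dim(col)) && (is.atomic(col) || is.list(col)),
    logical(1)
  )
  if (!all(is_1d)) {
    stop("every element must be a 1d atomic vector or list", call. = FALSE)
  }
  n <- unique(lengths(x))
  if (length(n) != 1) {
    stop("all elements must have the same length", call. = FALSE)
  }

  # Step 2: set attributes in place of building a new object column by column.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- "data.frame"  # assumption: dplyr would also add "tbl_df" and "tbl"
  x
}

df <- list_to_df(list(x = 1:3, y = c("a", "b", "c")))
str(df)
```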
This is much simpler than as.data.frame(). It’s hard to explain precisely what as.data.frame() does, but it’s similar to do.call(cbind, lapply(x, data.frame)), i.e. it coerces each component to a data frame and then cbind()s them all together. Consequently as_data_frame() is much faster than as.data.frame():
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
as_data_frame(l2),
as.data.frame(l2)
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> as_data_frame(l2) 97.575 107.0915 114.5115 128.54 345.992 100
#> as.data.frame(l2) 1374.019 1429.3585 1457.3940 1577.61 3769.304 100
The speed of as.data.frame() is not usually a bottleneck in interactive use, but can be a problem when combining thousands of messy inputs into one tidy data frame.
One of the reasons that dplyr is fast is that it is very careful about when it makes copies of columns. This section describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.
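The first tool we’ll use is dplyr::location(). It tells us three things about a data frame: the memory address of the data frame itself, of each of its variables, and of each of its attributes.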
location(iris)
#> <0x7fc4c58a8410>
#> Variables:
#> * Sepal.Length: <0x7fc4c5103e00>
#> * Sepal.Width: <0x7fc4c5101800>
#> * Petal.Length: <0x7fc4c5100000>
#> * Petal.Width: <0x7fc4c503fc00>
#> * Species: <0x7fc4c275d280>
#> Attributes:
#> * names: <0x7fc4c58a8478>
#> * row.names: <0x7fc4c275e2a0>
#> * class: <0x7fc4c50cae28>
It’s useful to know the memory address, because if the address changes, then you know R has made a copy. Copies are bad because it takes time to copy a vector. This isn’t usually a bottleneck if you have a few thousand values, but if you have millions or tens of millions it starts to take up a significant amount of time. Unnecessary copies are also bad because they take up memory.
R tries to avoid making copies where possible. For example, if you just assign iris to another variable, it continues to point to the same location:
iris2 <- iris
location(iris2)
#> <0x7fc4c58a8410>
#> Variables:
#> * Sepal.Length: <0x7fc4c5103e00>
#> * Sepal.Width: <0x7fc4c5101800>
#> * Petal.Length: <0x7fc4c5100000>
#> * Petal.Width: <0x7fc4c503fc00>
#> * Species: <0x7fc4c275d280>
#> Attributes:
#> * names: <0x7fc4c58a8478>
#> * row.names: <0x7fc4c2657720>
#> * class: <0x7fc4c50cae28>
Rather than carefully comparing long memory locations, we can instead use the dplyr::changes() function to highlight changes between two versions of a data frame. This shows us that iris and iris2 are identical: both names point to the same location in memory.
changes(iris2, iris)
#> <identical>
What do you think happens if you modify a single column of iris2? In R 3.1.0 and above, R knows enough to only modify one column and leave the others pointing to the existing location:
iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
#> Changed variables:
#> old new
#> Sepal.Length 0x7fc4c5103e00 0x7fc4c38a1200
#>
#> Changed attributes:
#> old new
#> row.names 0x7fc4c26693d0 0x7fc4c2669650
(This was not the case prior to R 3.1.0: R created a deep copy of the entire data frame.)
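The same behaviour can be cross-checked in base R, without location() or changes(). The sketch below is an assumption-laden illustration: addr() is a hypothetical helper built on tracemem(), which only works when R was compiled with memory profiling (hence the capabilities("profmem") guard), and it assumes R 3.1.0 or later.

```r
# Sketch only: addr() is a hypothetical helper, not part of dplyr or base R.
# tracemem() returns the address of an object as a string; we untrace
# immediately so that later modifications do not print trace messages.
addr <- function(x) {
  a <- tracemem(x)
  untracemem(x)
  a
}

if (capabilities("profmem")) {
  df1 <- data.frame(a = runif(5), b = runif(5), c = runif(5))

  # Plain assignment: both names refer to the same object.
  df2 <- df1
  same_object <- identical(addr(df1), addr(df2))

  # Modify one column, then compare column addresses one by one.
  df2$a <- df2$a * 2
  shared <- vapply(
    names(df1),
    function(nm) identical(addr(df1[[nm]]), addr(df2[[nm]])),
    logical(1)
  )

  print(same_object)
  print(shared)
}
```

Columns whose entry in shared is TRUE still point at the original vector; only the modified column should have moved.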
dplyr is similarly smart:
iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
#> Changed variables:
#> old new
#> Sepal.Length 0x7fc4c385b400 0x7fc4c5103e00
#>
#> Changed attributes:
#> old new
#> class 0x7fc4c2c40858 0x7fc4c50cae28
#> names 0x7fc4c3a53ef0 0x7fc4c58a8478
#> row.names 0x7fc4c266c940 0x7fc4c266cbc0
It’s smart enough to create only one new column: all the other columns continue to point at their old locations. You might notice that the attributes have still been copied. This has little impact on performance because the attributes are usually short vectors and copying makes the internal dplyr code considerably simpler.
dplyr never makes copies unless it has to:
tbl_df() and group_by() don’t copy columns
select() never copies columns, even when you rename them
mutate() never copies columns, except when you modify an existing column
arrange() must copy because you’re changing the order of every column. This is an expensive operation for big data, but you can generally avoid it using the order argument to window functions
summarise() creates new data, but it’s usually at least an order of magnitude smaller than the original data.
This means that dplyr lets you work with data frames with very little memory overhead.
data.table takes this idea one step further than dplyr, and provides functions that modify a data table in place. This avoids the need to copy the pointers to existing columns and attributes, and provides a speed-up when you have many columns. dplyr doesn’t do this with data frames (although it could) because I think it’s safer to keep data immutable: all dplyr data frame methods return a new data frame, even while they share as much data as possible.