Getting started with vroom

The vroom package contains one function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.

The most common type of delimited file is the CSV (Comma Separated Values) file, which typically has a .csv suffix.

reading files

To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter is guessed automatically if it is a common one. If the guessing fails, or you are using a less common delimiter, specify it with the delim argument, for example delim = ",".

We have included an example CSV file in the vroom package. Access it with vroom_example("mtcars.csv").

library(vroom)

# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] ".../inst/extdata/mtcars.csv"

# Read the file. By default, vroom guesses the delimiter automatically.
vroom(file)
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
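
For a less common delimiter, the delim argument described above does the work. The following is a minimal sketch, assuming a pipe-delimited file; the file and its contents are invented for illustration and are not shipped with vroom.

```r
library(vroom)

# Write a small pipe-delimited file to a temporary location
# (the contents are made up for this example).
path <- tempfile(fileext = ".txt")
writeLines(
  c("model|mpg|cyl",
    "Mazda RX4|21|6",
    "Datsun 710|22.8|4"),
  path
)

# Name the delimiter explicitly rather than relying on guessing.
cars <- vroom(path, delim = "|")
names(cars)
```

Passing delim this way avoids depending on the guesser when the separator is unusual.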

reading multiple files

If you are reading a set of (CSV or otherwise delimited) files which all have the same columns, you can pass all the filenames directly to vroom() and it will combine them into one result.

First we will create some files to read by splitting the mtcars dataset by number of cylinders. (It is OK if you don't currently understand this code.)

mt <- tibble::rownames_to_column(mtcars, "model")
purrr::iwalk(
  split(mt, mt$cyl),
  ~ readr::write_csv(.x, glue::glue("mtcars_{.y}.csv"))
)

We can then efficiently read them into one result by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "mtcars*csv")
files
#> mtcars_4.csv mtcars_6.csv mtcars_8.csv
vroom(files)
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2 Merc 2…  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3 Merc 2…  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> # … with 29 more rows

Often the filename or the directory the files are in contains information about the contents. In this case the id parameter can be used to add an extra column (here named path) to the result that contains the file path.

vroom(files, id = "path")
#> # A tibble: 32 x 13
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2 Merc 2…  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3 Merc 2…  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> # … with 29 more rows, and 1 more variable: path <chr>
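
Once the path column is available, information encoded in the file name can be recovered with ordinary string handling. The sketch below is one way to do that, assuming files named mtcars_<cyl>.csv as above; it builds the files with base R in a temporary directory, and source_cyl is a column name chosen for this example.

```r
library(vroom)

# Create one CSV per cylinder count in a temporary directory (base R only).
dir <- tempfile("mtcars_split_")
dir.create(dir)
mt <- cbind(model = rownames(mtcars), mtcars)
for (cyl in unique(mt$cyl)) {
  write.csv(
    mt[mt$cyl == cyl, ],
    file.path(dir, paste0("mtcars_", cyl, ".csv")),
    row.names = FALSE
  )
}

files <- list.files(dir, pattern = "^mtcars_.*\\.csv$", full.names = TRUE)
cars <- vroom(files, id = "path")

# Pull the number between "mtcars_" and ".csv" out of each file name.
cars$source_cyl <- as.integer(
  sub("^mtcars_([0-9]+)\\.csv$", "\\1", basename(cars$path))
)
table(cars$source_cyl)
```

Here the recovered value duplicates the cyl column, which makes it easy to check; in practice the file name would usually carry something the data does not, such as a date or a site code.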

reading compressed files

vroom reads zip, gz, bz2 and xz compressed files automatically. Just pass the filename of the compressed file to vroom.

file <- vroom_example("mtcars.csv.gz")

vroom(file)
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
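
To try this with a file of your own, the following sketch writes a gzip-compressed CSV using a base R gzfile() connection and reads it straight back. The temporary file is an assumption of the example; any path ending in .gz should behave the same way.

```r
library(vroom)

# Write mtcars to a gzip-compressed CSV through a gzfile() connection.
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
write.csv(cbind(model = rownames(mtcars), mtcars), con, row.names = FALSE)
close(con)

# vroom is given only the file name; no separate decompression step is needed.
cars <- vroom(path)
dim(cars)
```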

reading remote files

vroom can also read files from the internet if you pass it the URL of the file.

file <- "https://raw.githubusercontent.com/jimhester/vroom/master/inst/extdata/mtcars.csv"
vroom(file)

It can even read gzipped files from the internet (although currently not the other compressed formats).

file <- "https://raw.githubusercontent.com/jimhester/vroom/master/inst/extdata/mtcars.csv.gz"
vroom(file)
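
For remote files in one of the other compressed formats, one possible workaround is to download the file first and read the local copy. The helper below is a hypothetical sketch, not part of vroom: vroom_downloaded() and its fileext argument are names invented here, and the URL in the usage comment is a placeholder.

```r
library(vroom)

# Hypothetical helper: download a remote file to a temporary path that keeps
# its extension, then read the local copy with vroom().
vroom_downloaded <- function(url, fileext, ...) {
  local_copy <- tempfile(fileext = fileext)
  # mode = "wb" keeps the compressed bytes intact on every platform.
  download.file(url, local_copy, mode = "wb", quiet = TRUE)
  vroom(local_copy, ...)
}

# Usage sketch with a placeholder URL (not a real file):
# cars <- vroom_downloaded("https://example.com/mtcars.csv.bz2", ".csv.bz2")
```

Keeping the original extension on the temporary file matters, since the local read relies on the same automatic handling of compressed files described earlier.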

column skipping

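vroom uses the col_keep and col_skip parameters to control which columns to keep or skip in the output. Both parameters accept input in three different ways: a character vector of column names, a numeric vector of column positions, or a logical vector with one value per column.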

vroom(file, col_keep = c("model", "cyl", "gear"))
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_keep = c(1, 3, 11))
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_keep = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows

col_skip accepts the same kinds of input, but drops the selected columns instead of keeping them.

vroom(file, col_skip = 5:12)
#> # A tibble: 32 x 4
#>   model           mpg   cyl  disp
#>   <chr>         <dbl> <dbl> <dbl>
#> 1 Mazda RX4      21       6   160
#> 2 Mazda RX4 Wag  21       6   160
#> 3 Datsun 710     22.8     4   108
#> # … with 29 more rows
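
The numeric and logical forms do not have to be typed out by hand. As a sketch in base R, both can be derived from the column names; cols and wanted are names chosen for this example, and the column list is copied from the output shown earlier.

```r
# Column names of the example file, as shown in the output above.
cols <- c("model", "mpg", "cyl", "disp", "hp", "drat",
          "wt", "qsec", "vs", "am", "gear", "carb")
wanted <- c("model", "cyl", "gear")

# Numeric positions of the wanted columns.
idx <- match(wanted, cols)
idx

# Logical vector with one entry per column.
keep <- cols %in% wanted
keep

# Either could then be passed on, e.g. vroom(file, col_keep = keep).
```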

column types

vroom guesses the data types of columns as they are read. Sometimes, however, it is necessary to change the type of one or more columns.

Each column type has a col_*() specification and a single-letter abbreviation. The ones used in this section are col_integer() ("i"), col_factor() ("f") and col_skip() ("_").

You can tell vroom which column types to use with the col_types argument in a number of ways.

If you only need to override a single column, the most concise way is to use a named vector.

# read the 'hp' column as an integer
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i"))
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also skip reading the 'cyl' column
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also read the 'gear' column as a factor
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_", gear = "f"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

Alternatively, you can use the col_*() functions in a list.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows
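
To check that the two styles give the same result, the column classes of both can be compared. This is a small sketch using the example file from the package; by_letter and by_function are names chosen here.

```r
library(vroom)

path <- vroom_example("mtcars.csv")

# Same specification, written with abbreviations and with col_*() functions.
by_letter <- vroom(path, col_types = c(hp = "i", cyl = "_", gear = "f"))
by_function <- vroom(
  path,
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)

# Compare the (first) class of every column in the two results.
classes_letter <- vapply(by_letter, function(x) class(x)[1], character(1))
classes_function <- vapply(by_function, function(x) class(x)[1], character(1))
identical(classes_letter, classes_function)
```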

This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows
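
Supplying the levels also fixes their order, which can be inspected with levels(). A brief sketch, again using the packaged example file:

```r
library(vroom)

path <- vroom_example("mtcars.csv")

# Levels supplied up front: the factor uses exactly this set, in this order.
cars <- vroom(
  path,
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
levels(cars$gear)

# The explicit order also controls how summaries such as table() are laid out.
table(cars$gear)
```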

Further reading

vignette("benchmarks") discusses the performance of vroom, how it compares to alternatives, and how it achieves its results.