The vroom package contains one main function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.
The most common type of delimited file is the CSV (Comma Separated Values) file, which typically has a .csv suffix.
This vignette covers the following topics:
To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter will be guessed automatically if it is a common one. If the guessing fails, or you are using a less common delimiter, specify it with the delim parameter (e.g. delim = ",").
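As a minimal sketch of passing delim explicitly, the snippet below writes a small file that uses ~ as its separator and reads it back. The file contents and column names are made up for illustration.

```r
library(vroom)

# Hypothetical data: a tiny file delimited by "~" (contents are illustrative only)
path <- tempfile(fileext = ".txt")
writeLines(c("id~score", "1~10", "2~20"), path)

# Name the delimiter explicitly rather than relying on guessing
dat <- vroom(path, delim = "~", col_types = c(id = "i", score = "i"))
dat
```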
We have included an example CSV file in the vroom package for use in examples and tests. Access it with vroom_example("mtcars.csv").
# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] "/private/var/folders/dt/r5s12t392tb5sk181j3gs4zw0000gn/T/RtmpMIeGNk/Rinstcdb1feece36/vroom/extdata/mtcars.csv"
# Read the file, by default vroom will guess the delimiter automatically.
vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
If you are reading a set of files which all have the same columns, you can pass the filenames directly to vroom() and it will combine them into one result.
First we will create some files to read by splitting the mtcars dataset by number of cylinders (it is OK if you don’t currently understand this code).
mt <- tibble::rownames_to_column(mtcars, "model")
purrr::iwalk(
  split(mt, mt$cyl),
  ~ vroom_write(.x, glue::glue("mtcars_{.y}.csv"), "\t")
)
We can then efficiently read them into one table by passing the filenames directly to vroom.
files <- fs::dir_ls(glob = "mtcars*csv")
files
#> mtcars_4.csv mtcars_6.csv mtcars_8.csv
vroom(files)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 Merc 2… 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 Merc 2… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> # … with 29 more rows
Often the filename or the directory where the files are stored contains information. In this case the id parameter can be used to add an extra column to the result holding the full path to each file (named path in this example).
vroom(files, id = "path")
#> Observations: 32
#> Variables: 13
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 13
#> path model mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 mtcars… Dats… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 2 mtcars… Merc… 24.4 4 147. 62 3.69 3.19 20 1 0 4
#> 3 mtcars… Merc… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
#> # … with 29 more rows, and 1 more variable: carb <dbl>
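Once the path is available as a column, information encoded in the file name can be pulled out with ordinary string handling. The following is a rough sketch using made-up files and a hypothetical group_<label>.tsv naming scheme; only the id argument comes from the text above.

```r
library(vroom)

# Write two small illustrative files whose names encode a group label
dir <- tempfile("id-demo-")
dir.create(dir)
writeLines(c("x\ty", "1\ta"), file.path(dir, "group_A.tsv"))
writeLines(c("x\ty", "2\tb"), file.path(dir, "group_B.tsv"))

files <- sort(list.files(dir, pattern = "\\.tsv$", full.names = TRUE))
dat <- vroom(files, id = "path", delim = "\t", col_types = c(x = "i", y = "c"))

# Recover the label from the file name stored in the id column
dat$group <- sub("^group_(.*)\\.tsv$", "\\1", basename(dat$path))
dat
```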
vroom supports reading zip, gz, bz2 and xz compressed files automatically; just pass the filename of the compressed file to vroom.
file <- vroom_example("mtcars.csv.gz")
vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
vroom can read files from the internet as well by passing the URL of the file to vroom.
It can even read gzipped files from the internet (although currently not the other compressed formats).
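Reading from a URL looks the same as reading a local file. The sketch below uses a placeholder address (example.com is not a real data source) and a hypothetical helper, read_remote(), that is defined but not called, so nothing is downloaded until you supply a real URL.

```r
library(vroom)

# Hypothetical URL used only for illustration; substitute a real address
csv_url <- "https://example.com/data/mtcars.csv"

# Hypothetical helper: wraps vroom() so no download happens until it is called
read_remote <- function(url) {
  vroom(url)
}

# read_remote(csv_url)
# A gzipped remote file would be read the same way:
# read_remote(paste0(csv_url, ".gz"))
```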
vroom provides the same interface for column selection and renaming as dplyr::select(). This provides very flexible and readable selections.
file <- vroom_example("mtcars.csv.gz")
vroom(file, col_select = c(model, cyl, gear))
#> Observations: 32
#> Variables: 3
#> chr [1]: model
#> dbl [2]: cyl, gear
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 3
#> model cyl gear
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 6 4
#> 2 Mazda RX4 Wag 6 4
#> 3 Datsun 710 4 4
#> # … with 29 more rows
Columns can also be selected by position with a numeric vector of column indexes, e.g. c(1, 2, 5).
vroom(file, col_select = c(1, 3, 11))
#> Observations: 32
#> Variables: 3
#> chr [1]: model
#> dbl [2]: cyl, gear
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 3
#> model cyl gear
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 6 4
#> 2 Mazda RX4 Wag 6 4
#> 3 Datsun 710 4 4
#> # … with 29 more rows
vroom(file, col_select = starts_with("d"))
#> Observations: 32
#> Variables: 2
#> dbl [2]: disp, drat
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 2
#> disp drat
#> <dbl> <dbl>
#> 1 160 3.9
#> 2 160 3.9
#> 3 108 3.85
#> # … with 29 more rows
vroom(file, col_select = list(car = model, everything()))
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
A fixed width file can be a very compact representation of numeric data. Unfortunately, it’s also painful because you need to describe the length of every field. vroom aims to make it as easy as possible by providing a number of different ways to describe the field structure. Use vroom_fwf() in conjunction with one of the following helper functions to read the file.
fwf_sample <- vroom_example("fwf-sample.txt")
cat(readLines(fwf_sample))
#> John Smith WA 418-Y11-4111 Mary Hartford CA 319-Z19-4341 Evan Nolan IL 219-532-c301
fwf_empty() - Guess based on the position of empty columns.
vroom_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#> Observations: 3
#> Variables: 4
#> chr [4]: first, last, state, ssn
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 4
#> first last state ssn
#> <chr> <chr> <chr> <chr>
#> 1 John Smith WA 418-Y11-4111
#> 2 Mary Hartford CA 319-Z19-4341
#> 3 Evan Nolan IL 219-532-c301
fwf_widths() - Use a user-provided set of field widths.
vroom_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#> Observations: 3
#> Variables: 3
#> chr [3]: name, state, ssn
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 3
#> name state ssn
#> <chr> <chr> <chr>
#> 1 John Smith WA 418-Y11-4111
#> 2 Mary Hartford CA 319-Z19-4341
#> 3 Evan Nolan IL 219-532-c301
fwf_positions() - Use user-provided sets of start and end positions.
vroom_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))
#> Observations: 3
#> Variables: 2
#> chr [2]: name, ssn
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 2
#> name ssn
#> <chr> <chr>
#> 1 John Smith 418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan 219-532-c301
fwf_cols() - Use user-provided named widths.
vroom_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
#> Observations: 3
#> Variables: 3
#> chr [3]: name, state, ssn
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 3
#> name state ssn
#> <chr> <chr> <chr>
#> 1 John Smith WA 418-Y11-4111
#> 2 Mary Hartford CA 319-Z19-4341
#> 3 Evan Nolan IL 219-532-c301
fwf_cols() - Use user-provided named pairs of positions.
vroom_fwf(fwf_sample, fwf_cols(name = c(1, 20), ssn = c(30, 42)))
#> Observations: 3
#> Variables: 2
#> chr [2]: name, ssn
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 3 x 2
#> name ssn
#> <chr> <chr>
#> 1 John Smith 418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan 219-532-c301
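Widths and positions describe the same layout in two ways. As a small base R sketch of the arithmetic (not vroom's internal representation), the widths used with fwf_widths() above can be converted to 1-based, inclusive start and end positions of the kind passed to fwf_positions():

```r
# Field widths as passed to fwf_widths() in the examples above
widths <- c(name = 20, state = 10, ssn = 12)

# 1-based, inclusive start and end positions implied by those widths
end <- cumsum(widths)
start <- end - widths + 1

data.frame(field = names(widths), start = unname(start), end = unname(end))
```

Describing fields by position rather than width makes it possible to skip fields, as the fwf_positions() example does by leaving out the state column.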
vroom guesses the data types of columns as they are read; however, sometimes it is necessary to change the type of one or more columns.
The available specifications are (with single letter abbreviations in quotes):
col_logical() ‘l’, containing only T, F, TRUE, FALSE, 1 or 0.
col_integer() ‘i’, integer values.
col_double() ‘d’, floating point values.
col_number() ‘n’, numbers containing the grouping_mark.
col_date(format = "") ‘D’, with the locale’s date_format.
col_time(format = "") ‘t’, with the locale’s time_format.
col_datetime(format = "") ‘T’, ISO8601 date times.
col_factor(levels, ordered) ‘f’, a fixed set of values.
col_character() ‘c’, everything else.
col_skip() ‘_’ or ‘-’, don’t import this column.
col_guess() ‘?’, parse using the “best” type based on the input.
You can tell vroom what column types to use with the col_types argument in a number of ways.
If you only need to override a single column, the most concise way is to use a named vector.
# read the 'hp' column as an integer
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i"))
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# also skip reading the 'cyl' column
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_"))
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# also read the gears as a factor
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_", gear = "f"))
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
However, you can also use the col_*() functions in a list.
vroom(
  vroom_example("mtcars.csv"),
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.
vroom(
  vroom_example("mtcars.csv"),
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
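The list form also works for the other specifications listed earlier. As a minimal sketch, the snippet below reads a made-up tab-delimited file in which amounts are written with "," as the grouping mark, and parses that column with col_number(). It assumes the default locale, whose grouping mark is a comma.

```r
library(vroom)

# Illustrative tab-delimited data where amounts use "," as the grouping mark
path <- tempfile(fileext = ".tsv")
writeLines(c("item\tamount", "a\t1,234", "b\t56,789"), path)

dat <- vroom(
  path,
  delim = "\t",
  col_types = list(item = col_character(), amount = col_number())
)
dat
```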
Use vroom_write() to write delimited files; the default delimiter is tab.
Use delim = ',' to write CSV files.
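A short sketch of writing a CSV, using a small made-up data frame and a temporary file so that the result can be inspected directly:

```r
library(vroom)

# Small illustrative data frame
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")
vroom_write(df, out, delim = ",")

# Inspect the raw lines that were written
readLines(out)
```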
Output is automatically compressed with gzip, bzip2 or xz if the filename ends in .gz, .bz2 or .xz.
vroom_write(mtcars, "mtcars.tsv.gz")
vroom_write(mtcars, "mtcars.tsv.bz2")
vroom_write(mtcars, "mtcars.tsv.xz")
It is also possible to use other compressors, such as pigz (a parallel gzip implementation), lbzip2 (a parallel bzip2 implementation) or pixz (a parallel xz implementation), by using pipe() to create a pipe connection. The parallel versions can be considerably faster for large output files.
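The pipe approach might look like the sketch below. It assumes a Unix-like shell and that either pigz or gzip is installed; the fallback to gzip is added here only so the example runs on machines without pigz. lbzip2 and pixz would be used the same way with a matching file extension.

```r
library(vroom)

out <- tempfile(fileext = ".tsv.gz")

# Assumption: prefer pigz when it is on the PATH, otherwise fall back to gzip
compressor <- if (nzchar(Sys.which("pigz"))) "pigz" else "gzip"

# The shell command reads from stdin and redirects compressed output to the file
vroom_write(mtcars, pipe(paste(compressor, ">", shQuote(out))))
```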
vignette("benchmarks") discusses the performance of vroom, how it compares to alternatives and how it achieves its results.