The vroom package contains one function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.
The most common type of delimited file is CSV (Comma Separated Values); these files typically have a .csv suffix.
To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter will be guessed automatically if it is a common one. If the guessing fails, or you are using a less common delimiter, specify it with the delim parameter, e.g. delim = ",".
We have included an example CSV file in the vroom package. Access it with vroom_example("mtcars.csv").
library(vroom)
# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] ".../inst/extdata/mtcars.csv"
# Read the file; by default vroom will guess the delimiter automatically.
vroom(file)
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
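The same delim argument covers less common delimiters. The following is a minimal sketch, assuming a small pipe-delimited file written to a temporary location purely for illustration; the file contents and the name pipe_file are made up for this example and are not part of vroom.

```r
library(vroom)

# Hypothetical example data: a tiny pipe-delimited file in a temporary location
pipe_file <- tempfile(fileext = ".txt")
writeLines(
  c("model|mpg|cyl", "Mazda RX4|21|6", "Datsun 710|22.8|4"),
  pipe_file
)

# State the delimiter explicitly rather than relying on guessing
cars <- vroom(pipe_file, delim = "|")
cars
```

Passing delim explicitly is the safer choice for a file this small, since there is little data for the guesser to work with.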
If you are reading a set of (CSV or otherwise delimited) files which all have the same columns, you can pass all the filenames directly to vroom() and it will combine them into one result.
First we will create some files to read by splitting the mtcars dataset by number of cylinders (it is OK if you don’t currently understand this code).
mt <- tibble::rownames_to_column(mtcars, "model")
purrr::iwalk(
  split(mt, mt$cyl),
  ~ readr::write_csv(.x, glue::glue("mtcars_{.y}.csv"))
)
We can then efficiently read them into one result by passing the filenames directly to vroom.
files <- fs::dir_ls(glob = "mtcars*csv")
files
#> mtcars_4.csv mtcars_6.csv mtcars_8.csv
vroom(files)
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 Merc 2… 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 Merc 2… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> # … with 29 more rows
Often the filename or the directory the files are in contains information about the contents. In this case the id parameter can be used to add an extra column (here named path) to the result that contains the file path.
vroom(files, id = "path")
#> # A tibble: 32 x 13
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 Merc 2… 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 Merc 2… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> # … with 29 more rows, and 1 more variable: path <chr>
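Once the path column is present, the information encoded in the file name can be pulled out with ordinary string functions. The sketch below is one possible approach, not something vroom does for you: it writes the per-cylinder files with base R into a temporary directory (the names out_dir and source are illustrative assumptions), reads them back with id = "path", and derives a label from each file name.

```r
library(vroom)

# Write one CSV per cylinder count into a temporary directory (base R only)
out_dir <- tempfile("mtcars_split_")
dir.create(out_dir)
mt <- cbind(model = rownames(mtcars), mtcars)
for (cyl in unique(mt$cyl)) {
  write.csv(
    mt[mt$cyl == cyl, ],
    file.path(out_dir, paste0("mtcars_", cyl, ".csv")),
    row.names = FALSE
  )
}

# Read all files at once, recording which file each row came from
files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
cars <- vroom(files, id = "path", delim = ",")

# Derive a label from the file name, e.g. "mtcars_4" from ".../mtcars_4.csv"
cars$source <- sub("\\.csv$", "", basename(cars$path))
table(cars$source)
```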
vroom supports reading zip, gz, bz2 and xz compressed files automatically; just pass the filename of the compressed file to vroom().
file <- vroom_example("mtcars.csv.gz")
vroom(file)
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
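To try this with your own data, a compressed file can be produced with base R. This is a small sketch under the assumption that a gzip file written through a gzfile() connection is representative of the compressed inputs you have; gz_file is an illustrative name.

```r
library(vroom)

# Write a gzip-compressed CSV with base R's gzfile() connection
gz_file <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_file, open = "w")
write.csv(head(mtcars), con, row.names = FALSE)
close(con)

# vroom() is given only the path; no explicit decompression step is needed
cars <- vroom(gz_file, delim = ",")
dim(cars)
```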
vroom can also read files from the internet: pass the URL of the file to vroom().
file <- "https://raw.githubusercontent.com/jimhester/vroom/master/inst/extdata/mtcars.csv"
vroom(file)
It can even read gzipped files from the internet (although currently not the other compressed formats).
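vroom uses the col_keep and col_skip parameters to control which columns to keep or skip in the output. Both parameters accept input in three different ways: a character vector of column names, a numeric vector of column positions, or a logical vector with one element per column.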
vroom(file, col_keep = c("model", "cyl", "gear"))
#> # A tibble: 32 x 3
#> model cyl gear
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 6 4
#> 2 Mazda RX4 Wag 6 4
#> 3 Datsun 710 4 4
#> # … with 29 more rows
# A numeric vector of column positions, e.g. c(1, 2, 5), also works
vroom(file, col_keep = c(1, 3, 11))
#> # A tibble: 32 x 3
#> model cyl gear
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 6 4
#> 2 Mazda RX4 Wag 6 4
#> 3 Datsun 710 4 4
#> # … with 29 more rows
vroom(file, col_keep = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
#> # A tibble: 32 x 3
#> model cyl gear
#> <chr> <dbl> <dbl>
#> 1 Mazda RX4 6 4
#> 2 Mazda RX4 Wag 6 4
#> 3 Datsun 710 4 4
#> # … with 29 more rows
col_skip works in the same way, but has the opposite effect.
vroom guesses the data types of columns as they are read; however, sometimes it is necessary to change the type of one or more columns.
The available specifications are (with single-letter abbreviations in quotes):
col_logical() ‘l’, containing only T, F, TRUE, FALSE, 1 or 0.
col_integer() ‘i’, integer values.
col_double() ‘d’, floating point values.
col_character() ‘c’, everything else.
col_factor(levels, ordered) ‘f’, a fixed set of values.
col_skip() ‘_’ or ‘-’, don’t import this column.
col_guess() ‘?’, parse using the “best” type based on the input.
You can tell vroom what column types to use with the col_types argument in a number of ways.
If you only need to override a single column, the most concise way is to use a named vector.
# read the 'hp' column as an integer
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i"))
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# also skip reading the 'cyl' column
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_"))
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
# also read the 'gear' column as a factor
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_", gear = "f"))
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
However, you can also use the col_*() functions in a list.
vroom(
  vroom_example("mtcars.csv"),
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)
#> # A tibble: 32 x 11
#> model mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.
vroom(
  vroom_example("mtcars.csv"),
  # levels is a plain character vector of the allowed values
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda … 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> # … with 29 more rows
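One way to confirm that the levels were applied is to inspect the resulting factor. The sketch below repeats the call with an explicit character vector of levels and then checks the column; it uses only the bundled example file.

```r
library(vroom)

cars <- vroom(
  vroom_example("mtcars.csv"),
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)

# The factor levels follow the order given in `levels`
levels(cars$gear)
is.factor(cars$gear)
```

Supplying the levels up front fixes their order, which matters for anything downstream that depends on level order, such as plotting or modelling.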
vignette("benchmarks") discusses the performance of vroom, how it compares to alternatives, and how it achieves its results.