https://github.com/jimhester/vroom

Big Idea

Index the csv / tsv file up front, memory map it, and then use Altrep to delay field parsing until the values are actually needed.
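The idea can be sketched in a few lines of base R. This is only a toy illustration, not vroom's implementation: `build_index()` and `get_field()` are hypothetical helper names, the input is an in-memory string rather than a memory-mapped file, and quoting and trailing newlines are ignored.

```r
# Toy sketch of "index first, parse later" (illustrative; not vroom's code).
# build_index() and get_field() are hypothetical helpers.
build_index <- function(text, delim = ",") {
  chars <- utf8ToInt(text)
  newline <- utf8ToInt("\n")
  # Every delimiter or newline ends a field; record those positions once.
  ends <- which(chars == utf8ToInt(delim) | chars == newline)
  first_nl <- which(chars == newline)[1]
  list(
    text = text,
    starts = c(1L, ends + 1L),
    stops = c(ends - 1L, length(chars)),
    # Number of fields on the first line gives the column count.
    ncol = if (is.na(first_nl)) length(ends) + 1L else sum(ends <= first_nl)
  )
}

# Look up one field by (row, col); nothing is converted until this is called.
get_field <- function(index, row, col) {
  k <- (row - 1L) * index$ncol + col
  substr(index$text, index$starts[k], index$stops[k])
}

idx <- build_index("a,b\n1,2.5\n3,4.5")

# Only the requested fields are extracted and converted to numbers.
b_values <- as.numeric(vapply(2:3, function(r) get_field(idx, r, 2), character(1)))
b_values
```

Building the index touches every character once, but converting text to numbers is deferred to the fields that are actually requested.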

~10x faster initial import than data.table, ~15x faster than readr, ~65x faster than read.delim

package       time (sec)   speedup   throughput (MB/sec)
vroom               1.72     65.88                971.70
data.table         19.37      5.83                 86.03
readr              25.71      4.40                 64.84
read.delim        113.02      1.00                 14.75

.Internal(inspect(high_fares, 1, 11))
@7f9b0e08eac8 19 VECSXP g0c5 [OBJ,MARK,NAM(3),ATT] (len=11, tl=0)
  @7f9b0561c600 16 STRSXP g0c7 [MARK,NAM(3)] (len=317, tl=0)
  @7f9b0a211c00 16 STRSXP g0c7 [MARK,NAM(3)] (len=317, tl=0)
  @7f9b0614de00 16 STRSXP g0c7 [MARK,NAM(3)] (len=317, tl=0)
  @7f9b0a0e8400 16 STRSXP g0c7 [MARK,NAM(3)] (len=317, tl=0)
  @7f9b0d07a800 16 STRSXP g0c7 [MARK,NAM(3)] (len=317, tl=0)
  @7f9b0ca46c00 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 460,250,260,242,280,275,209,260.06,323,200,250,...
  @7f9b09f4ce00 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 0,0,0,0,0,0,0,0,0,0,0,...
  @7f9b0a218e00 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 0.5,0,0,0,0,0,0.5,0,0,0.5,0.5,...
  @7f9b0a21e000 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 0,50,0,45,0,25,41.8,0,0,50,0,...
  @7f9b08048c00 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 0,0,6.55,0,0,0,4.8,0,4.8,0,0,...
  @7f9b0c3bb000 14 REALSXP g0c7 [MARK,NAM(3)] (len=317, tl=0) 460.5,300,266.55,287,280,300,256.1,260.06,327.8,250.5,250.5,...
ATTRIB:
  @7f9b0c338da0 02 LISTSXP g0c0 [MARK] 
    TAG: @7f9b0501bb00 01 SYMSXP g1c0 [MARK,NAM(3),LCK,gp=0x6000] "names" (has value)
    @7f9b0c3b54a8 16 STRSXP g1c5 [MARK,NAM(3)] (len=11, tl=0)
    TAG: @7f9b0501b8d0 01 SYMSXP g1c0 [MARK,NAM(3),LCK,gp=0x4000] "row.names" (has value)
    @7f9b0e0ea918 13 INTSXP g0c1 [MARK,NAM(3)] (len=2, tl=0) -2147483648,-317
    TAG: @7f9b0501bfd0 01 SYMSXP g1c0 [MARK,NAM(3),LCK,gp=0x4000] "class" (has value)
    @7f9b0d1097b8 16 STRSXP g1c3 [MARK,NAM(3)] (len=3, tl=0)

Reading and then fully materializing all vectors takes roughly as long as reading with data.table.

parser features

general

  • different delimiters (single character ASCII)
vroom::vroom("~/data/trip_fare_1.csv", delim = ",")

column types

  • column specifications via readr col specs
  • double, integer, character, logical, factor types
  • guessing of column types (evenly spaced sample across whole file)
  • column names
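As a rough sketch of why the sample is spread across the whole file, the base R helper below guesses a type from evenly spaced rows. `guess_type()` is a hypothetical function written for this example, and it uses `utils::type.convert()` as a stand-in for vroom's own guessing logic.

```r
# Hypothetical helper: guess a column's type from rows spread evenly over the data.
guess_type <- function(values, n_sample = 100) {
  n <- length(values)
  # Evenly spaced row positions from the first row to the last.
  rows <- unique(round(seq(1, n, length.out = min(n, n_sample))))
  class(utils::type.convert(values[rows], as.is = TRUE))
}

# A column that looks like integers at the top but holds doubles further down.
values <- c(rep("1", 500), rep("2.5", 500))

from_top <- class(utils::type.convert(head(values, 100), as.is = TRUE))
from_spread <- guess_type(values)

from_top     # guess based on the first 100 rows only
from_spread  # guess based on 100 rows spread over the whole column
```

Sampling only the top of the file would settle on an integer column here, while the evenly spaced sample also sees the later decimal values.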

skipping

  • row skipping
  • column skipping
  • skipping commented lines
  • skipping blank lines
cat(readLines(here::here("iris.tsv")), sep = "\n")
vroom::vroom(here::here("iris.tsv"), skip = 1, comment = "#")

field parsing

  • na value(s)
  • quoted fields
  • whitespace trimming
  • double quote escapes
  • backslash escapes
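The field-level steps listed above can be sketched in base R as follows. `parse_field()` is a hypothetical helper for illustration only; its argument names mirror the `na`, `escape_double` and `escape_backslash` options, but the logic is a simplification (for example, it assumes a quoted value is never treated as missing).

```r
# Hypothetical helper showing the per-field steps; not vroom's parser.
parse_field <- function(raw, na = "NA", trim_ws = TRUE,
                        escape_double = TRUE, escape_backslash = FALSE) {
  x <- raw
  if (trim_ws) x <- trimws(x)
  quoted <- nchar(x) >= 2 && startsWith(x, '"') && endsWith(x, '"')
  if (quoted) {
    x <- substr(x, 2, nchar(x) - 1)                      # drop surrounding quotes
    if (escape_double) x <- gsub('""', '"', x, fixed = TRUE)  # "" becomes "
  }
  if (escape_backslash) x <- gsub("\\\\(.)", "\\1", x)   # \x becomes x
  # Assumption of this sketch: only unquoted values can be missing.
  if (!quoted && x %in% na) return(NA_character_)
  x
}

parse_field('"abc""123"')                             # doubled quote inside quotes
parse_field(" foo\\, bar ", escape_backslash = TRUE)  # backslash-escaped delimiter
parse_field("MISSING", na = "MISSING")                # user-supplied NA string
```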

features novel to vroom (relative to readr)

  • multi-threaded indexing (coarse grained)
  • multi-threaded field parsing (double, integer, logical)
  • multiple files / connections
  • streaming from connections to temp file (automatic cleanup)
vroom::vroom(here::here("mtcars.tsv.gz"))
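A small self-contained sketch of reading several files, one of them gzip-compressed, into a single table. The file names and contents are made up for the example. The combined result is built with base R so the snippet runs without vroom; the vroom call at the end is only attempted if the package is installed.

```r
# Two small tab-separated files with the same columns; the second is gzipped.
dir <- tempfile("multi-")
dir.create(dir)
plain <- file.path(dir, "part1.tsv")
gz <- file.path(dir, "part2.tsv.gz")

writeLines(c("id\tvalue", "1\t1.5", "2\t2.5"), plain)
con <- gzfile(gz, "w")
writeLines(c("id\tvalue", "3\t3.5", "4\t4.5"), con)
close(con)

files <- c(plain, gz)

# Base R equivalent: read each file (gzfile() also reads uncompressed files)
# and stack the rows.
combined <- do.call(rbind, lapply(files, function(f) read.delim(gzfile(f))))
combined

# With vroom installed, a vector of paths can be passed in one call.
if (requireNamespace("vroom", quietly = TRUE)) {
  print(vroom::vroom(files, delim = "\t"))
}
```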

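requirements

  • R 3.5.0 (Altrep)
  • C++11 (mio library for memory mapping)
  • Recent (preview) version of RStudio (rstudio#4210, fixed on 2019-01-23: https://github.com/rstudio/rstudio/pull/4210)
  • index memory requirements: (#total fields + 1) * 64 bits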
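A quick back-of-the-envelope calculation of the index size, assuming the index holds one 64-bit entry per field plus one extra entry, i.e. (#total fields + 1) * 64 bits. `index_bytes()` is a hypothetical helper, and the row and column counts are made-up inputs, not measurements.

```r
# Hypothetical helper: bytes needed for an index of 64-bit entries,
# one per field plus one extra entry.
index_bytes <- function(n_rows, n_cols) {
  total_fields <- n_rows * n_cols
  (total_fields + 1) * 64 / 8
}

# Made-up example: 10 million rows by 11 columns.
bytes <- index_bytes(n_rows = 1e7, n_cols = 11)
bytes / 1024^2  # about 839 MiB
```

Under that assumption the index grows with the number of fields, not with the size of the file in bytes, so wide files with many short fields are the expensive case.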

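Rcpp / dplyr / R issues

  • Rcpp explicitly calls e.g. REAL() in the NumericVector constructor, which materializes the full vector
    • Future Rcpp PR to change this behavior and use REAL_ELT() and friends when possible
  • Support for logical Altrep vectors only in R-devel (should be in R 3.6.0)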

missing features

easy

  • Byte order marks
  • Windows newlines
  • user-supplied levels for factors

moderate

  • Dates, times, datetimes
  • readr’s flexible number parser
  • multiple character ASCII delimiters
  • unicode delimiters
  • Non UTF-8 input
  • automatically guessing delimiters
  • progress bars

harder

  • robustness to malformed inputs
  • multi-threading strategy requires no embedded newlines
    • Could use async per line reading / indexing for embedded newlines
  • comments and blank lines skipped only before column headers
    • requires changing field index datatype (start + length)
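The embedded-newline limitation can be illustrated with a short base R sketch. It is not vroom's indexer: it only shows that whether a newline ends a record depends on the quote state built up from the start of the input, which a thread starting in the middle of a file does not have. The sample text is made up, and backslash-escaped quotes are ignored.

```r
# A record with a newline inside a quoted field.
text <- 'id,note\n1,"line one\nline two"\n2,plain'
chars <- strsplit(text, "")[[1]]

# Naive view: every newline is a record boundary.
naive_breaks <- which(chars == "\n")

# Quote-aware view: a newline counts only when an even number of quote
# characters precedes it (doubled quotes flip the state twice, so they
# cancel out).
in_quotes <- cumsum(chars == '"') %% 2 == 1
real_breaks <- which(chars == "\n" & !in_quotes)

naive_breaks
real_breaks
```

The naive scan reports one boundary too many, so a worker that starts scanning after that position would index a partial record as if it were a full row.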

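possible performance improvements

  • async reading / writing / indexing for connections
  • string pool or static memory for escaped strings
    • currently each field is dynamically allocated and then copied twice

data: "abc""123"
std::string: abc"123
CHARSXP: abc"123

  • use C++ traits and template specializations for zero cost features
  • multi-threaded factor indexing
  • use Altrep for factors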