The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This post has referred to a vignette from dplyr
, you can
find it in https://dplyr.tidyverse.org/articles/two-table.html.
We’ll try to display how to join data tables in this vignette. First,
load the packages we need and get some data.
library(tidyfst)
library(nycflights13)
flights2 <- flights %>%
select_dt(year,month,day, hour, origin, dest, tailnum, carrier)
Do a left join with a simple:
flights2 %>%
left_join_dt(airlines)
#> Joining by: carrier
#> Key: <carrier>
#> carrier year month day hour origin dest tailnum
#> <char> <int> <int> <int> <num> <char> <char> <char>
#> 1: 9E 2013 1 1 8 JFK MSP N915XJ
#> 2: 9E 2013 1 1 15 JFK IAD N8444F
#> 3: 9E 2013 1 1 14 JFK BUF N920XJ
#> 4: 9E 2013 1 1 15 JFK SYR N8409N
#> 5: 9E 2013 1 1 15 JFK ROC N8631E
#> ---
#> 336772: YV 2013 9 29 16 LGA IAD N518LR
#> 336773: YV 2013 9 29 17 LGA CLT N932LR
#> 336774: YV 2013 9 30 16 LGA IAD N510MJ
#> 336775: YV 2013 9 30 17 LGA CLT N905FJ
#> 336776: YV 2013 9 30 20 LGA CLT N924FJ
#> 1 variable(s) not shown: [name <char>]
Join works the same as dplyr
:
flights2 %>% left_join_dt(weather)
#> Joining by: year,month,day,hour,origin
#> Key: <year, month, day, hour, origin>
#> year month day hour origin dest tailnum carrier temp
#> <int> <int> <int> <int> <char> <char> <char> <char> <num>
#> 1: 2013 1 1 5 EWR IAH N14228 UA 39.02
#> 2: 2013 1 1 5 EWR ORD N39463 UA 39.02
#> 3: 2013 1 1 5 JFK MIA N619AA AA 39.02
#> 4: 2013 1 1 5 JFK BQN N804JB B6 39.02
#> 5: 2013 1 1 5 JFK BOS N708JB B6 39.02
#> ---
#> 336772: 2013 12 31 23 EWR SJU N651JB B6 NA
#> 336773: 2013 12 31 23 JFK BQN N566JB B6 NA
#> 336774: 2013 12 31 23 JFK SJU N713TW DL NA
#> 336775: 2013 12 31 23 JFK SJU N509JB B6 NA
#> 336776: 2013 12 31 23 JFK PSE N665JB B6 NA
#> 9 variable(s) not shown: [dewp <num>, humid <num>, wind_dir <num>, wind_speed <num>, wind_gust <num>, precip <num>, pressure <num>, visib <num>, time_hour <POSc>]
flights2 %>% left_join_dt(planes, by = "tailnum")
#> Key: <tailnum>
#> tailnum year.x month day hour origin dest carrier year.y
#> <char> <int> <int> <int> <num> <char> <char> <char> <int>
#> 1: <NA> 2013 1 2 15 JFK LAX AA NA
#> 2: <NA> 2013 1 2 16 EWR ORD UA NA
#> 3: <NA> 2013 1 3 8 EWR MIA UA NA
#> 4: <NA> 2013 1 3 6 EWR DFW UA NA
#> 5: <NA> 2013 1 4 8 JFK DCA 9E NA
#> ---
#> 336772: N9EAMQ 2013 9 27 16 LGA ATL MQ NA
#> 336773: N9EAMQ 2013 9 29 12 LGA BNA MQ NA
#> 336774: N9EAMQ 2013 9 29 18 LGA CMH MQ NA
#> 336775: N9EAMQ 2013 9 30 11 JFK DCA MQ NA
#> 336776: N9EAMQ 2013 9 30 14 JFK TPA MQ NA
#> 7 variable(s) not shown: [type <char>, manufacturer <char>, model <char>, engines <int>, seats <int>, speed <int>, engine <char>]
flights2 %>% left_join_dt(airports, c("dest" = "faa"))
#> Key: <dest>
#> dest year month day hour origin tailnum carrier
#> <char> <int> <int> <int> <num> <char> <char> <char>
#> 1: ABQ 2013 10 1 20 JFK N554JB B6
#> 2: ABQ 2013 10 2 20 JFK N607JB B6
#> 3: ABQ 2013 10 3 20 JFK N591JB B6
#> 4: ABQ 2013 10 4 20 JFK N662JB B6
#> 5: ABQ 2013 10 5 19 JFK N580JB B6
#> ---
#> 336772: XNA 2013 9 29 17 LGA N725MQ MQ
#> 336773: XNA 2013 9 30 7 LGA N735MQ MQ
#> 336774: XNA 2013 9 30 8 EWR N14117 EV
#> 336775: XNA 2013 9 30 15 LGA N725MQ MQ
#> 336776: XNA 2013 9 30 17 LGA N720MQ MQ
#> 7 variable(s) not shown: [name <char>, lat <num>, lon <num>, alt <num>, tz <num>, dst <char>, tzone <char>]
flights2 %>% left_join_dt(airports, c("origin" = "faa"))
#> Key: <origin>
#> origin year month day hour dest tailnum carrier
#> <char> <int> <int> <int> <num> <char> <char> <char>
#> 1: EWR 2013 1 1 5 IAH N14228 UA
#> 2: EWR 2013 1 1 5 ORD N39463 UA
#> 3: EWR 2013 1 1 6 FLL N516JB B6
#> 4: EWR 2013 1 1 6 SFO N53441 UA
#> 5: EWR 2013 1 1 6 LAS N76515 UA
#> ---
#> 336772: LGA 2013 9 30 18 BNA N740EV EV
#> 336773: LGA 2013 9 30 22 SYR <NA> 9E
#> 336774: LGA 2013 9 30 12 BNA N535MQ MQ
#> 336775: LGA 2013 9 30 11 CLE N511MQ MQ
#> 336776: LGA 2013 9 30 8 RDU N839MQ MQ
#> 7 variable(s) not shown: [name <char>, lat <num>, lon <num>, alt <num>, tz <num>, dst <char>, tzone <char>]
df1 <- data.table(x = c(1, 2), y = 2:1)
df2 <- data.table(x = c(1, 3), a = 10, b = "a")
df1 %>% inner_join_dt(df2)
#> Joining by: x
#> Key: <x>
#> x y a b
#> <num> <int> <num> <char>
#> 1: 1 2 10 a
df1 %>% left_join_dt(df2)
#> Joining by: x
#> Key: <x>
#> x y a b
#> <num> <int> <num> <char>
#> 1: 1 2 10 a
#> 2: 2 1 NA <NA>
df1 %>% right_join_dt(df2)
#> Joining by: x
#> Key: <x>
#> x y a b
#> <num> <int> <num> <char>
#> 1: 1 2 10 a
#> 2: 3 NA 10 a
df1 %>% full_join_dt(df2)
#> Joining by: x
#> Key: <x>
#> x y a b
#> <num> <int> <num> <char>
#> 1: 1 2 10 a
#> 2: 2 1 NA <NA>
#> 3: 3 NA 10 a
If all you have is a data.frame or tibble, you have no need to change the format. Feed the data directly:
df1 <- data.frame(x = c(1, 1, 2), y = 1:3)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join_dt(df2)
#> Joining by: x
#> Key: <x>
#> x y z
#> <num> <int> <char>
#> 1: 1 1 a
#> 2: 1 1 b
#> 3: 1 2 a
#> 4: 1 2 b
#> 5: 2 3 a
The “_dt” suffix should remind you that this is backed up by
data.table
and will always return a data.table in the
end.
Filtering joins have also been supported in tidyfst
.
flights %>%
anti_join_dt(planes, by = "tailnum") %>%
count_dt(tailnum, sort = TRUE)
#> tailnum n
#> <char> <int>
#> 1: <NA> 2512
#> 2: N725MQ 575
#> 3: N722MQ 513
#> 4: N723MQ 507
#> 5: N713MQ 483
#> ---
#> 718: N7BKAA 1
#> 719: N7CAAA 1
#> 720: N5FCAA 1
#> 721: N5ERAA 1
#> 722: N647MQ 1
Other examples (semi_join_dt()
and
anti_join_dt()
never duplicate; they only ever remove
observations.):
df1 <- data.frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
#> [1] 4
# And we get four rows after the join
df1 %>% inner_join_dt(df2, by = "x") %>% nrow()
#> [1] 4
# But only two rows actually match
df1 %>% semi_join_dt(df2, by = "x") %>% nrow()
#> [1] 2
For set operations, wrap data.table
’s function directly,
but the functions will automatically turn any data.frame into
data.table. Examples are listed as below:
x = iris[c(2,3,3,4),]
x2 = iris[2:4,]
y = iris[c(3:5),]
intersect_dt(x, y) # intersect
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.7 3.2 1.3 0.2 setosa
#> 2: 4.6 3.1 1.5 0.2 setosa
intersect_dt(x, y, all=TRUE) # intersect all
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.7 3.2 1.3 0.2 setosa
#> 2: 4.6 3.1 1.5 0.2 setosa
setdiff_dt(x, y) # except
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.9 3 1.4 0.2 setosa
setdiff_dt(x, y, all=TRUE) # except all
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.9 3.0 1.4 0.2 setosa
#> 2: 4.7 3.2 1.3 0.2 setosa
union_dt(x, y) # union
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.9 3.0 1.4 0.2 setosa
#> 2: 4.7 3.2 1.3 0.2 setosa
#> 3: 4.6 3.1 1.5 0.2 setosa
#> 4: 5.0 3.6 1.4 0.2 setosa
union_dt(x, y, all=TRUE) # union all
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.9 3.0 1.4 0.2 setosa
#> 2: 4.7 3.2 1.3 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 4.7 3.2 1.3 0.2 setosa
#> 6: 4.6 3.1 1.5 0.2 setosa
#> 7: 5.0 3.6 1.4 0.2 setosa
setequal_dt(x, x2, all=FALSE) # setequal
#> [1] TRUE
setequal_dt(x, x2)
#> [1] FALSE
For more details, just find the help from data.table
using ?setops
.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.