library(fjoin)
read_df <- function(x) data.table::fread(x, quote = "'", data.table = FALSE)The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
fjoin builds on data.table to provide fast, flexible joins on any data frames. It slots into tidyverse pipelines and general workflows in a single line, and provides NA-safe matching by default, on-the-fly column selection, flexible row-order preservation, multiple-match handling on both sides, and an indicator column for row origin.
Stable release (CRAN):
install.packages("fjoin")Latest development version (R-universe):
install.packages("fjoin",
repos = c("https://trobx.r-universe.dev", "https://cloud.r-project.org"))lists of same-length vectors, and returns a plain data.frame, (grouped) tibble, data.table, sf, or sf-tibble. Refreshes dynamic attributes like groups, keys, agr, and bbox in the output.fjoin provides several distinctive options and controls, including:
NA unless overridden with match.na).select for efficient one-liners.indicate.mult.x and mult.y specify how to handle multiple matches on either or both sides of the join, including in semi- and anti-joins.order.fjoin_* functions( x/y style) |
dtjoin_* functions(extended DT[i] style) |
|---|---|
fjoin_inner(), fjoin_left(), fjoin_right(),fjoin_full() |
dtjoin() |
fjoin_left_semi (alias fjoin_semi), fjoin_right_semi() |
dtjoin_semi() |
fjoin_left_anti (alias fjoin_anti), fjoin_right_anti() |
dtjoin_anti() |
fjoin_cross() |
dtjoin_cross() |
The fjoin_* family consists of conventional left/right-style join functions. They are wrappers around the dtjoin_* functions (also exported), which use a generalisation of data.table’s DT[i] join syntax, and which write the join code and handle execution.
library(fjoin)
read_df <- function(x) data.table::fread(x, quote = "'", data.table = FALSE)Plain data frames joined by simple equality, using fjoin_full() for illustration.
dfP <- read_df("
id item price other_cols
NA apples 10 ...
3 bananas 20 ...
2 cherries 30 ...
1 dates 40 ...
")
dfQ <- read_df("
id quantity notes other_cols
2 5 '' ...
1 6 '' ...
3 7 '' ...
NA 8 'oranges' ...
")fjoin_full(dfQ, dfP, on = "id") id quantity notes other_cols item price R.other_cols
1 2 5 ... cherries 30 ...
2 1 6 ... dates 40 ...
3 3 7 ... bananas 20 ...
4 NA 8 oranges ... <NA> NA <NA>
5 NA NA <NA> <NA> apples 10 ...
The default match.na = FALSE prevents apples-and-oranges matches on missing values — other frameworks would join the last two rows.
Use select to restrict the join to particular columns and put them in desired order:
fjoin_full(dfQ, dfP, on = "id", select = c("item", "price", "quantity")) id item price quantity
1 2 cherries 30 5
2 1 dates 40 6
3 3 bananas 20 7
4 NA <NA> NA 8
5 NA apples 10 NA
This is much easier to write and read than the equivalent operation in dplyr, which requires two calls before the join to avoid inflating it with irrelevant columns, and a call after it to shuffle column order:
x <- dfQ |> select(id, quantity)
y <- dfP |> select(id, item, price)
full_join(x, y, join_by(id), na.matches = "never") |>
select(id, item, price, quantity)1L: left only, 2L: right only, 3L: joined from both. In Stata since 1984!1
fjoin_full(
dfQ,
dfP,
on = "id",
select = c("item", "price", "quantity"),
indicate = TRUE
) .join id item price quantity
1 3 2 cherries 30 5
2 3 1 dates 40 6
3 3 3 bananas 20 7
4 1 NA <NA> NA 8
5 2 NA apples 10 NA
fjoin_full(
dfQ,
dfP,
on = "id",
select = c("item", "price", "quantity"),
indicate = TRUE,
order = "right"
) .join id item price quantity
1 2 NA apples 10 NA
2 3 3 bananas 20 7
3 3 2 cherries 30 5
4 3 1 dates 40 6
5 1 NA <NA> NA 8
For data.table users.
fjoin_full(
dfQ,
dfP,
on = "id",
select = c("item", "price", "quantity"),
indicate = TRUE,
order = "right",
do = FALSE
).DT : x = dfQ (cast as data.table)
.i : y = dfP (cast as data.table)
Join: setDF(with(list(fjoin.temp = setDT(.DT[, fjoin.which.DT := .I][, na.omit(.SD, cols = "id"), .SDcols = c("id", "quantity", "fjoin.which.DT")][, fjoin.ind.DT := TRUE][.i, on = "id", data.frame(.join = fifelse(is.na(fjoin.ind.DT), 2L, 3L), id, item, price, quantity, fjoin.which.DT), allow.cartesian = TRUE])), rbind(fjoin.temp, setDT(.DT[!fjoin.temp$fjoin.which.DT, data.frame(id, quantity, fjoin.which.DT, .join = rep(1L, .N))]), fill = TRUE))[, fjoin.which.DT := NULL])[]
mult.x and mult.ydata.table (mult) and dplyr (multiple) can reduce the cardinality on one side of the join from many ("all") to one ("first" or "last"). fjoin (mult.x, mult.y) will do this on either side of the join, or on both sides at the same time. This example (using fjoin_left()) shows an application to temporally ordered data frames of generic “events” and “reactions”.
events <- read_df("
event_id event_ts
1 10
2 20
3 40
")
reactions <- read_df("
reaction_id reaction_ts
1 30
2 50
3 60
")fjoin_left(
events,
reactions,
on = c("event_ts < reaction_ts")
) event_id event_ts reaction_id reaction_ts
1 1 10 1 30
2 1 10 2 50
3 1 10 3 60
4 2 20 1 30
5 2 20 2 50
6 2 20 3 60
7 3 40 2 50
8 3 40 3 60
Equivalent to a one-way forward rolling join.
fjoin_left(
events,
reactions,
on = c("event_ts < reaction_ts"),
mult.x = "first"
) event_id event_ts reaction_id reaction_ts
1 1 10 1 30
2 2 20 1 30
3 3 40 2 50
Equivalent to a two-way rolling join (mutual forward/backward rolling matches).
fjoin_left(
events,
reactions,
on = c("event_ts < reaction_ts"),
mult.x = "first",
mult.y = "last"
) event_id event_ts reaction_id reaction_ts
1 1 10 NA NA
2 2 20 1 30
3 3 40 2 50
fjoin_* to dtjoin_* to data.tableA technical illustration showing:
fjoin_left() (commented out), differing in the order argumentdtjoin(), with the addition of show = TRUEdf_x <- data.frame(id_x = 1:3, row_x = paste0("x", 1:3))
df_y <- data.frame(id_y = rep(4:2, each = 2L), row_y = paste0("y", 1:6))# (1) fjoin_left(df_x, df_y, on = "id_x == id_y", mult.x = "first")
dtjoin(
df_y,
df_x,
on = "id_y == id_x",
mult = "first",
i.home = TRUE,
prefix = "R.",
show = TRUE
).DT : df_y (cast as data.table)
.i : df_x (cast as data.table)
Join: .DT[.i, on = "id_y == id_x", mult = "first", data.frame(id_x, row_x, row_y)]
id_x row_x row_y
1 1 x1 <NA>
2 2 x2 y5
3 3 x3 y3
# (2) fjoin_left(df_x, df_y, on = "id_x == id_y", mult.x = "first", order = "right")
dtjoin(
df_x,
df_y,
on = "id_x == id_y",
mult.DT = "first",
nomatch = NULL,
nomatch.DT = NA,
prefix = "R.",
show = TRUE
).DT : df_x (cast as data.table)
.i : df_y (cast as data.table)
Join: setDF(with(list(fjoin.temp = setDT(setDT(.i[, fjoin.which.i := .I][.DT[, fjoin.which.DT := .I], on = "id_y == id_x", nomatch = NULL, mult = "first", data.frame(id_x = i.id_x, row_x = i.row_x, fjoin.which.DT = i.fjoin.which.DT, fjoin.which.i)])[.i, on = "fjoin.which.i", nomatch = NULL, data.frame(id_x = i.id_y, row_x, fjoin.which.DT, row_y)])), rbind(fjoin.temp, setDT(.DT[!fjoin.temp$fjoin.which.DT, data.frame(id_x, row_x, fjoin.which.DT)]), fill = TRUE))[, fjoin.which.DT := NULL])[]
id_x row_x row_y
1 3 x3 y3
2 2 x2 y5
3 1 x1 <NA>
The difference in the order argument passed to fjoin_*() is reflected at dtjoin() level in the identity of the tables passed to .DT and .i, the values of the extended arguments nomatch.DT and mult.DT (counterparts to the familiar data.table arguments nomatch and mult on the other side of the join), and a compensating argument i.home which toggles the “home” and “foreign” table for the purposes of column order and prefixing (as well as for indicate and output class). dtjoin() in turn translates these specifications into data.table code for execution. See the dtjoin() documentation for full details of this extended DT[i]-style syntax.
The benchmark above is based on a no-frills equality join to allow comparison with merge.data.table and collapse::join(); see the Performance article for more detail. In the inner and left joins, fjoin and merge.data.table reflect a simple operation in data.table, and straightforwardly inherit its speed and robustness to the order of the tables. But fjoin performs a bit better than merge.data.table on the right and full joins. This is typical: fjoin’s solutions for join types and additional options that are not straightforward in native data.table have been developed with close attention to performance.
fjoin is a drop-in alternative to dplyr::*_join() with fast large data performance and useful options that dplyr::*_join() lacks (though the reverse is also true — see below). Joins are fairly infrequent operations, and the package name is short, so you may not feel the need to attach it:
library(dplyr)
dfQ <- as_tibble(dfQ)
dfQ |>
fjoin::fjoin_full(dfP,
on = "id",
select = c("item", "price", "quantity"),
order = "right",
indicate = TRUE
) |>
mutate(
quantity = if_else(.join == 2L, 0L, quantity),
revenue = price * quantity
)# A tibble: 5 × 6
.join id item price quantity revenue
<int> <int> <chr> <int> <int> <int>
1 2 NA apples 10 0 0
2 3 3 bananas 20 7 140
3 3 2 cherries 30 5 150
4 3 1 dates 40 6 240
5 1 NA <NA> NA 8 NA
Please note that fjoin, for now, has no equivalent of dplyr::*_join()’s relationship validation: it is silent and permissive about cardinality. It also doesn’t yet support rolling joins on unordered data, which dplyr implements elegantly via a helper function in join_by, or dedicated overlap joins (although these are easily written in terms of inequalities). These features will be added.
The implementation of joins in the data.table-backed packages dtplyr and tidytable needs maintenance. In both cases it only supports equality joins, malfunctions in the presence of same-named non-join columns on each side (try it), and silently ignores dplyr’s additional join arguments such as na.matches and multiple (as well as relationship). These packages are excellent tools that do much more than just joins, but you should be aware of these limitations.
countries <- read_df("
country_id country_name country_shape
1 'Country A' 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))'
2 'Country B' 'POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1))'
3 'Country C' 'POLYGON ((2 2, 3 2, 3 3, 2 3, 2 2))'
") |> sf::st_as_sf(wkt = "country_shape", crs = 4326)
capitals <- read_df("
country_id capital_name capital_loc
2 'City B' 'POINT (1.5 1.5)'
3 'City C' 'POINT (2.5 2.5)'
4 'City D' 'POINT (3.5 3.5)'
") |> sf::st_as_sf(wkt = "capital_loc", crs = 4326)fjoin smoothly accommodates joins involving sf data frames. In particular, joins between two sf objects work as you would hope:
fjoin_inner(countries, capitals, on = "country_id")Simple feature collection with 2 features and 3 fields
Active geometry column: country_shape
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 1 ymin: 1 xmax: 3 ymax: 3
Geodetic CRS: WGS 84
country_id country_name country_shape capital_name
1 2 Country B POLYGON ((1 1, 2 1, 2 2, 1 ... City B
2 3 Country C POLYGON ((2 2, 3 2, 3 3, 2 ... City C
capital_loc
1 POINT (1.5 1.5)
2 POINT (2.5 2.5)
This is useful in workflows where you want to hold multiple geometries in the same sf data frame. In dplyr such joins are prohibited:
try(dplyr::inner_join(countries, capitals, by = "country_id"))Error : y should not have class sf; for spatial joins, use st_join
In addition, fjoin always detects and refreshes sfc-class columns in the join output, regardless of whether the inputs and output have sf class (and whether those columns were/are active geometries). This avoids stale bounding boxes and ensures that values at non-matching rows are converted from NULL to a valid empty geometry. For example, here is the sfc column capital_loc from the input on the right, after a left join in which the inputs are plain data frames instead of sfs:
fjoin_left(as.data.frame(countries), as.data.frame(capitals), on = "country_id")$capital_locGeometry set for 3 features (with 1 geometry empty)
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 1.5 ymin: 1.5 xmax: 2.5 ymax: 2.5
Geodetic CRS: WGS 84
POINT EMPTY
POINT (1.5 1.5)
POINT (2.5 2.5)
fjoin automates joins that are challenging or laborious to write in data.table, while solving frustrations such as garbled join columns in inequality joins and the lack of an effective incomparables argument, and providing other useful options. Even for very simple joins, there is no reason not to use it, since it has negligible overhead and if anything will actually slightly outperform merge (see above) and even native data.table (see below).
That said, fjoin is not a comprehensive wrapper for data.table’s rich join functionality. Some things it cannot do are:
j (including by = .EACHI aggregations), rather than simply selecting themDT[i, on = <on> , v := i.v]You do have the option of setting do = FALSE, copying the console output, and editing the j-expression(s). You can also “plonk” joined columns by reference using the pattern DT[, v := fjoin_left(DT, i, on = <on>, select = "v")$v], which can often usefully be combined with the ".join" indicator:
library(data.table)
dtQ <- as.data.table(dfQ)
dtP <- as.data.table(dfP)
dtP[, revenue :=
price * fjoin_left(
dtP,
dtQ,
on = "id",
select = c("quantity"),
indicate = TRUE
)[.join == 1L, quantity := 0L]$quantity][] id item price other_cols revenue
<int> <char> <int> <char> <int>
1: NA apples 10 ... 0
2: 3 bananas 20 ... 140
3: 2 cherries 30 ... 150
4: 1 dates 40 ... 240
The package actually began life as a tool to solve data.table’s garbling of join columns in non-equi joins. It still does that:
dt1 <- data.table(t=c(5L,25L,45L))
dt2 <- data.table(t_start=c(1L,21L), t_end=c(10L,30L))Here is a range join with fjoin:
dtjoin(dt2, dt1, on=c("t_start <= t", "t_end >= t"), show = TRUE).DT : dt2
.i : dt1
Join: setDT(.DT[.i, on = c("t_start <= t", "t_end >= t"), data.frame(t_start = x.t_start, t_end = x.t_end, t)])[]
t_start t_end t
<int> <int> <int>
1: 1 10 5
2: 21 30 25
3: NA NA 45
Compare the default output from data.table:
dt2[dt1, on=.(t_start <= t, t_end >= t)] t_start t_end
<int> <int>
1: 5 5
2: 25 25
3: 45 45
Notice that fjoin uses data.frame() in j (coupled with an outer setDT()) instead of the more usual list(), even when the required output is a data.table. This is to sidestep an unnecessary deep copy of the joined columns currently made by data.table in this case. Likewise, fjoin avoids deep copies on the way in and out by “shallow-casting” inputs and outputs to and from data.tables as necessary. Shallow conversion is not a safe general way of operating with foreign objects in data.table, but it works for joins, because the input vectors only need to be read, and the output vectors are guaranteed to be unshared.
One consequence of this is that fjoin can be more efficient than data.table itself with standard interactive idioms. Consider the following (exaggerated) example, where the i expression is a data.frame and we select columns in j with list() (aliased .()):
n <- 1e6L; ncol_dt <- 2L; ncol_df <- 10L
dt <- data.table(id = rep(1:n, each = 5L), matrix(runif(n * ncol_dt), ncol = ncol_dt))
df <- data.frame(id = 1:n, matrix(runif(n * ncol_df), ncol = ncol_df))
bench::mark(
data.table = dt[df, on = .(id), .(id, V1, V2, X1, X3, X5, X7, X9)],
fjoin = dtjoin(dt, df, on = "id", select.i = c("X1", "X3", "X5", "X7", "X9")),
iterations = 3,
check = TRUE
) |> summary() |> subset(select = c("expression", "n_itr", "median", "mem_alloc"))# A tibble: 2 × 3
expression median mem_alloc
<bch:expr> <bch:tm> <bch:byt>
1 data.table 374ms 752MB
2 fjoin 172ms 367MB
Here fjoin avoids a call to as.data.table.data.frame on the way in, and a call to as.data.table.list on the way out, both of which (currently) always deep-copy. The bench::mark memory measurements exclude C-level allocations, but the difference between them reflects these R-level copies. This behaviour will eventually change in data.table, at which point fjoin will revert to the more familiar j = list() idiom.
Finally, please note that dtjoin() differs from data.table in its preservation of keys. In a data.table DT[i] join, the output inherits the key of DT provided it happens to remain sorted on those columns; this is consistent with data.table’s conception of joins as a subsetting-like operation on DT, even though it is the i-table that dictates the row order. With dtjoin(), the output always inherits the key of .i (unless the non-joining rows of .DT are appended, in which case the key is NULL since sortedness on .i’s key columns can no longer be guaranteed). This design choice ensures intuitive behaviour of the fjoin_*() functions: for example, fjoin_left(x, y) with the default order = "left" preserves x’s key, as a user would surely expect.
With match.na = FALSE (the default), fjoin inspects the data and, if the join permits, chooses which table to omit NA-containing rows from using a heuristic. The only thing that is not reflected in the data.table code that fjoin produces are the input and output handling steps:
data.table inputs as data.tables (new objects borrowing columns from the input)sf class to data.frame outputs where appropriate, including setting groups and agr attributesdata.table outputs as described abovesfc columns regardless of output classdata.table inputs used as-isThese steps ensure that fjoin handles object classes efficiently (no data copying) while leaving inputs intact.
In other respects, the package is a constructor of data.table code to a set of carefully thought-out solutions. fjoin pays a lot of attention to doing this well and is also very thoroughly tested.
See p. 99 of the Stata 1 reference manual, which has been linked to on Statalist and can be found by searching online. Thanks to Nick Cox for this information.↩︎
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.