The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Three blocks accept expressions: expression filter, mutate, and summarize.
Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:
across(), where(),
if_any(), if_all(), pick()See also: dplyr column-wise operations and dplyr window functions.
Expression filter block: Logical expressions to keep rows
>, <,
==, !=, >=,
<=& (AND) or | (OR)if_any(), if_all()Mutate block: Create or modify columns
across() to transform multiple columnsSummarize block: Aggregate data
mean(), sum(),
n()across() for multiple columns,
pick() for custom functions+, -, *, /,
^ - basic operations%/% - integer division%% - remainder (modulo)log(), log2(), log10() -
logarithmsexp() - exponentialsqrt() - square rootlag() - previous valueslead() - next valuescumsum() - cumulative sumcummean() - cumulative meancummin(), cummax() - cumulative
min/maxcumprod() - cumulative productrow_number() - sequential ranks (1, 2, 3, 4)min_rank() - ranks with ties (1, 2, 2, 4)dense_rank() - ranks without gaps (1, 2, 2, 3)percent_rank() - percentile (0 to 1)ntile(n) - divide into n bins<, >, <=,
>=, ==, !=& (and), | (or), !
(not)%in% - test membership# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)
# Lag differences
mpg - lag(mpg)
# Cumulative sums by group (use by parameter)
cumsum(sales)
# Rank values
min_rank(desc(hp))
See dplyr window functions for more.
mean() - averagemedian() - middle valuesd() - standard deviationIQR() - interquartile rangemad() - median absolute deviationmin(), max() - minimum and maximumquantile(x, 0.25) - percentilesfirst(), last(), nth(x, 2) -
positional valuesn() - count rowsn_distinct() - count unique valuessum(!is.na(x)) - count non-missingsum() - totalprod() - product# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))
# Count by group (use by parameter)
n()
# Multiple stats
list(
avg = mean(hp),
min = min(hp),
max = max(hp),
count = n()
)
For handling missing values, add na.rm = TRUE:
mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
See dplyr summarise for more.
Select columns by name pattern or type (used inside
across(), if_any(),
if_all()):
everything() - all columnsstarts_with("prefix") - columns starting with
prefixends_with("suffix") - columns ending with suffixcontains("text") - columns containing textwhere(is.numeric) - columns by type
(is.character, is.factor)c(col1, col2) - specific columnsCombine selections:
c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")
Apply the same operation to multiple columns.
Syntax:
across(.cols, .fns, .names = NULL)
.cols - which columns (use selection helpers).fns - function(s) to apply.names - control output names (default:
{.col}_{.fn})Transform multiple columns:
# Round all numeric columns
across(where(is.numeric), round)
# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))
# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
Use \(x) to create anonymous functions where
x represents the current column.
Calculate statistics for multiple columns:
# Mean of all numeric columns
across(where(is.numeric), mean)
# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))
# With grouping (use by parameter)
across(everything(), n_distinct)
# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd
# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp
Select columns as a data frame for custom functions.
Syntax: pick(.cols)
Use in summarize block with custom functions that need a data frame:
# With custom function
calc_stats(pick(everything()))
# Select specific columns
my_function(pick(c(hp, wt, mpg)))
When your expression returns a data frame, use the unpack option:
Example:
# Custom function that returns data frame
calc_stats <- function(df) {
data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}
# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns
Check conditions across multiple columns in the expression filter block.
if_any(): TRUE when condition is true for at least one column
# Rows with any NA
if_any(everything(), is.na)
# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)
# Search across text columns
if_any(where(is.character), \(x) x == "setosa")
if_all(): TRUE when condition is true for all columns
# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)
# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)
# No missing values
if_all(everything(), \(x) !is.na(x))
# Round numeric columns
across(where(is.numeric), round)
# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))
# Uppercase text
across(where(is.character), toupper)
# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
# Means (add grouping with by parameter)
across(where(is.numeric), mean)
# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))
# Count non-missing
across(everything(), \(x) sum(!is.na(x)))
# Complete rows only
if_all(everything(), \(x) !is.na(x))
# Any negative
if_any(where(is.numeric), \(x) x < 0)
# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)
# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))
Start simple: Test on one column, then use
across()
Check preview: Verify results in the block preview
Unpack option: In summarize, enable unpack when expressions return data frames
Combine helpers: Use & and
| to combine selections
across(where(is.numeric) & starts_with("Sepal"), mean)
Function syntax:
round (no arguments needed)\(x) round(x, 2) (with
arguments)x represents the current columnMissing values: Add na.rm = TRUE to
aggregation functions
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
For comprehensive documentation on column-wise operations, see:
These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.