The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Working with Expressions: Helper Functions for Advanced Data Manipulation

Introduction

Three blocks accept expressions: expression filter, mutate, and summarize.

Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:

See also: dplyr column-wise operations and dplyr window functions.

Expression Blocks

Expression filter block: Logical expressions to keep rows

Mutate block: Create or modify columns

Summarize block: Aggregate data

Useful Functions for Mutate

Arithmetic

Logs and exponentials

Offsets

Cumulative aggregates

Ranking

Logical comparisons

Examples

# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)

# Lag differences
mpg - lag(mpg)

# Cumulative sums by group (use by parameter)
cumsum(sales)

# Rank values
min_rank(desc(hp))

See dplyr window functions for more.

Useful Functions for Summarize

Center

Spread

Range

Position

Count

Sums and products

Examples

# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))

# Count by group (use by parameter)
n()

# Multiple stats
list(
  avg = mean(hp),
  min = min(hp),
  max = max(hp),
  count = n()
)

For handling missing values, add na.rm = TRUE:

mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

See dplyr summarise for more.

Column Selection Helpers

Select columns by name pattern or type (used inside across(), if_any(), if_all()):

Combine selections:

c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")

The across() Function

Apply the same operation to multiple columns.

Syntax: across(.cols, .fns, .names = NULL)

In Mutate Block

Transform multiple columns:

# Round all numeric columns
across(where(is.numeric), round)

# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))

# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Use \(x) to create anonymous functions where x represents the current column.

In Summarize Block

Calculate statistics for multiple columns:

# Mean of all numeric columns
across(where(is.numeric), mean)

# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))

# With grouping (use by parameter)
across(everything(), n_distinct)

Custom names

# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd

# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp

The pick() Function

Select columns as a data frame for custom functions.

Syntax: pick(.cols)

Use in summarize block with custom functions that need a data frame:

# With custom function
calc_stats(pick(everything()))

# Select specific columns
my_function(pick(c(hp, wt, mpg)))

Relationship to unpack parameter

When your expression returns a data frame, use the unpack option:

Example:

# Custom function that returns data frame
calc_stats <- function(df) {
  data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}

# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns

Filter Helpers: if_any() and if_all()

Check conditions across multiple columns in the expression filter block.

if_any(): TRUE when condition is true for at least one column

# Rows with any NA
if_any(everything(), is.na)

# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)

# Search across text columns
if_any(where(is.character), \(x) x == "setosa")

if_all(): TRUE when condition is true for all columns

# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)

# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)

# No missing values
if_all(everything(), \(x) !is.na(x))

Common Patterns

Mutate Block

# Round numeric columns
across(where(is.numeric), round)

# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))

# Uppercase text
across(where(is.character), toupper)

# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Summarize Block

# Means (add grouping with by parameter)
across(where(is.numeric), mean)

# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))

# Count non-missing
across(everything(), \(x) sum(!is.na(x)))

Expression Filter Block

# Complete rows only
if_all(everything(), \(x) !is.na(x))

# Any negative
if_any(where(is.numeric), \(x) x < 0)

# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)

# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))

Tips

Start simple: Test on one column, then use across()

Check preview: Verify results in the block preview

Unpack option: In summarize, enable unpack when expressions return data frames

Combine helpers: Use & and | to combine selections

across(where(is.numeric) & starts_with("Sepal"), mean)

Function syntax:

Missing values: Add na.rm = TRUE to aggregation functions

across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

Learn More

For comprehensive documentation on column-wise operations, see:

These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.