Chapter 06: Integrating Kernel Wrappers into Your Codebase

Kjell Nygren

2026-06-11

Introduction

Chapter 05 described the internal structure of a kernel wrapper: how inputs are converted, how the runner is dispatched, and how results are converted back. This chapter takes a step back and looks at how kernel wrappers fit into the broader codebase of a package.

Two questions arise immediately:

What happens when OpenCL is not available? Every kernel wrapper must have a CPU path. A wrapper that simply returns zeros is safe but unhelpful; most real wrappers need to fall back to a correct CPU computation.
How is the wrapper exposed? Some kernel wrappers have a direct interface into R (callable from R code). Others are purely internal C++ components, called by other C++ functions that hold the R-facing API. The choice depends on whether the computation has a natural direct R use.

nmathopencl contains examples of both patterns. The distribution-function wrappers (dnorm_opencl, pnorm_opencl, etc.) are exported R functions with rich CPU fallback logic. The GLM gradient wrapper (f2_f3_opencl) is a purely internal C++ component, called by a C++ dispatcher that also has a separate CPU implementation (f2_f3_non_opencl). Both patterns are explored in detail below.

The two integration patterns

Pattern 1: wrapper with a direct R interface

In this pattern the kernel wrapper (or a thin R function that calls it) is exported and callable directly from R. The CPU fallback is the equivalent computation using standard R or C functions — in nmathopencl’s case, the stats:: distribution functions.

R caller
  │
  ▼
R wrapper function  (exported, input validation, recycling)
  │  if inputs are non-finite, sd == 0, etc. → fallback_full()
  │
  ▼
.opencl_try_or_fallback()
  │  if !nmathopencl_has_opencl() → fallback_expr()  (CPU path)
  │  if OpenCL call succeeds      → return GPU result
  │  if OpenCL call fails
  │    and fallback = TRUE        → fallback_expr()  (CPU path)
  │    and fallback = FALSE       → propagate error
  ▼
C++ kernel wrapper  (internal, not exported)
  │  #ifdef USE_OPENCL + nmathopencl_has_opencl() guard
  │  type conversion + program assembly + runner dispatch
  ▼
GPU result

The fallback can be triggered at two separate levels:

R level (before the C++ call): when input validation detects a condition the GPU path cannot handle (e.g. sd == 0, non-finite values). fallback_full() calls stats::dnorm(x, mean, sd, log = log) directly.
Runtime level (around the C++ call): the R helper .opencl_try_or_fallback() checks nmathopencl_has_opencl() before attempting the GPU path. If OpenCL is not available, it calls fallback_expr() without ever touching the C++ kernel wrapper. If the GPU call throws an error and fallback = TRUE, it catches the error and calls fallback_expr(); with fallback = FALSE the error propagates to the caller.
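The R-level pre-check can be pictured as a single predicate over the recycled inputs. The following is a minimal C++ sketch of that idea, not code from nmathopencl: the name needs_cpu_path is hypothetical, and the sketch assumes the three vectors have already been recycled to a common length. It covers only the two conditions that route to the CPU path (non-finite values and sd == 0); a negative sd is a separate error case in the R wrapper and is not handled here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of nmathopencl): returns true when any
// element would force the CPU path under the rules described above.
// Assumes x, mean and sd were already recycled to the same length.
bool needs_cpu_path(const std::vector<double>& x,
                    const std::vector<double>& mean,
                    const std::vector<double>& sd) {
  assert(x.size() == mean.size() && x.size() == sd.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const bool non_finite = !std::isfinite(x[i]) ||
                            !std::isfinite(mean[i]) ||
                            !std::isfinite(sd[i]);
    // sd == 0 is the degenerate case that is left to the CPU routine.
    if (non_finite || sd[i] == 0.0) return true;
  }
  return false;
}
```

A caller would evaluate this predicate once, before any GPU work is attempted, and take the CPU path for the whole vector if it returns true.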

Pattern 2: wrapper as an internal C++ component

In this pattern the kernel wrapper has no direct R interface. It is called from within a C++ dispatcher function alongside a CPU counterpart. The R interface belongs to a higher-level function that selects between the two based on a use_opencl flag passed in by the caller.

R caller
  │
  ▼
Exported R function  (e.g. Ex_EnvelopeEval)
  │  validates inputs, passes use_opencl flag
  ▼
.EnvelopeEval_cpp()   (internal R → C++ bridge, [[Rcpp::export]])
  ▼
EnvelopeEval_cpp()    (C++ dispatcher)
  │  if use_opencl && nmathopencl_has_opencl()
  │    → f2_f3_opencl()     (OpenCL kernel wrapper)
  │  else
  │    → f2_f3_non_opencl() (pure C++ CPU implementation)
  ▼
Result (qf, grad) returned regardless of path taken

The two implementations — f2_f3_opencl and f2_f3_non_opencl — share the same function signature and return the same data structure. The caller cannot tell from the return value which path was taken.
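The shape of this pattern can be shown with a toy computation. The sketch below is an illustration only: EvalResult, has_opencl, eval_cpu, eval_opencl and eval_dispatch are hypothetical names, the runtime check is hard-coded to false, and the "GPU" function is a placeholder that reuses the CPU code so the example runs anywhere. What it demonstrates is the structure: two implementations with one signature and one result type, and a dispatcher that selects between them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type standing in for the (qf, grad) pair.
struct EvalResult {
  std::vector<double> qf;
  std::vector<double> grad;
};

// Stand-in for the runtime check; a real build would query the OpenCL runtime.
bool has_opencl() { return false; }

// CPU implementation of a toy computation: qf = x^2, grad = 2x.
EvalResult eval_cpu(const std::vector<double>& x) {
  EvalResult r{std::vector<double>(x.size()), std::vector<double>(x.size())};
  for (std::size_t i = 0; i < x.size(); ++i) {
    r.qf[i] = x[i] * x[i];
    r.grad[i] = 2.0 * x[i];
  }
  return r;
}

// Placeholder for the kernel wrapper: same signature, same result type.
// In this sketch it simply reuses the CPU code.
EvalResult eval_opencl(const std::vector<double>& x) { return eval_cpu(x); }

// Dispatcher: the GPU path is taken only if requested AND available.
EvalResult eval_dispatch(const std::vector<double>& x, bool use_opencl) {
  EvalResult r = (use_opencl && has_opencl()) ? eval_opencl(x) : eval_cpu(x);
  // Either path must return results of the same shape.
  assert(r.qf.size() == x.size() && r.grad.size() == x.size());
  return r;
}
```

Because the caller only ever sees EvalResult, requesting the GPU path on a machine without OpenCL degrades to the CPU path without any change in the calling code.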

Pattern 1 in detail: `dnorm_opencl`

The R wrapper

dnorm_opencl in R/normal_opencl.R is the user-facing function. It mirrors the interface of stats::dnorm and adds opencl_parallel, fallback, and verbose arguments.

# R/normal_opencl.R  (simplified)

#' @export
dnorm_opencl <- function(x, mean = 0, sd = 1, log = FALSE,
                         opencl_parallel = NA, fallback = FALSE,
                         verbose = FALSE) {

  # ── Input validation ──────────────────────────────────────────────────────
  # These checks mirror stats::dnorm behavior.
  if (!is.numeric(x))    stop("`x` must be numeric.")
  if (!is.numeric(mean)) stop("`mean` must be numeric.")
  if (!is.numeric(sd))   stop("`sd` must be numeric.")
  if (length(x) == 0L)   return(numeric(0))

  # ── Recycling (like stats::dnorm) ─────────────────────────────────────────
  len  <- max(length(x), length(mean), length(sd))
  xv   <- rep_len(as.double(x),    len)
  mv   <- rep_len(as.double(mean), len)
  sv   <- rep_len(as.double(sd),   len)
  logv <- rep_len(log,             len)

  # ── R-level fallback function ─────────────────────────────────────────────
  # Called when inputs contain conditions the GPU path cannot handle,
  # when OpenCL is unavailable, or when the GPU call fails and fallback = TRUE.
  fallback_full <- function() {
    stats::dnorm(x, mean = mean, sd = sd, log = log)
  }

  # ── R-level conditions that force the CPU path ────────────────────────────
  if (any(!is.finite(xv) | !is.finite(mv) | !is.finite(sv))) {
    return(fallback_full())   # stats::dnorm handles NaN, Inf, NA
  }
  if (any(sv < 0)) {
    stop("`sd` must be non-negative.", call. = FALSE)
  }
  if (any(sv == 0)) {
    return(fallback_full())   # degenerate case; stats::dnorm handles it
  }

  # ── Dispatch: try GPU, fall back to CPU on failure if fallback = TRUE ─────
  log_int <- as.integer(logv)
  opc     <- .encode_opencl_parallel(opencl_parallel)

  .opencl_try_or_fallback(
    opencl_expr   = function() .dnorm_opencl(xv, mv, sv, log_int, opc, verbose),
    fallback_expr = fallback_full,
    fallback      = fallback,
    verbose       = verbose,
    fn_name       = "dnorm_opencl"
  )
}

.dnorm_opencl (dot-prefixed) is the internal Rcpp-exported symbol for the C++ kernel wrapper. It is not part of the public API; it exists only to make the C++ function callable from R.
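The recycling step in the R wrapper (rep_len to a common length) is what lets the C++ side assume equal-length inputs. For a package that performs this step in C++ instead, a minimal sketch might look like the following. The name recycle_to is hypothetical, and the behavior shown is only the cyclic repetition of rep_len, not R's full recycling rules or warnings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical analogue of R's rep_len(): repeat `v` cyclically until the
// result has exactly `len` elements. An empty `v` is only valid for len == 0.
std::vector<double> recycle_to(const std::vector<double>& v, std::size_t len) {
  assert(!v.empty() || len == 0);
  std::vector<double> out(len);
  for (std::size_t i = 0; i < len; ++i) {
    out[i] = v[i % v.size()];
  }
  return out;
}
```

Recycling a length-1 sd to the length of x, for example, yields a vector in which every element equals that single value, so the kernel can index all inputs with the same work-item id.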

The `.opencl_try_or_fallback` helper

This helper encapsulates the runtime dispatch logic that every Pattern 1 wrapper shares:

# R/opencl_linkage_utils.R

.opencl_try_or_fallback <- function(opencl_expr, fallback_expr,
                                    fallback, verbose, fn_name) {
  if (!nmathopencl_has_opencl()) {
    # OpenCL not available in this build or session — go straight to CPU.
    if (verbose)
      message(sprintf("[%s] OpenCL unavailable; using CPU fallback.", fn_name))
    return(fallback_expr())
  }

  # OpenCL available: try the GPU path.
  out <- tryCatch(opencl_expr(), error = function(e) e)

  if (inherits(out, "error")) {
    if (fallback) {
      # GPU call failed and the caller requested a fallback.
      if (verbose) {
        message(sprintf("[%s] OpenCL call failed; using CPU fallback.", fn_name))
        message(out$message)
      }
      return(fallback_expr())
    }
    stop(out$message, call. = FALSE)  # no fallback requested — propagate error
  }

  out  # GPU call succeeded
}

The design makes the fallback behavior explicit and controllable:

fallback = FALSE (default): if the GPU call fails, the error propagates to the caller. The caller sees an actual error rather than silently receiving CPU results.
fallback = TRUE: if the GPU call fails, the CPU path is used transparently. Useful in batch workflows where any result is better than an error.
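The same three-way logic (unavailable, failed with fallback, failed without fallback) can also be expressed in C++ for packages that keep their dispatch below the R layer. This is a sketch under assumptions, not nmathopencl code: try_or_fallback and its parameters are hypothetical, availability is passed in as a flag rather than queried, and the two paths are supplied as std::function callables.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical C++ analogue of the R helper described above.
Vec try_or_fallback(bool opencl_available,
                    const std::function<Vec()>& opencl_expr,
                    const std::function<Vec()>& fallback_expr,
                    bool fallback, bool verbose,
                    const std::string& fn_name) {
  assert(opencl_expr && fallback_expr);  // both callables must be set

  if (!opencl_available) {
    // No OpenCL: go straight to the CPU path, whatever `fallback` says.
    if (verbose)
      std::cerr << "[" << fn_name << "] OpenCL unavailable; using CPU fallback.\n";
    return fallback_expr();
  }

  try {
    return opencl_expr();  // GPU call succeeded
  } catch (const std::exception& e) {
    if (!fallback) throw;  // no fallback requested: rethrow the original error
    if (verbose)
      std::cerr << "[" << fn_name << "] OpenCL call failed (" << e.what()
                << "); using CPU fallback.\n";
    return fallback_expr();
  }
}
```

As in the R helper, the fallback flag only governs what happens after a failed GPU call; an unavailable runtime always leads to the CPU path.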

The C++ kernel wrapper

The C++ kernel wrapper .dnorm_opencl is exported to R via // [[Rcpp::export(name = ".dnorm_opencl")]]. It is the minimal C++ entry point: it converts inputs, runs the GPU path if available, and returns zeros if not.

// src/kernel_wrappers.cpp  (within nmathopencl namespace)

// [[Rcpp::export(name = ".dnorm_opencl")]]
Rcpp::NumericVector dnorm_opencl(
    const Rcpp::NumericVector& x,
    const Rcpp::NumericVector& mean,
    const Rcpp::NumericVector& sd,
    const Rcpp::IntegerVector& give_log,
    int                        opencl_parallel_code,
    bool                       verbose
) {
  const int len = x.size();
  Rcpp::NumericVector out(len);   // zero-initialized

#ifdef USE_OPENCL
  if (!nmathopencl_has_opencl() || len == 0) return out;

  try {
    d_givelog_ndrange_kernel_fill(
        "src/dnorm_kernel.cl", "dnorm_kernel",
        len, {&x, &mean, &sd}, give_log, out, verbose);
  } catch (const std::exception& e) {
    if (verbose) Rcpp::Rcout << e.what() << "\n";
    throw;
  }
#endif

  return out;
}

Note that the C++ wrapper itself returns zeros when !nmathopencl_has_opencl(). It does not call stats::dnorm. The R wrapper is responsible for the fallback to stats::dnorm; the C++ wrapper simply reports “no GPU result” via zeros. This keeps the C++ layer free of any R evaluation machinery.
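The compile-time guard can be isolated in a few lines. The sketch below is a stripped-down illustration with hypothetical names (has_opencl_runtime, guarded_wrapper) and no Rcpp types. Compiled without USE_OPENCL, the preprocessor removes the guarded section entirely and the function returns the zero-initialized vector. Compiled with USE_OPENCL, the stub runtime check still reports no device, so the result is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the runtime availability check (always false in this sketch).
bool has_opencl_runtime() { return false; }

// Hypothetical minimal wrapper: zeros unless a GPU result is produced.
std::vector<double> guarded_wrapper(const std::vector<double>& x) {
  std::vector<double> out(x.size());  // value-initialized, i.e. all 0.0

#ifdef USE_OPENCL
  if (!has_opencl_runtime() || x.empty()) return out;
  // A real wrapper would dispatch the kernel runner here and fill `out`.
#endif

  assert(out.size() == x.size());
  return out;
}
```

Zeros returned this way cannot be distinguished from a genuine all-zero result, which is one reason to keep the availability check in the calling layer, as the R wrapper does, rather than to interpret the zeros.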

Pattern 2 in detail: `f2_f3_opencl`

The exported R function

Ex_EnvelopeEval (in R/ex_glmbayes.R) is the user-facing function. It accepts a use_opencl flag and delegates entirely to the C++ dispatcher:

# R/ex_glmbayes.R

#' @export
Ex_EnvelopeEval <- function(G4, y, x, mu, P, alpha, wt,
                            family, link,
                            use_opencl = FALSE,
                            verbose    = FALSE) {
  # Input validation (matrix/vector type checks) ...

  .EnvelopeEval_cpp(G4, y, x, mu, P, alpha, wt,
                    family, link, use_opencl, verbose)
}

There is no R-level fallback function here. The fallback is handled entirely inside the C++ dispatcher.

The C++ dispatcher

EnvelopeEval_cpp (inside src/) receives use_opencl and decides which C++ implementation to call:

// src/ (conceptual structure — details in actual source)

Rcpp::List EnvelopeEval_cpp(
    Rcpp::NumericMatrix G4, Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt,
    std::string family, std::string link,
    bool use_opencl, bool verbose
) {
  // Prepare shared inputs common to both paths (this elided step is
  // assumed to define `b`, which is passed to both implementations below) ...

  if (use_opencl && nmathopencl_has_opencl()) {
    // GPU path: call the OpenCL kernel wrapper
    return ex_glmbayes::opencl::f2_f3_opencl(
        family, link, b, y, x, mu, P, alpha, wt, verbose);
  } else {
    // CPU path: call the pure C++ implementation
    return ex_glmbayes::f2_f3_non_opencl(
        family, link, b, y, x, mu, P, alpha, wt);
  }
}

Both f2_f3_opencl and f2_f3_non_opencl return an Rcpp::List with identical structure: list(qf = numeric(m1), grad = matrix(m1, l2)). The dispatcher’s caller cannot tell from the return value which path was used.

Why a dedicated CPU implementation?

For Pattern 1 (distribution functions), the CPU fallback is an existing well-tested function from stats::. No separate CPU implementation is needed.

For the GLM gradient computation, no equivalent off-the-shelf CPU function exists. f2_f3_non_opencl is a pure C++ implementation of the same mathematical computation, written without any OpenCL dependency. It compiles on every platform and produces results that agree with the GPU path to within double-precision rounding; the two paths are not expected to match bit for bit, because the order of floating-point operations can differ between devices.

Having both implementations under explicit control also makes it possible to benchmark them directly: use_opencl = FALSE forces the CPU path even on a GPU-equipped machine.
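When both paths can be forced explicitly, checking that they agree becomes a short comparison. The helper below is a sketch of one way to do that, not part of the package: the names max_rel_diff and paths_agree are hypothetical, and the default tolerance of 1e-12 is an arbitrary assumption that would need tuning for a real computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest elementwise difference between two result
// vectors, scaled by max(1, |a|, |b|) so it behaves like a relative error
// for large values and an absolute error near zero.
double max_rel_diff(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() == b.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double scale = std::max({1.0, std::fabs(a[i]), std::fabs(b[i])});
    worst = std::max(worst, std::fabs(a[i] - b[i]) / scale);
  }
  return worst;
}

// The tolerance is an assumption for illustration, not a package setting.
bool paths_agree(const std::vector<double>& cpu, const std::vector<double>& gpu,
                 double tol = 1e-12) {
  return max_rel_diff(cpu, gpu) <= tol;
}
```

In a test suite, the two inputs would be the results of the same call made once with use_opencl = FALSE and once with use_opencl = TRUE.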

Choosing between the two patterns

The choice between Pattern 1 and Pattern 2 comes down mainly to two things: whether an existing CPU computation is available to fall back to, and whether the computation has a natural direct use from R.

Criterion	Pattern 1 (R interface + R fallback)	Pattern 2 (C++ dispatch + CPU implementation)
Existing CPU function available?	Yes (`stats::`, `base::`, etc.)	No; need to write the CPU implementation
Does the computation have a direct R use?	Yes (called directly from R)	Often not (called from a C++ simulation loop)
Where does fallback live?	R level (`fallback_full()`) + runtime (`nmathopencl_has_opencl()`)	C++ level (`use_opencl && nmathopencl_has_opencl()`)
Caller can request optional fallback?	Yes (`fallback = TRUE/FALSE` argument)	Caller controls via `use_opencl` flag
Wrapper directly R-callable?	Yes (exported via `[[Rcpp::export]]`)	Not necessarily — may be purely internal C++

Both patterns are designed so that the package compiles and runs correctly whether or not OpenCL is present. The GPU path is always optional; the CPU path always produces a valid (if unaccelerated) result.

Naming conventions

nmathopencl uses a consistent naming scheme to make the role of each function clear:

Name	Type	Role
`dnorm_opencl`	Exported R function	User-facing API; validates inputs; manages fallback
`.dnorm_opencl`	Internal R → C++ bridge	Rcpp export; positional R → C++ call only
`nmathopencl::dnorm_opencl`	C++ kernel wrapper	`#ifdef` guard; type conversion; runner dispatch
`nmathopencl::dnorm_kernel_runner`	C++ kernel runner	Full OpenCL lifecycle; `#ifdef USE_OPENCL` only
`Ex_EnvelopeEval`	Exported R function	User-facing API; passes `use_opencl` flag
`.EnvelopeEval_cpp`	Internal R → C++ bridge	Positional R → C++ call only
`f2_f3_opencl`	C++ kernel wrapper	OpenCL path; used inside dispatcher
`f2_f3_non_opencl`	C++ CPU implementation	CPU path; used inside same dispatcher

The leading dot on internal R functions signals that they are not part of the public API and will not appear in help search or autocompletion.

For your own package, a consistent analogous scheme might be:

myfunc_opencl()        # exported R function  (if direct R use)
.myfunc_opencl()       # internal R → C++ bridge
mypkg::myfunc_opencl() # C++ kernel wrapper (in namespace)
mypkg::myfunc_runner() # C++ kernel runner   (in namespace, #ifdef only)
mypkg::myfunc_cpu()    # C++ CPU fallback    (if Pattern 2)

Summary

Every kernel wrapper needs a CPU path. The two patterns differ in where that path lives and who controls the dispatch:

Pattern 1 puts the fallback logic in R, using the existing stats:: ecosystem. It is the right choice when the computation mirrors an existing R function and has direct R users.
Pattern 2 puts the fallback logic in C++, alongside a dedicated CPU implementation. It is the right choice when the computation is novel, when it is called from a C++ simulation loop rather than directly from R, or when benchmarking between the two paths is important.

In both patterns the OpenCL infrastructure — the runner and the kernel — is identical. What differs is only how the wrapper is wired into the rest of the package.

Chapter 12 describes the nmathopencl R API in full, showing how the distribution-function wrappers are documented and organized. Chapter 10 works through the ex_glmbayes pattern end-to-end.
