- `R6` moved from Suggests to Imports — the package now loads correctly when R6 is not pre-installed.
- `LearnerClassifGGML` and `LearnerRegrGGML` R6 class definitions are now deferred until mlr3/R6/paradox are available, preventing namespace load failure in environments without these optional packages.
- `ggml_conv_2d` (IM2COL+GEMM) replaced with `ggml_conv_2d_direct` (`GGML_OP_CONV_2D`) in `onnx_ggml.c` — SuperResolution GPU time 344 ms → 5 ms (~70×).
- `wg512` pipeline threshold lowered from `>1024` to `>=512` — improves attention softmax at seq_len 512–1024.
- New scripts: `benchmark_ops.R` (36-op CPU/GPU micro-benchmark) and `profile_onnx_superres_gpu.R` (GPU profiler for SuperResolution).
- `USE_SUBGROUP_NO_SHMEM` path added to
`mul_mmq.comp` — on wavefront-64 devices (RDNA4, `subgroup_size=64`) the `block_a` weight tile is loaded directly into registers via `subgroupShuffle` / `subgroupBroadcast`, eliminating the shared-memory round-trip in `block_a_to_shmem` → `block_a_to_registers`. Measured on an RX 9070: Flux 768×768 sampling 22.38 s → 20.80 s (~7% end-to-end; sampling is not pure matmul, so the gain on isolated Q4_K GEMM is higher).
- `subgroup_no_shmem` — `ggml_vulkan_device_caps()` now returns this flag (logical), indicating whether the shuffle mmq path is active.
- `GL_EXT_shader_subgroup_extended_types_float16` added to `mul_mmq.comp` under `#ifdef USE_SUBGROUP_NO_SHMEM && FLOAT16` — required for `subgroupShuffle` on `float16_t` components of `f16vec2`.
- `ggml_vulkan_device_caps()` extended — `wavefronts_per_simd` and `arch` fields added; all 14 fields are now documented.
- `pipeline_dequant_mul_mat_mat_q8_1_no_shmem` — registered in the device struct; selected at dispatch when `subgroup_size == 64` and `src0` is Q4_K / Q5_K / Q6_K; falls back gracefully to the standard mmq pipeline when not compiled.
- `GGML_TYPE_Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K` exported — these constants were defined in `tensors.R` but missing from `NAMESPACE`; `roxygen2::roxygenise()` now includes them.
- `inst/examples/vulkan_caps.R` extended —
a new section shows `USE_SUBGROUP_NO_SHMEM: ACTIVE/INACTIVE` with an explanation of the conditions.
- `tests/testthat/test-vulkan.R` adds smoke tests for Q4_K / Q5_K / Q6_K quantized matmul via Vulkan (no NaN/Inf, correct shape); `test-vulkan-caps.R` asserts `integer_dot_product=TRUE` on RDNA4.
- `get_device_architecture()` now identifies RDNA4 by `wavefrontsPerSimd == 16` (distinct from RDNA3's 8 and RDNA1's 20). Previously GFX1201 fell through to `AMD_RDNA3` due to an identical subgroup size range (min=32, max=64).
- `VK_AMD_shader_core_properties` queried at device init — `wavefronts_per_simd` is now stored in `vk_device_struct` and read once during `ggml_vk_get_device()`, not just inside `get_device_architecture()`.
- `SHADERGEN_DEFINES` propagated to the C++ compiler — configure now appends `SHADERGEN_DEFINES` (which includes `-DGGML_VULKAN_COOPMAT_GLSLC_SUPPORT`) to `VULKAN_CPPFLAGS`. Previously these defines were only passed to `vulkan-shaders-gen`, so all `#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)` blocks in `ggml-vulkan.cpp` were dead code at runtime.
- `ggml_backend_vk_get_device_caps()`
extended — now returns `subgroup_min_size`, `subgroup_max_size`, `wavefronts_per_simd`, and `arch` (string) in addition to the original 5 fields. The R function `ggml_vulkan_device_caps()` exposes all 9 fields.
- `coopmat_support=YES`, `coopmat1_fa_support=YES` — the KHR cooperative-matrix GEMM and flash-attention paths are now active.
- `GGML_OP_FLASH_ATTN_EXT` now accepts
K/V tensors in Q4_K format on Vulkan. Previously Q4_K fell back to the CPU; it now runs fully on the GPU via both the scalar and cooperative-matrix (KHR) paths.
- `dequantize4_q4k()` added to `flash_attn_base.glsl` — decodes 4 consecutive Q4_K elements from a `block_q4_K_packed16` block: reconstructs the 6-bit scale and min for the sub-block, reads two consecutive uint16 values from `qs[]`, and extracts four nibbles. Works for both K and V bindings.
- `flash_attn.comp` (FA_SCALAR) and `flash_attn_cm1.comp` (FA_COOPMAT1) are now compiled with `DATA_A_Q4_K` / `BLOCK_SIZE=QUANT_K_Q4_K=256`. Four SPIR-V variants are generated: f32acc and f16acc for each path.
- `vulkan-shaders-gen.cpp` — `q4_k` added to the FA scalar and coopmat1 generation conditions.
- `ggml-vulkan.cpp` — `CREATE_FA(GGML_TYPE_Q4_K, ...)` added for FA_SCALAR and FA_COOPMAT1; `GGML_TYPE_Q4_K` added to the supported-types switch in `ggml_backend_vk_device_supports_op`.
- Best performance when the K head size (`HSK`) is a multiple of 256 (e.g. DeepSeek-V2/V3 MLA). For HSK=128 (Llama, Mistral) the shader is functionally correct but pads the inner loop to 256.
- `"ggml"` engine for
`parsnip::mlp()` — registers a `"ggml"` engine for both classification and regression modes. After `library(ggmlR)` (with parsnip installed), use:

  ```r
  mlp(hidden_units = 64, epochs = 100) |>
    set_engine("ggml", batch_size = 32, backend = "auto") |>
    set_mode("classification")
  ```

- Engine arguments: `batch_size`, `backend`, `verbose`, `validation_split`, `optimizer`, `callbacks`. All `mlp()` parameters (`hidden_units`, `epochs`, `dropout`, `activation`, `learn_rate`) are mapped through.
- `backend = "gpu"` in parsnip — `"gpu"` is now correctly translated to `"vulkan"` inside `ggmlr_parsnip_fit_classif()` and `ggmlr_parsnip_fit_regr()`. Previously the string was passed through and caused an unknown-backend error.
- `learn_rate` callback — the `learn_rate` argument from `mlp()` is applied via an internal `on_epoch_begin` callback that sets the optimizer learning rate at the start of epoch 1. Works for both `"adam"` and `"sgd"` optimizers.
- New Suggests: `parsnip`, `tibble`, `rlang`, `dials`.
- New example: `inst/examples/tidymodels_integration.R` — CPU vs GPU comparison for iris classification and mtcars regression using the parsnip engine.
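A fuller sketch of the workflow the engine enables. The `fit()`/`predict()` calls below are standard parsnip generics, not anything documented above, so the exact output columns are an assumption:

```r
library(parsnip)
library(ggmlR)  # registers the "ggml" engine on load

spec <- mlp(hidden_units = 64, epochs = 100, learn_rate = 0.01) |>
  set_engine("ggml", batch_size = 32, backend = "auto") |>
  set_mode("classification")

# Standard parsnip fit/predict, with iris as in the shipped example.
fitted <- fit(spec, Species ~ ., data = iris)
preds  <- predict(fitted, new_data = iris)  # assumed: tibble with .pred_class
```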
- `LearnerClassifGGML` / `LearnerRegrGGML` always defined — the R6 class definitions are now unconditional (no longer wrapped in `if (requireNamespace("mlr3"))`). This ensures the classes are always present in the ggmlR namespace, so `ggmlR:::.register_mlr3()` can be called reliably from vignettes and tests regardless of package load order.
- `.onLoad()` no longer uses `mlr3misc::register_namespace_callback()` (which had a bug in v0.21.0 causing the R CMD check warning "namespace can be unloaded cleanly"). Registration now uses `isNamespaceLoaded()` + `setHook()` directly, covering both the "mlr3 already loaded" and "mlr3 loads after ggmlR" scenarios.
- `mlr3misc` removed from Suggests — no longer needed.
- `inst/examples/mlr3_integration.R` — CPU vs GPU comparison
for iris classification and mtcars regression, plus 3-fold CV.
- `marshal_model.*` / `unmarshal_model.*` S3 methods no longer appear in `NAMESPACE` as `S3method(mlr3::marshal_model, ...)` — this caused `Error: namespace 'marshal_model' not found` on package load. The methods are now registered exclusively via `registerS3method()` in `.onLoad()`.
- `test-parsnip.R` — new tests: `learn_rate` applied without error; `backend="gpu"` accepted and converted to `"vulkan"` (skipped when Vulkan is unavailable).
- `test-mlr3-learner.R` — explicit `ggmlR:::.register_mlr3()` call at the top of the file for reliable registration in the R CMD check test process.
- Vignettes now use the `Rcpp::asis` vignette engine; no rendering on CRAN runners.
- `rmarkdown` removed from Suggests (no longer needed).
- `ggml_graph_print()` output captured in `test-graph-utils.R`; C-level broadcast warnings captured in the ONNX broadcast and resize-broadcast tests.
- `gguf_load(path)` — opens a GGUF file
(v2/v3) and reads all metadata and tensor descriptors. Returns an S3 object of class `"gguf"`.
- `gguf_metadata(x)` — returns all key-value metadata pairs as a named list (architecture, tokenizer config, quantization info, etc.).
- `gguf_tensor_names(x)` — lists all tensor names in the file.
- `gguf_tensor_info(x, name)` — returns shape, type, and size in bytes for a single tensor.
- `gguf_tensor_data(x, name)` — dequantizes (if needed) and returns tensor weights as an R numeric array with correct dimensions.
- `gguf_free(x)` — explicitly frees the GGUF context (also called by GC).
- `print.gguf()` method shows file version, tensor count, and metadata count.
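The GGUF functions above compose as follows. A hedged sketch: the file path is hypothetical, and the exact metadata keys depend on the model:

```r
library(ggmlR)

# Hypothetical model file; any GGUF v2/v3 file should work.
g <- gguf_load("model.gguf")

print(g)                              # version, tensor count, metadata count
meta <- gguf_metadata(g)              # named list of key-value pairs
nms  <- gguf_tensor_names(g)          # character vector of tensor names
info <- gguf_tensor_info(g, nms[1])   # shape / type / size of one tensor
w    <- gguf_tensor_data(g, nms[1])   # dequantized numeric array

gguf_free(g)                          # optional; GC also frees the context
```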
- Push descriptors (`VK_KHR_push_descriptor`): unchanged — when the extension is available and `maxPushDescriptors >= 12`, descriptor sets are pushed directly into the command buffer via `pushDescriptorSetKHR()`, eliminating descriptor pool overhead. Falls back to the traditional descriptor-pool path on hardware without the extension.
- `fit()` now accepts a `callbacks` parameter for sequential models (passed through to `ggml_fit_sequential()`).
- Test files: `test-gguf.R`, `test-graph-utils.R`, `test-inplace-ops.R`, `test-keras-api.R`, `test-misc-ops.R`, `test-model-ops.R`, `test-print-methods.R`, `test-tensor-utils.R`, `test-threading.R`, `test-autograd-missing.R`, `test-nn-functional-missing.R`, `test-quants-missing.R`.
- `src/` and
`inst/include/` headers: `configure` and `configure.win` now automatically sync all public headers from `src/` to `inst/include/` at install time. Previously, changes to `GGML_MAX_DIMS` (4→5) and other structs in `src/ggml.h` were not propagated to the exported headers, causing segfaults in downstream packages (e.g. sd2R).
- New test `tests/testthat/test-headers-sync.R` verifies that the `inst/include/` headers remain in sync with the `src/` headers and that `GGML_MAX_DIMS` is consistent.
- `ggml_view_5d()` — new API function for creating 5D views with explicit strides, extending the existing 1D–4D view family. Uses the existing `ggml_view_impl()` internally.
- `ggml_repeat_5d()` — new API function for tiling tensors up to 5D. CPU kernels (`ggml_compute_forward_repeat_f32`, `ggml_compute_forward_repeat_f16`) updated with a 5th loop dimension. Vulkan dispatch collapses dim3×dim4 into push constants transparently (no shader changes needed — push constants remain at 128 bytes).
- `onnx_ggml.c` (~20 sites):
`ne[GGML_MAX_DIMS]` arrays, `switch` extended with `case 5: new_tensor_5d`.
- Broadcasting (`onnx_broadcast_align`): all reshape/new_tensor calls use dimension-aware helpers.
- Reshape: `onnx_reshape_nd()`.
- Tile: `ggml_repeat_5d()`.
- `tmap_put_nd()` and `slice_fill` arrays updated to `GGML_MAX_DIMS`.
- New helpers `onnx_reshape_nd()`, `onnx_new_tensor_nd()`, `ne_product()` — eliminate switch/case duplication.
- 5D transpose remains unsupported (a `ggml_permute` API limitation).
- `ConstantOfShape` previously read the `value` TensorProto attribute as float regardless of `data_type`. When `data_type=7` (INT64), the 8-byte int64 was reinterpreted as a 4-byte float, producing garbage values (~1.4e-45 instead of 1). This broke attention-mask generation (fill=0 instead of 1) and position-ID generation (NonZero on zeros = empty).
- `ConstantOfShape` now checks `data_type` and correctly handles INT64, INT32, DOUBLE, and FLOAT value attributes.
- Gather is implemented via `ggml_get_rows`, which only supports 2D data. For axis=0 on rank>2 (e.g. the CaiT QKV split on `[48,576,6,3]`), the tensor is now reshaped to 2D, gathered, and reshaped back.
- `GGML_OP_SCATTER_ELEMENTS` added to the ggml engine with both a CPU kernel and a Vulkan compute shader.
- Vulkan shader (`scatter_elements.comp`):
two variants compiled at install time — `scatter_elements_none` (overwrite) and `scatter_elements_add` (atomicAdd via `GL_EXT_shader_atomic_float`). Data is copied to the output via `vkCmdCopyBuffer` with a pipeline barrier before the scatter dispatch.
- The ONNX `ScatterElements` op is mapped with `axis=0` and `reduction="none"/"add"` attributes. Indices are cast to I32, and updates/data to F32, automatically.
- Implemented as a `ggml_map_custom3` op. The CPU kernel computes the 2D relative position bias directly: `bias[b,hq,wq,hk,wk] = dot(x, W_h) + dot(x_transposed, W_w)`.
- `detect_pos_embed_blocks()` identifies contiguous node ranges with `/pos_embed/` in output names, extracts the W_h/W_w initializer shapes to determine H, W, C, and validates the F32 data type.
- In `onnx_ggml_run()`, input data is copied into pinned memory before `ggml_backend_tensor_set()` — the Vulkan driver detects the pinned source pointer and performs a direct DMA transfer to VRAM, bypassing the internal staging copy.
- If `ggml_backend_vk_host_buffer_type()` returns NULL or the buffer is too small, the standard staging path is used transparently.
- `onnx_device_info()`: added NULL guards for the `ctx->graph` and `n_nodes == 0` edge cases that caused a segfault when called on models before the first inference run.
- `ggml_predict()` with stochastic
dropout: `nn_build_graph()` now receives `training = FALSE` during inference, so stochastic Bernoulli dropout is disabled at predict time. Previously, `stochastic = TRUE` dropout layers applied random masks during inference, degrading accuracy.
- `ggml_fit()` return value: the return value of `ggml_fit()` must be assigned back to `model` to obtain the trained weights (`model <- ggml_fit(...)`). This is now clarified in all examples and documentation. Using `history <- ggml_fit(...)` without reassigning `model` leaves the model with untrained weights.
- `ggml_evaluate()` return value: now includes `n_samples` in addition to `loss` and `accuracy`. Metrics are computed on all samples without truncation (via `ggml_predict()` internally).
- `inst/examples/titanic_classification.R` — new end-to-end binary classification example on the Titanic dataset. Demonstrates feature engineering (Title, FamilySize, IsAlone), a stratified train/val split, one-hot encoding, dropout regularization, and manual validation metrics (accuracy, precision, recall, F1, confusion matrix). Achieves ~82% validation accuracy.
- Weights are uploaded once into `weight_buf` and
never re-transferred between runs. The previous architecture reloaded all weights before every `onnx_run()` call — now eliminated entirely.
- Separate `ctx_weight` / `ctx` contexts: weight tensors live in a permanent GPU buffer that the scheduler never aliases; compute tensors are managed independently by `ggml_backend_sched`.
- `onnx_device_info()` — scheduler diagnostic: number of splits, GPU/CPU op counts, CPU-only op list.
- Benchmark script (`inst/examples/benchmark_onnx.R`): proper VRAM cleanup between models via `rm()` + `gc()`.
- `onnx_load(path, device, input_shapes)` — load an ONNX
model file, build a ggml computation graph, and allocate tensors on the Vulkan GPU or CPU. Weights are loaded via a memory-mapped file (zero-copy where possible).
- `onnx_run(model, inputs)` — run inference on a loaded ONNX model with named input data.
- `onnx_inputs(model)` — list the expected input tensor names and shapes.
- `onnx_summary(model)` — return model metadata (IR version, opset, producer, ops used).
- `print.onnx_model()` — formatted summary of a loaded ONNX model.
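A hedged sketch of the ONNX loader API above; the model path and input name are hypothetical and depend on the model file:

```r
library(ggmlR)

# Hypothetical model with a dynamic batch dimension fixed at load time.
model <- onnx_load("mobilenetv2.onnx", device = "vulkan",
                   input_shapes = list(input = c(1L, 3L, 224L, 224L)))

onnx_inputs(model)                        # expected input names and shapes
onnx_summary(model)                       # IR version, opset, producer, ops

x   <- array(runif(1 * 3 * 224 * 224), dim = c(1, 3, 224, 224))
out <- onnx_run(model, list(input = x))   # named inputs -> outputs
```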
- `input_shapes` parameter for models with dynamic dimensions: specify fixed shapes at load time (e.g. `input_shapes = list(image = c(1L, 3L, 224L, 224L))`).
- `auto_pad` attribute (SAME_UPPER, SAME_LOWER) supported for Conv and pooling ops.
- `input_shapes` (Conv, Reshape, Transpose).
- `input_shapes` (1180 nodes).
- `input_shapes` (482 nodes: MatMul, LayerNorm, GELU, Softmax).
- Fixed `inst/lib/libggml.a` breaking static linking from dependent packages (e.g. llamaR).
- `dp_train(make_model, data, loss_fn, forward_fn, target_fn, n_gpu, n_iter, lr, max_norm, verbose)`
— data-parallel training across multiple replicas. Weights are broadcast from replica 0 before the first step; gradients are averaged across replicas each iteration; weights are re-broadcast after each optimizer update. Returns `list(params, loss_history, model)`.
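Given the signature above, usage might look like the following sketch. All callback bodies are hypothetical placeholders; only the `dp_train()` argument names come from this changelog:

```r
library(ggmlR)

# Hypothetical callbacks; their exact contracts are not spelled out here,
# so treat the shapes below as assumptions.
make_model <- function() ag_sequential(ag_linear(4, 16), ag_linear(16, 3))
loss_fn    <- function(pred, target) ag_softmax_cross_entropy_loss(pred, target)
forward_fn <- function(model, batch) model$forward(batch$x)
target_fn  <- function(batch) batch$y   # 0-based class indices

res <- dp_train(make_model, data = my_batches, loss_fn = loss_fn,
                forward_fn = forward_fn, target_fn = target_fn,
                n_gpu = 2, n_iter = 100, lr = 1e-3,
                max_norm = 1.0, verbose = TRUE)
res$loss_history   # averaged loss per iteration
```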
- `ag_mul` and `ag_sub` now support CPU broadcast: `[d×s] * [1×s]` and `[d×s] * [d×1]` shapes work correctly with proper gradient reduction.
- `ag_softmax_cross_entropy_loss` accepts integer target vectors (0-based class indices) and converts them to one-hot automatically.
- `ggml_sum_rows` f16 on Vulkan: F16→F16 dispatch now supported natively (no CPU fallback).
- `ag_tensor()` / `ag_param()` —
environment-backed tensors with reference semantics; in-place optimizer updates are visible to all references.
- `with_grad_tape({ ... })` — enables the global gradient tape for the enclosed forward pass.
- `backward(loss)` — reverse-mode automatic differentiation; returns a gradient environment keyed by tensor id.
- `ag_matmul`, `ag_add` (with bias broadcast), `ag_sub`, `ag_mul`, `ag_scale`.
- `ag_relu`, `ag_sigmoid`, `ag_tanh`, `ag_softmax`.
- `ag_sum`, `ag_mean`, `ag_log`, `ag_exp`, `ag_pow`, `ag_clamp`.
- `ag_reshape`, `ag_transpose`.
- `ag_mse_loss`, `ag_cross_entropy_loss`, `ag_softmax_cross_entropy_loss` (numerically stable, fused).
- `optimizer_sgd()` — SGD with optional momentum.
- `optimizer_adam()` — Adam with bias-corrected moment estimates.
- `ag_linear()` — Glorot-initialised dense layer (closure-based; returns `$forward`, `$params()`).
- `ag_gradcheck()` — central finite-difference gradient checker (like `torch.autograd.gradcheck`).
- `ag_sequential(...)` — ordered layer container; collects all parameters for the optimizer.
- `ag_dropout(rate)` — inverted dropout; identity in eval mode.
- `ag_batch_norm(num_features)` — batch normalisation with running statistics and learnable γ/β.
- `ag_embedding(vocab_size, dim)` — token lookup with scatter-add backward.
- `ag_train(model)` / `ag_eval(model)` — switch all sub-layers between train and eval mode.
- `ag_dataloader(x, y, batch_size, shuffle, col_major)` — mini-batch iterator with shuffle and an `$epoch()` helper.
- `lr_scheduler_step(optimizer, step_size, gamma)` — step-decay learning rate.
- `lr_scheduler_cosine(optimizer, T_max, lr_min, restart)` — cosine annealing (with optional SGDR warm restarts).
- `clip_grad_norm(params, grads, max_norm)` — clips all gradients by global L2 norm in place.
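The autograd pieces above compose into a conventional tape-based training step. A hedged sketch: the `ag_tensor()` constructor argument, `model$forward`, `model$params()`, and the optimizer step call are assumptions, since the changelog lists names but not full contracts:

```r
library(ggmlR)

# Toy data: 4 features, 32 samples, 3 classes (0-based integer targets).
x <- ag_tensor(matrix(runif(4 * 32), 4, 32))  # constructor arg assumed
y <- sample(0:2, 32, replace = TRUE)

model <- ag_sequential(ag_linear(4, 16), ag_linear(16, 3))
opt   <- optimizer_sgd()

# One step: forward under the tape, backward, clip, then update.
with_grad_tape({
  pred <- model$forward(x)
  loss <- ag_softmax_cross_entropy_loss(pred, y)
})
grads <- backward(loss)                 # gradient env keyed by tensor id
clip_grad_norm(model$params(), grads, 1.0)
# opt$step(model$params(), grads)       # assumed optimizer interface
```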
- `ggml_layer_lstm()` — LSTM recurrent layer (unrolled BPTT).
- `ggml_layer_gru()` — GRU recurrent layer (unrolled BPTT).
- `ggml_layer_global_max_pooling_2d()` — reduces `[H,W,C]` to `[C]` via max pooling.
- `ggml_layer_global_average_pooling_2d()` — reduces `[H,W,C]` to `[C]` via average pooling.
- `ggml_save_model()` — saves a full model (architecture + weights) to an RDS file.
- `ggml_load_model()` — restores a model saved with `ggml_save_model()`.
- `ggml_dense()`, `ggml_conv_2d()`, `ggml_conv_1d()`, `ggml_batch_norm()`, `ggml_embedding()`, `ggml_lstm()`, `ggml_gru()` — layer-object constructors returning a reusable `ggml_layer` object.
- `ggml_apply(tensor, layer)` — applies a `ggml_layer` object to a tensor node; weights are shared by object identity.
- `ggml_layer_dropout()` — dropout with deterministic or stochastic (per-epoch Bernoulli mask) mode.
- `ggml_layer_embedding()` — token embedding lookup for integer inputs.
- `ggml_input()` gains a `dtype` argument (`"float32"` or `"int32"`); integer inputs are supported through `ggml_model()` and `ggml_predict()`.
- `ggml_input()` — declare a symbolic input tensor node
(Functional API).
- `ggml_model()` — assemble a `ggml_functional_model` from input/output nodes.
- `ggml_layer_add()` — element-wise addition of tensor nodes (residual connections).
- `ggml_layer_concatenate()` — concatenate tensor nodes along an axis.
- `ggml_layer_*()` functions now accept a `ggml_tensor_node` as their first argument (Functional API mode).
- `ggml_compile()`, `ggml_fit()`, `ggml_evaluate()`, `ggml_predict()` are now S3 generics with methods for `ggml_functional_model`.
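A hedged sketch of the Functional API above, building a small residual block. The `shape` and `units` arguments are invented for illustration, and compile/fit arguments are omitted:

```r
library(ggmlR)

# Symbolic graph: input -> dense -> dense, with a residual add.
inp <- ggml_input(shape = c(32))          # shape argument assumed
h1  <- ggml_layer_dense(inp, units = 32)  # Functional API mode:
h2  <- ggml_layer_dense(h1, units = 32)   # tensor node as first argument
out <- ggml_layer_add(h2, inp)            # residual connection

model <- ggml_model(inp, out)             # assemble ggml_functional_model
model <- ggml_compile(model)              # S3 dispatch on the model class
```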
- `ggml_fit_opt()` — low-level optimizer loop with callbacks and learning-rate control.
- `ggml_callback_early_stopping()` — stops training when a metric stagnates.
- `ggml_schedule_step_decay()` — step learning-rate decay.
- `ggml_schedule_cosine_decay()` — cosine learning-rate annealing.
- `ggml_schedule_reduce_on_plateau()` — reduces the LR when a metric stops improving.
- `ggml_opt_init_for_fit()`, `ggml_opt_set_lr()`, `ggml_opt_get_lr()` — learning-rate control without recreating the optimizer context.
- `configure.win`.
- `ggml_layer_conv_1d()` — 1D convolution layer.
- `ggml_layer_batch_norm()` — batch normalization layer.
- `ggml_predict_classes()` — argmax wrapper returning 1-based class indices.
- `summary.ggml_sequential_model()` — detailed model summary with parameter counts.
- `ggml_fit()` now returns `model$history`
(class `ggml_history`) with `print` and `plot` methods.
- `ggml_model_sequential()`, `ggml_layer_dense()`, `ggml_layer_conv_2d()`, `ggml_layer_max_pooling_2d()`, `ggml_layer_flatten()`, `ggml_compile()`, `ggml_fit()`, `ggml_evaluate()`, `ggml_predict()`, `ggml_save_weights()`, `ggml_load_weights()`.
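The sequential API above strings together much like Keras. A hedged sketch: the layer and compile arguments are assumptions; the reassignment of `model` after `ggml_fit()` follows the note earlier in this changelog:

```r
library(ggmlR)

model <- ggml_model_sequential()
# Layer/compile arguments below are illustrative assumptions.
model <- ggml_layer_dense(model, units = 16, activation = "relu")
model <- ggml_layer_dense(model, units = 3, activation = "softmax")
model <- ggml_compile(model, loss = "categorical_crossentropy")

model <- ggml_fit(model, x_train, y_train, epochs = 10)  # must reassign!
plot(model$history)                  # ggml_history has a plot method
res <- ggml_evaluate(model, x_test, y_test)  # loss, accuracy, n_samples
```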
- `ggml_timestep_embedding()` — sinusoidal timestep embeddings.
- `ggml_set_f32_nd()`, `ggml_get_f32_nd()`, `ggml_set_i32_nd()`, `ggml_get_i32_nd()`.
- `ggml_tensor_nb()`, `ggml_tensor_num()`, `ggml_tensor_copy()`, `ggml_tensor_set_f32_scalar()`, `ggml_get_first_tensor()`, `ggml_get_next_tensor()`.
- `libggml.a` exported for linking by dependent packages.
- `gguf.cpp` added for GGUF file-format support.
- Headers installed to `inst/include/` for `LinkingTo`.
- `ggml_opt_init()`,
`ggml_opt_free()`, `ggml_opt_fit()`, `ggml_opt_epoch()`, `ggml_opt_eval()`.
- `ggml_opt_dataset_init()`, `ggml_opt_dataset_data()`, `ggml_opt_dataset_labels()`, `ggml_opt_dataset_shuffle()`.
- `ggml_opt_result_init()`, `ggml_opt_result_loss()`, `ggml_opt_result_accuracy()`, `ggml_opt_result_pred()`.
ggml_opt_result_pred().These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.