llamaR provides R bindings to llama.cpp for running
Large Language Models locally, with optional Vulkan GPU acceleration via
ggmlR. This vignette
walks through the core workflow: get a model, load it, generate text,
tokenize, and extract embeddings. For the chat/server side see
vignette("chat-and-agents").
llamaR works with GGUF files. Download one from the Hugging Face Hub
(cached under ~/.cache/llamaR/ by default):
# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")
# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
"TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
pattern = "Q4_K_M"
)

Or point at any GGUF file you already have on disk.
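When pointing at a file that is already on disk, a quick sanity check can save a confusing load error. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes "GGUF". The helper name is_gguf() is hypothetical and is not part of llamaR.

```r
# Hypothetical helper (not a llamaR function): check the 4-byte GGUF magic.
is_gguf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  magic <- readBin(con, what = "raw", n = 4L)
  identical(magic, charToRaw("GGUF"))
}

# Demo on throwaway files rather than a real model
good <- tempfile(fileext = ".gguf")
bad  <- tempfile(fileext = ".gguf")
writeBin(c(charToRaw("GGUF"), as.raw(c(3, 0, 0, 0))), good)
writeBin(charToRaw("not a model"), bad)

is_gguf(good)
is_gguf(bad)
```

This only inspects the header; it does not validate the rest of the file.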
A model holds the weights; a context holds the working state (KV cache) for one generation session. Both are external pointers with GC finalizers, so explicit freeing is optional.
model <- llama_load_model(path, n_gpu_layers = -1L) # -1 = offload all layers
ctx <- llama_new_context(model, n_ctx = 4096L)
llama_model_info(model) # size, n_params, context length, heads, ...

n_gpu_layers = -1L offloads every layer to the GPU when
Vulkan is available, and falls back to CPU otherwise.
Sampling is controlled by arguments (set temp = 0 for
greedy decoding):
llama_generate(
ctx, "Write a haiku about autumn.",
max_new_tokens = 64L,
temp = 0.7,
top_p = 0.9,
top_k = 40L,
repeat_penalty = 1.1
)

Pass with_timings = TRUE to get token throughput
alongside the text.
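To build intuition for temp, top_k, and top_p, here is a toy base-R sketch of how these knobs are commonly defined: temperature rescales the logits, top-k keeps the k most likely tokens, and top-p keeps the smallest prefix of those whose cumulative probability reaches p. It is an illustration under that assumption, not llamaR's or llama.cpp's actual sampler; the function name sample_filter() is made up, and repeat_penalty is left out.

```r
# Toy illustration only; NOT llamaR's implementation.
sample_filter <- function(logits, temp = 0.7, top_k = 40L, top_p = 0.9) {
  if (temp == 0) {
    # Greedy decoding: all probability mass on the arg-max token
    probs <- numeric(length(logits))
    probs[which.max(logits)] <- 1
    return(probs)
  }
  # Temperature-scaled softmax (shifted by the max for numerical stability)
  scaled <- logits / temp
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)

  # top-k: indices of the k most probable tokens
  ord <- order(probs, decreasing = TRUE)
  keep <- ord[seq_len(min(top_k, length(ord)))]

  # top-p: smallest prefix whose renormalised cumulative mass reaches top_p
  p_k <- probs[keep] / sum(probs[keep])
  n_keep <- which(cumsum(p_k) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(keep)
  keep <- keep[seq_len(n_keep)]

  # Final distribution: zero outside the kept set, renormalised inside it
  out <- numeric(length(logits))
  out[keep] <- probs[keep] / sum(probs[keep])
  out
}

logits <- c(2.0, 1.0, 0.5, -1.0)
greedy <- sample_filter(logits, temp = 0)
p <- sample_filter(logits, temp = 0.7, top_k = 3L, top_p = 0.9)
next_token <- sample(seq_along(p), size = 1L, prob = p)
```

Lower temperatures and smaller top_k or top_p values concentrate the distribution on fewer tokens, which makes output more deterministic.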
Instruction-tuned models expect their prompt wrapped in a chat
template ([INST]…[/INST], <|im_start|>…,
etc.). llama_chat_apply_template() builds that prompt from
a list of role/content messages:
messages <- list(
list(role = "system", content = "You are a helpful assistant."),
list(role = "user", content = "Name three primary colors.")
)
prompt <- llama_chat_apply_template(messages) # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)

For multi-turn chat with history management, use
chat_llamar() instead; see
vignette("chat-and-agents").
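To make the idea of a chat template concrete, the following base-R sketch formats a message list in the widely used ChatML layout (the <|im_start|> style mentioned above). The function apply_chatml() and its add_generation_prompt argument are hypothetical; in practice llama_chat_apply_template() reads the template shipped with the model, which may differ from this one.

```r
# Hypothetical ChatML-style formatter, for illustration only.
apply_chatml <- function(messages, add_generation_prompt = TRUE) {
  turns <- vapply(
    messages,
    function(m) sprintf("<|im_start|>%s\n%s<|im_end|>\n", m$role, m$content),
    character(1)
  )
  prompt <- paste(turns, collapse = "")
  # Leave an open assistant turn so the model continues from there
  if (add_generation_prompt) {
    prompt <- paste0(prompt, "<|im_start|>assistant\n")
  }
  prompt
}

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "Name three primary colors.")
)
prompt <- apply_chatml(messages)
cat(prompt)
```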
When tokenizing a prompt that already contains role markers from a
chat template, set parse_special = TRUE so markers like
[INST] become single control tokens rather than literal
characters:
prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)

Create the context in embedding mode, then extract vectors. Single text:
emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx <- llama_new_context(emb_model, embedding = TRUE)
v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)

A batch of texts in one call:
m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m) # one row per input

embed_llamar() is a higher-level helper that loads the
model for you and returns a provider suitable for
ragnar_store_create(embed = ...). Called with a model only,
it returns a closure (partial application); called with text, it returns
a matrix.
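The two calling modes described here follow a common R pattern, which can be sketched in base R. The function embed_toy() and its stand-in "embedding" are invented for illustration; only the control flow (return a closure when no text is given, a matrix otherwise) mirrors the behaviour described, and the real embed_llamar() internals may differ.

```r
# Hypothetical stand-in showing the partial-application pattern.
embed_toy <- function(x, n_dim = 4L) {
  force(n_dim)
  if (missing(x) || is.null(x)) {
    # No text supplied: return a closure with the configuration baked in
    return(function(x) embed_toy(x, n_dim = n_dim))
  }
  # Fake "embedding": deterministic numbers derived from string length.
  # vapply() yields an n_dim x length(x) matrix; transpose to one row per text.
  t(vapply(
    x,
    function(s) nchar(s) * seq_len(n_dim) / n_dim,
    numeric(n_dim),
    USE.NAMES = FALSE
  ))
}

provider <- embed_toy(n_dim = 3L)   # closure, configuration only
mat <- provider(c("a", "bb"))       # matrix, one row per input text
dim(mat)
```

Returning a closure lets a store hold on to a configured embedder and call it later on whatever text it needs to index.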
library(ragnar)
store <- ragnar_store_create(
location = "store.duckdb",
embed = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")

Combine this with a local chat_llamar() for a fully
local RAG stack; see vignette("chat-and-agents").
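Conceptually, vector retrieval ranks stored texts by how similar their embeddings are to the query embedding. The base-R sketch below shows that idea with cosine similarity on a made-up matrix shaped like a batch embedding result (one row per text). It is not a description of how ragnar scores results internally, and cosine_to_rows() is a hypothetical helper.

```r
# Made-up embeddings for illustration: one row per text.
m <- rbind(
  first  = c(1, 0, 1),
  second = c(0.9, 0.1, 1.1),
  third  = c(-1, 1, 0)
)
query <- c(1, 0, 0.9)

# Cosine similarity between a query vector and every row of a matrix
cosine_to_rows <- function(m, q) {
  as.vector(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

scores <- cosine_to_rows(m, query)
names(scores) <- rownames(m)
ranked <- sort(scores, decreasing = TRUE)
ranked
```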
To talk to a model over HTTP, or to use it through the ellmer/ragnar
toolchain, see vignette("chat-and-agents"):
llama_serve_openai(): OpenAI-compatible HTTP server.
chat_llamar(): an ellmer::Chat backed by a local model.

vignette("chat-and-agents"): server, ellmer, ragnar, OpenCode.
?llama_generate, ?llama_chat_apply_template, ?embed_llamar