Getting Started with llamaR

llamaR provides R bindings to llama.cpp for running Large Language Models locally, with optional Vulkan GPU acceleration via ggmlR. This vignette walks through the core workflow: get a model, load it, generate text, tokenize, and extract embeddings. For the chat/server side see vignette("chat-and-agents").

library(llamaR)

1. Getting a model

llamaR works with GGUF files. Download one from the Hugging Face Hub (cached under ~/.cache/llamaR/ by default):

# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")

# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
  "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  pattern = "Q4_K_M"
)

Or point at any GGUF file you already have on disk.


2. Loading a model and creating a context

A model holds the weights; a context holds the working state (KV cache) for one generation session. Both are external pointers with GC finalizers, so explicit freeing is optional.

model <- llama_load_model(path, n_gpu_layers = -1L)   # -1 = offload all layers
ctx   <- llama_new_context(model, n_ctx = 4096L)

llama_model_info(model)   # size, n_params, context length, heads, ...

n_gpu_layers = -1L offloads every layer to the GPU when Vulkan is available, and falls back to CPU otherwise.


3. Generating text

llama_generate(ctx, "The capital of France is", max_new_tokens = 32L)

Sampling is controlled through arguments to llama_generate() (set temp = 0 for greedy decoding):

llama_generate(
  ctx, "Write a haiku about autumn.",
  max_new_tokens = 64L,
  temp           = 0.7,
  top_p          = 0.9,
  top_k          = 40L,
  repeat_penalty = 1.1
)
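
To make the sampling arguments more concrete, here is a minimal base-R sketch of how temperature, top-k, and top-p filtering can reshape a next-token distribution. It is a generic illustration on made-up logits, not llamaR's or llama.cpp's actual sampler: the helper name filter_probs, the toy vocabulary, and the order in which the filters are applied are all assumptions chosen for readability.

```r
# Toy illustration of temperature, top-k, and top-p filtering.
# NOT llamaR's implementation; `filter_probs` is a hypothetical helper.
filter_probs <- function(logits, temp = 0.7, top_k = 40L, top_p = 0.9) {
  out <- numeric(length(logits))
  names(out) <- names(logits)

  # Greedy decoding: all probability mass on the highest-scoring token
  if (temp <= 0) {
    out[which.max(logits)] <- 1
    return(out)
  }

  # Temperature-scaled softmax (subtract the max for numerical stability)
  z <- logits / temp
  p <- exp(z - max(z))
  p <- p / sum(p)

  # top-k: keep only the k most probable tokens
  ord  <- order(p, decreasing = TRUE)
  keep <- ord[seq_len(min(top_k, length(p)))]

  # top-p: among those, keep the smallest prefix whose renormalised
  # cumulative mass reaches top_p
  cum    <- cumsum(p[keep] / sum(p[keep]))
  n_keep <- which(cum >= top_p)[1]
  if (!is.na(n_keep)) keep <- keep[seq_len(n_keep)]

  # Zero out everything else and renormalise
  out[keep] <- p[keep]
  out / sum(out)
}

# Made-up logits over a five-token vocabulary
logits <- c(the = 2.0, a = 1.5, cat = 0.2, dog = 0.1, xyz = -3.0)
probs  <- filter_probs(logits, temp = 0.7, top_k = 3L, top_p = 0.9)
round(probs, 3)

# Draw one token from the filtered distribution
set.seed(1)
sample(names(logits), size = 1, prob = probs)
```

Lower temp values sharpen the distribution toward the top token, while smaller top_k and top_p values remove more of the low-probability tail before sampling.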

Pass with_timings = TRUE to get token throughput alongside the text.


4. Chat models and templates

Instruction-tuned models expect their prompt wrapped in a chat template ([INST]…[/INST], <|im_start|>…, etc.). llama_chat_apply_template() builds that prompt from a list of role/content messages:

messages <- list(
  list(role = "system",    content = "You are a helpful assistant."),
  list(role = "user",      content = "Name three primary colors.")
)

prompt <- llama_chat_apply_template(messages)   # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)

For multi-turn chat with history management, use chat_llamar() instead — see vignette("chat-and-agents").
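
As a rough picture of what a chat template does, the sketch below formats the same kind of role/content list by hand in the ChatML style (the <|im_start|>… markers mentioned above). The helper format_chatml and its add_generation_prompt argument are hypothetical and not part of llamaR; a real model may use a different template, which is why the built-in one is normally preferable.

```r
# Hand-rolled ChatML-style formatter, for illustration only.
# `format_chatml` is a hypothetical helper, not part of llamaR.
format_chatml <- function(messages, add_generation_prompt = TRUE) {
  # One "<|im_start|>role\ncontent<|im_end|>\n" block per message
  turns <- vapply(
    messages,
    function(m) sprintf("<|im_start|>%s\n%s<|im_end|>\n", m$role, m$content),
    character(1)
  )
  prompt <- paste(turns, collapse = "")

  # Open an assistant turn so the model continues with its reply
  if (add_generation_prompt) {
    prompt <- paste0(prompt, "<|im_start|>assistant\n")
  }
  prompt
}

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = "Name three primary colors.")
)

chatml_prompt <- format_chatml(messages)
cat(chatml_prompt)
```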


5. Tokenization

tokens <- llama_tokenize(ctx, "Hello, world!")
tokens

llama_detokenize(ctx, tokens)

When tokenizing a prompt that already contains role markers from a chat template, set parse_special = TRUE so markers like [INST] become single control tokens rather than literal characters:

prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)

6. Embeddings

Create the context in embedding mode, then extract vectors. For a single text:

emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx   <- llama_new_context(emb_model, embedding = TRUE)

v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)

A batch of texts in one call:

m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m)   # one row per input
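
A common next step is comparing embeddings by cosine similarity. The sketch below uses base R only and a random stand-in matrix, so that it runs without a model; it assumes the one-row-per-input layout noted above, and the 8-column width is arbitrary.

```r
# Stand-in for the matrix returned by llama_embed_batch(): one row per text.
# Random values are used here so the sketch runs without a model.
set.seed(42)
texts <- c("first text", "second text", "third text")
m <- matrix(rnorm(length(texts) * 8), nrow = length(texts),
            dimnames = list(texts, NULL))

# L2-normalise each row; cosine similarity is then a plain matrix product
m_unit <- m / sqrt(rowSums(m^2))
sim    <- tcrossprod(m_unit)      # same as m_unit %*% t(m_unit)
round(sim, 3)

# Most similar *other* text for each row (mask the diagonal first)
sim_other <- sim
diag(sim_other) <- -Inf
nearest <- texts[apply(sim_other, 1, which.max)]
data.frame(text = texts, nearest = nearest)
```

With real embeddings, replace the stand-in m with the result of llama_embed_batch() and keep the rest unchanged.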

ragnar-compatible provider

embed_llamar() is a higher-level helper that loads the model for you and returns a provider suitable for ragnar_store_create(embed = ...). Called with a model only, it returns a closure (partial application); called with text, it returns a matrix.

library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)   # `documents`: chunks you prepared beforehand (not shown)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")

Combine this with a local chat_llamar() for a fully local RAG stack — see vignette("chat-and-agents").
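
The "closure when called without text, matrix when called with text" convention described for embed_llamar() can be sketched with a toy function in base R. Everything here is hypothetical: embed_toy just counts character codes into a fixed number of bins and is not a real embedding model, nor a reflection of how embed_llamar() is implemented.

```r
# Toy provider mimicking the calling convention described above.
# `embed_toy` is hypothetical: it bins character codes into a fixed-size
# vector and is not a real embedding model.
embed_toy <- function(x, dims = 4L) {
  # No text supplied: return a closure that remembers `dims`
  if (missing(x) || is.null(x)) {
    return(function(x) embed_toy(x, dims = dims))
  }
  # Text supplied: return a matrix with one row per input string
  t(vapply(x, function(s) {
    codes <- utf8ToInt(s)
    as.numeric(tabulate(codes %% dims + 1L, nbins = dims))
  }, numeric(dims), USE.NAMES = FALSE))
}

provider <- embed_toy(dims = 4L)   # closure: configuration is fixed up front
is.function(provider)

emb <- provider(c("first text", "second text"))
dim(emb)                           # rows = inputs, columns = dims
```

Fixing the configuration once and passing the resulting function around is what lets a store call the provider later with only the text to embed.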


7. Serving and chatting

To talk to a model over HTTP, or to use it through the ellmer/ragnar toolchain, see vignette("chat-and-agents").

