This tutorial covers the lower-level API for full control over text
generation. While quick_llama() is convenient for simple
tasks, the core functions give you fine-grained control over model
loading, context management, and generation parameters.
The recommended workflow consists of four steps:
1. model_load() - Load the model into memory once
2. context_create() - Create a reusable context for inference
3. apply_chat_template() - Format prompts correctly for the model
4. generate() - Generate text from the context

Use model_load() to load a GGUF model into memory:
library(localLLM)
# Load the default model
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf")
# Or load from a URL (downloaded and cached automatically)
model <- model_load(
"https://huggingface.co/unsloth/gemma-3-4b-it-qat-GGUF/resolve/main/gemma-3-4b-it-qat-Q5_K_M.gguf"
)
# With GPU acceleration (offload layers to GPU)
model <- model_load(
"Llama-3.2-3B-Instruct-Q5_K_M.gguf",
n_gpu_layers = 999 # Offload as many layers as possible
)

| Parameter | Default | Description |
|---|---|---|
| model_path | - | Path, URL, or cached model name |
| n_gpu_layers | 0 | Number of layers to offload to GPU |
| use_mmap | TRUE | Memory-map the model file |
| use_mlock | FALSE | Lock model in RAM (prevents swapping) |
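For instance, to lock the model in RAM so it is never swapped out, combine the parameters above (a minimal sketch; whether mlock helps depends on your system):

model <- model_load(
  "Llama-3.2-3B-Instruct-Q5_K_M.gguf",
  use_mmap = TRUE,   # memory-map the file (the default)
  use_mlock = TRUE   # pin the model in RAM to prevent swapping
)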
The context manages the inference state and memory allocation:
# Create a context with default settings
ctx <- context_create(model)
# Create a context with custom settings
ctx <- context_create(
model,
n_ctx = 4096, # Context window size (tokens)
n_threads = 8, # CPU threads for generation
n_seq_max = 1 # Maximum parallel sequences
)

| Parameter | Default | Description |
|---|---|---|
| n_ctx | 512 | Context window size in tokens |
| n_threads | auto | Number of CPU threads |
| n_seq_max | 1 | Max parallel sequences (for batch generation) |
| verbosity | 0 | Logging level (0 = quiet, 3 = verbose) |
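For instance, to see more detailed logging while experimenting with thread count (a small sketch built from the parameters above; useful values depend on your machine):

ctx <- context_create(
  model,
  n_ctx = 2048,    # moderate context window
  n_threads = 4,   # explicit CPU thread count
  verbosity = 2    # more detailed logging (0 = quiet, 3 = verbose)
)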
The context window (n_ctx) determines how much text the
model can “see” at once. Larger values allow longer conversations but
use more memory.
Modern LLMs are trained on specific conversation formats. The
apply_chat_template() function formats your messages
correctly:
# Define a conversation as a list of messages
messages <- list(
list(role = "system", content = "You are a helpful R programming assistant."),
list(role = "user", content = "How do I read a CSV file?")
)
# Apply the model's chat template
formatted_prompt <- apply_chat_template(model, messages)
cat(formatted_prompt)
#> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#>
#> You are a helpful R programming assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
#>
#> How do I read a CSV file?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
You can include multiple turns in the conversation:
messages <- list(
list(role = "system", content = "You are a helpful assistant."),
list(role = "user", content = "What is R?"),
list(role = "assistant", content = "R is a programming language for statistical computing."),
list(role = "user", content = "How do I install packages?")
)
formatted_prompt <- apply_chat_template(model, messages)
Use generate() to produce text from the formatted prompt:
output <- generate(
  ctx,
  formatted_prompt,
  max_tokens = 200,      # Maximum tokens to generate
  temperature = 0.0,     # Creativity (0 = deterministic)
  top_k = 40,            # Consider top K tokens
  top_p = 1.0,           # Nucleus sampling threshold
  repeat_last_n = 0,     # Tokens to consider for repetition penalty
  penalty_repeat = 1.0,  # Repetition penalty (>1 discourages)
  seed = 1234            # Random seed for reproducibility
)
cat(output)
#> To read a CSV file in R, you can use the `read.csv()` function:
#>
#> ```r
#> data <- read.csv("your_file.csv")
#> ```

| Parameter | Default | Description |
|---|---|---|
| max_tokens | 256 | Maximum tokens to generate |
| temperature | 0.0 | Sampling temperature (0 = greedy) |
| top_k | 40 | Top-K sampling |
| top_p | 1.0 | Nucleus sampling (1.0 = disabled) |
| repeat_last_n | 0 | Window for repetition penalty |
| penalty_repeat | 1.0 | Repetition penalty multiplier |
| seed | 1234 | Random seed |
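For more varied, less deterministic output, raise the temperature and tighten nucleus sampling. The values below are illustrative, not tuned recommendations:

output_creative <- generate(
  ctx,
  formatted_prompt,
  max_tokens = 200,
  temperature = 0.8,     # > 0 enables stochastic sampling
  top_p = 0.95,          # nucleus sampling: keep the top 95% probability mass
  penalty_repeat = 1.1,  # mildly discourage repetition
  seed = 42
)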
Here’s a complete workflow putting it all together:
library(localLLM)
# 1. Load model with GPU acceleration
model <- model_load(
"Llama-3.2-3B-Instruct-Q5_K_M.gguf",
n_gpu_layers = 999
)
# 2. Create context with appropriate size
ctx <- context_create(model, n_ctx = 4096)
# 3. Define conversation
messages <- list(
list(
role = "system",
content = "You are a helpful R programming assistant who provides concise code examples."
),
list(
role = "user",
content = "How do I create a bar plot in ggplot2?"
)
)
# 4. Format prompt
formatted_prompt <- apply_chat_template(model, messages)
# 5. Generate response
output <- generate(
ctx,
formatted_prompt,
max_tokens = 300,
temperature = 0,
seed = 42
)
cat(output)
#> Here's how to create a bar plot in ggplot2:
#>
#> ```r
#> library(ggplot2)
#>
#> # Sample data
#> df <- data.frame(
#> category = c("A", "B", "C", "D"),
#> value = c(25, 40, 30, 45)
#> )
#>
#> # Create bar plot
#> ggplot(df, aes(x = category, y = value)) +
#> geom_bar(stat = "identity", fill = "steelblue") +
#> theme_minimal() +
#> labs(title = "Bar Plot Example", x = "Category", y = "Value")
#> ```
For advanced use cases, you can work directly with tokens:
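A minimal sketch, assuming the package exports tokenize() and detokenize() helpers for converting between text and token IDs:

# Convert text into the model's token IDs
tokens <- tokenize(model, "Hello, world!")
tokens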
#> [1] 9906 11 1695 0
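# Convert token IDs back into text
detokenize(model, tokens)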
#> [1] "Hello, world!"
Loading a model is expensive. Load once and reuse:
# Good: Load once, use many times
model <- model_load("model.gguf")
ctx <- context_create(model)
for (prompt in prompts) {
result <- generate(ctx, prompt)
}
# Bad: Loading in a loop
for (prompt in prompts) {
model <- model_load("model.gguf") # Slow!
ctx <- context_create(model)
result <- generate(ctx, prompt)
}
Larger contexts use more memory. Match n_ctx to your needs:
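For example (illustrative sizes; actual requirements depend on your model and available RAM):

ctx_small <- context_create(model, n_ctx = 512)   # short, single-turn prompts
ctx_chat  <- context_create(model, n_ctx = 4096)  # multi-turn conversations
ctx_long  <- context_create(model, n_ctx = 8192)  # long documents (highest memory use)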