This document addresses common questions about using mLLMCelltype for cell type annotation in single-cell RNA sequencing data.
mLLMCelltype differs from traditional cell type annotation tools in several key ways:
No reference dataset required: Unlike reference-based methods, mLLMCelltype doesn’t require a pre-existing reference dataset.
Multi-model consensus: mLLMCelltype leverages multiple large language models to achieve more reliable annotations than any single model could provide.
Transparent reasoning: The package provides complete reasoning chains for annotations, making the process interpretable and transparent.
Uncertainty quantification: mLLMCelltype provides explicit uncertainty metrics (consensus proportion and Shannon entropy) to identify ambiguous cell populations.
Structured deliberation: For controversial clusters, mLLMCelltype initiates a structured discussion process among models to reach a more reliable consensus.
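As a rough illustration of the two uncertainty metrics named above, the following base R sketch computes them from one cluster's per-model votes. The helper names are hypothetical and not part of the mLLMCelltype API, and the log base (2) is an assumption; the package's own calculation may differ in detail.

```r
# Hypothetical helpers for illustration; not part of the mLLMCelltype API.

# Fraction of models that agree with the most common label.
consensus_proportion <- function(votes) {
  counts <- table(votes)
  max(counts) / length(votes)
}

# Shannon entropy (in bits, an assumed unit) of the label distribution.
# 0 means all models agree; larger values mean more disagreement.
shannon_entropy <- function(votes) {
  p <- as.numeric(table(votes)) / length(votes)
  -sum(p * log2(p))
}

votes_clear <- c("T cell", "T cell", "T cell")
votes_mixed <- c("T cell", "NK cell", "T cell", "ILC")

consensus_proportion(votes_clear)  # 1
shannon_entropy(votes_clear)       # 0
consensus_proportion(votes_mixed)  # 0.5
shannon_entropy(votes_mixed)       # 1.5
```

A cluster with a low consensus proportion and high entropy, like the second example, is the kind that would be flagged for closer review.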
mLLMCelltype can annotate cell types from virtually any tissue and species, as it relies on the biological knowledge embedded in large language models rather than pre-defined reference datasets. However, performance may vary depending on how well-characterized the tissue is in the scientific literature.
The package has been extensively tested on:
- Human tissues (PBMC, bone marrow, brain, lung, liver, kidney, etc.)
- Mouse tissues (brain, lung, kidney, etc.)
- Other model organisms (zebrafish, fruit fly, etc.)
For very specialized or poorly characterized tissues, the uncertainty metrics will help identify clusters that may require expert review.
In our benchmarks (Yang et al., 2025; see our paper), the consensus approach showed improvements over both traditional annotation methods and single-LLM approaches:
The accuracy advantage is particularly pronounced for rare cell types and tissues with limited reference data.
mLLMCelltype preserves your original cluster IDs as-is. Whether your clusters are numbered 0, 1, 2 (Seurat default) or 1, 2, 3 (R convention), or use custom names like “t_cells”, the package will use them directly without modification. The returned annotations use the same cluster IDs as your input.
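Because cluster IDs are kept as labels, it is safest to look annotations up by name rather than by position. The sketch below assumes the annotations are available as a named character vector keyed by cluster ID (an assumption about the result shape) and shows why positional indexing goes wrong with 0-based IDs.

```r
# Assumed shape: a named character vector keyed by cluster ID.
annotations <- c("0" = "T cells", "1" = "B cells", "2" = "Monocytes")

# Per-cell cluster assignments using Seurat-style 0-based IDs.
cell_clusters <- c(0, 2, 1, 0)

# Correct: convert IDs to character so the lookup is by name.
cell_types <- unname(annotations[as.character(cell_clusters)])
cell_types  # "T cells" "Monocytes" "B cells" "T cells"

# Pitfall: numeric indexing is positional, and index 0 is silently dropped.
wrong <- unname(annotations[cell_clusters])
wrong  # "B cells" "T cells"
```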
The default setting uses the top 10 marker genes per cluster, which works well for most scenarios. However, you can adjust this using the top_gene_count parameter:
The optimal number depends on the quality of your marker genes and the complexity of the tissue. We recommend starting with the default of 10 and adjusting based on the results.
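To see what a given marker count would feed into the annotation, you can preview the top genes per cluster yourself. This is a minimal sketch in base R, assuming a FindAllMarkers-style data frame with `cluster`, `gene`, and `avg_log2FC` columns; `top_markers` is a hypothetical helper, and ranking by `avg_log2FC` is an assumption rather than a description of how the package selects genes internally.

```r
# Hypothetical helper: top n genes per cluster, ranked by avg_log2FC.
top_markers <- function(markers, n = 10) {
  by_cluster <- split(markers, as.character(markers$cluster))
  lapply(by_cluster, function(df) {
    df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
    head(df$gene, n)
  })
}

# Toy marker table standing in for real differential expression output.
toy <- data.frame(
  cluster = c("0", "0", "0", "1", "1"),
  gene = c("CD3D", "CD3E", "IL7R", "MS4A1", "CD79A"),
  avg_log2FC = c(2.1, 2.5, 1.2, 3.0, 2.8),
  stringsAsFactors = FALSE
)

preview <- top_markers(toy, n = 2)
preview[["0"]]  # "CD3E" "CD3D"
preview[["1"]]  # "MS4A1" "CD79A"
```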
mLLMCelltype implements a caching system to avoid redundant API calls, which saves time and reduces costs:
… cache = TRUE)
… cache_dir parameter
To clear the cache:
Note: The annotate_cell_types function does not have built-in caching. If you need caching, you’ll need to implement it separately.
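One way to implement such caching yourself is a small on-disk cache keyed by a hash of the request inputs. The sketch below uses only base R; `cached_call` and its arguments are hypothetical names, and `fake_annotate` stands in for a real annotation call so the example runs without any API access.

```r
# Hypothetical helper: run fun() once per unique key and reuse the saved result.
cached_call <- function(key_parts, fun, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  # tools::md5sum() hashes files, so serialize the key to a temp file first.
  key_file <- tempfile()
  on.exit(unlink(key_file))
  writeBin(serialize(key_parts, NULL), key_file)
  hash <- unname(tools::md5sum(key_file))

  path <- file.path(cache_dir, paste0(hash, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: no call is made
  }
  result <- fun()
  saveRDS(result, path)
  result
}

# Stand-in for a real (billable) annotation call.
calls <- 0
fake_annotate <- function() {
  calls <<- calls + 1
  c("0" = "T cells", "1" = "B cells")
}

cache_dir <- tempfile("annotation_cache_")
key <- list(tissue = "human PBMC", model = "some-model", genes = c("CD3D", "MS4A1"))

first <- cached_call(key, fake_annotate, cache_dir)
second <- cached_call(key, fake_annotate, cache_dir)  # served from disk
calls  # 1
```

The key should include everything that affects the answer (marker genes, tissue name, model), so that changing any of them triggers a fresh call.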
The package includes error handling for API calls:
If you’re processing many clusters, you might encounter rate limits. In this case:
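A common, generic way to cope with rate limits is to retry with exponential backoff. The sketch below is not taken from the package; `with_backoff` is a hypothetical wrapper, and `flaky` simulates a call that is rejected twice before succeeding.

```r
# Hypothetical wrapper: retry fun() with exponentially growing delays.
with_backoff <- function(fun, max_tries = 5, base_delay = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_tries) {
      stop(result)  # give up and re-signal the last error
    }
    Sys.sleep(base_delay * 2^(attempt - 1))  # 1x, 2x, 4x, ... base_delay
  }
}

# Simulated call that fails twice (as if rate limited), then succeeds.
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("rate limit exceeded")
  "ok"
}

out <- with_backoff(flaky, max_tries = 5, base_delay = 0.01)
out       # "ok"
attempts  # 3
```

In real use, a base delay of a second or more is more appropriate than the short value used here to keep the example fast.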
The runtime depends on several factors:
Typical runtimes:
- Single model, 10 clusters: 1-2 minutes
- Multi-model consensus (3 models), 10 clusters: 3-5 minutes
- Multi-model consensus with discussion, 10 clusters: 5-10 minutes
To optimize runtime:
- Implement your own caching mechanism if needed
- Start with fewer models for initial exploration
- Use a higher controversy_threshold to reduce the number of controversial clusters
- Process large datasets in batches
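Batching can be as simple as splitting the cluster IDs into groups and subsetting the marker table for each group. This is a minimal base R sketch with a hypothetical helper name and a toy marker table; each element of `batch_inputs` could then be passed to the annotation function in turn.

```r
# Hypothetical helper: split cluster IDs into consecutive batches.
split_into_batches <- function(ids, batch_size) {
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

cluster_ids <- as.character(0:6)
batches <- split_into_batches(cluster_ids, batch_size = 3)
# batches: ("0","1","2"), ("3","4","5"), ("6")

# Toy marker table with two genes per cluster.
markers <- data.frame(
  cluster = rep(cluster_ids, each = 2),
  gene = paste0("gene", 1:14),
  stringsAsFactors = FALSE
)

# One marker subset per batch, ready to annotate separately.
batch_inputs <- lapply(batches, function(b) markers[markers$cluster %in% b, ])
vapply(batch_inputs, nrow, integer(1))  # 6 6 2
```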
The API costs depend on the models you use and the number of clusters:
… :free suffix)

For a typical dataset with 10-20 clusters:
- Single model annotation: $0.10-1.00 total
- Multi-model consensus (3 models): $0.30-3.00 total
- With discussion process: Additional $0.10-1.00
- Using OpenRouter free models: $0.00 total
To reduce costs:
- Implement your own caching mechanism to avoid redundant API calls
- Start with more economical models
- Use fewer models for initial exploration
- Reserve multi-model consensus for final analysis
- Consider using OpenRouter free models (see below)
OpenRouter provides access to several high-quality models for free:
Sign up for an OpenRouter account at openrouter.ai
Get your API key from the OpenRouter dashboard
Use models with the :free suffix:

# Set your OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")

# Use a free model
results <- annotate_cell_types(
  input = marker_data,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free", # Note the :free suffix
  api_key = Sys.getenv("OPENROUTER_API_KEY")
  # No need to specify provider - it's automatically detected from the model name format
)

meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context, best performance)
deepseek/deepseek-r1:free - DeepSeek R1 (advanced reasoning)
meta-llama/llama-3.3-70b-instruct:free - Meta Llama 3.3 70B (reliable)
venice/uncensored:free - Venice Uncensored (new model)
minimax/minimax-m2:free - MiniMax M2 (optimized for coding)
z-ai/glm-4.5-air:free - GLM 4.5 Air (lightweight)

Important Notes about Free Models (Updated Oct 2025):
- Daily limits reduced: 50 requests/day for free accounts (down from 200)
- Accounts with $10+ credits: 1000 requests/day
- Rate limit: 20 requests/minute for all accounts
- Some models removed: NVIDIA Nemotron and others have exited the free tier
- Availability may change: Verify current free models at https://openrouter.ai/models?q=free
- For production use: Consider using paid models for better reliability
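A per-minute limit of 20 requests works out to at most one request every 3 seconds. One way to stay under such a limit is to space calls out on the client side. The sketch below is a generic throttle in base R, not something the package provides; `make_throttle` is a hypothetical helper, and the demo uses a much higher limit so it finishes quickly.

```r
# Hypothetical helper: returns a function that runs fun() no more often
# than max_per_minute times per minute by sleeping between calls.
make_throttle <- function(max_per_minute) {
  min_gap <- 60 / max_per_minute  # seconds between calls; 3 for a limit of 20
  last <- NULL
  function(fun) {
    if (!is.null(last)) {
      wait <- min_gap - as.numeric(difftime(Sys.time(), last, units = "secs"))
      if (wait > 0) Sys.sleep(wait)
    }
    last <<- Sys.time()
    fun()
  }
}

# Demo with a high limit (600/minute, i.e. 0.1 s spacing) to keep it fast.
throttled <- make_throttle(max_per_minute = 600)
start <- Sys.time()
values <- vapply(1:3, function(i) throttled(function() i * 2), numeric(1))
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
values  # 2 4 6
```

For the free tier described above, `make_throttle(max_per_minute = 20)` would be the corresponding setting; the daily request cap still applies separately.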
To get the most accurate annotations:
Use multiple high-quality models: Include diverse, high-performing models like Claude 3.7, GPT-4o, and Gemini 1.5
Provide good marker genes: Use robust differential expression analysis to identify strong marker genes
Specify the correct tissue: Always provide the correct tissue name to give models the proper context
Review uncertainty metrics: Pay attention to consensus proportion and Shannon entropy to identify clusters that may need manual review
Examine discussion logs: For controversial clusters, review the discussion logs to understand the reasoning
Iterate if needed: If results are unsatisfactory, try adjusting parameters or providing additional context
There are several possible reasons for getting different results with the same input:
Model updates: LLMs are regularly updated, which can change their outputs
Temperature/sampling: Some randomness is inherent in LLM outputs
Context window limitations: Different runs might include slightly different context
API changes: Providers may change how their APIs work
To ensure reproducibility:
- Implement your own caching mechanism to reuse results
- Specify model versions explicitly when available
- Save and document your results
- Consider saving the raw API responses for future reference
If you see an error about invalid cluster indices, check that your cluster column contains valid values. mLLMCelltype accepts any cluster IDs (numeric or character) and preserves them as-is. Common issues:
… cluster column exists in your data frame
… NA values in the cluster column
… cluster before calling the function

If you get an error about missing API keys:
Check environment variables: Ensure your API keys are set correctly in your .env file or environment
Provide keys directly: Pass the API key directly to the function:
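Before passing a key explicitly, it can help to confirm that the environment variable is actually set, so that a missing key fails early with a clear message. The helper name `get_api_key` is hypothetical; the commented call at the end mirrors the annotate_cell_types usage shown elsewhere in this FAQ.

```r
# Hypothetical helper: read an API key from the environment or fail clearly.
get_api_key <- function(var) {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("Environment variable %s is not set or is empty.", var),
         call. = FALSE)
  }
  key
}

# Placeholder value for illustration; use your real key in practice.
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
api_key <- get_api_key("OPENROUTER_API_KEY")

# The key can then be passed directly, for example:
# results <- annotate_cell_types(
#   input = marker_data,
#   tissue_name = "human PBMC",
#   model = "meta-llama/llama-4-maverick:free",
#   api_key = api_key
# )
```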
If specific cell types are not being correctly identified:
Check marker genes: Ensure the marker genes for these cell types are strong and specific
Provide more context: Specify the tissue type accurately to give models the right context
Use more models: Different models have different strengths; using multiple models improves coverage
Increase marker count: Try increasing top_gene_count to include more marker genes
Review discussion logs: For controversial clusters, examine the discussion to understand the reasoning
Consider rare cell types: Some cell types may be poorly represented in the training data of LLMs
mLLMCelltype integrates with Seurat:
Input: You can directly use Seurat’s FindAllMarkers() output as input
Output: Annotation results can be easily added to your Seurat object:
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = names(consensus_results$final_annotations),
  to = consensus_results$final_annotations
)

Yes, you can use mLLMCelltype with Scanpy/AnnData objects in R:
Extract marker genes: Export marker genes from your Scanpy analysis to a CSV file
Run mLLMCelltype: Use the CSV file as input to mLLMCelltype
Import results: Add the annotation results back to your AnnData object
Alternatively, you can use the Python version of mLLMCelltype for direct integration with Scanpy.
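The R side of the CSV round trip described above might look like the following sketch. The column names (`cluster`, `gene`, `avg_log2FC`) are assumptions that depend on how you export from Scanpy, the annotation result is a placeholder rather than real output, and whether the package accepts a file path directly or a data frame should be checked against its documentation.

```r
# Stand-in for a marker table exported from Scanpy (assumed column names).
markers_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "cluster,gene,avg_log2FC",
  "0,CD3D,2.5",
  "0,CD3E,2.1",
  "1,MS4A1,3.0"
), markers_csv)

# Read the markers and keep cluster IDs as labels, not numbers.
markers <- read.csv(markers_csv, stringsAsFactors = FALSE)
markers$cluster <- as.character(markers$cluster)
stopifnot(all(c("cluster", "gene") %in% names(markers)))

# Placeholder for the annotation step; real results come from mLLMCelltype.
annotations <- c("0" = "T cells", "1" = "B cells")

# Write one row per cluster so the labels can be read back on the Python side.
results_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(cluster = names(annotations), cell_type = unname(annotations)),
  results_csv,
  row.names = FALSE
)
```

On the Python side, the resulting file can be read and mapped onto the cluster column of your AnnData object.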
mLLMCelltype can be used alongside traditional annotation methods:
Complementary approach: Use both methods and compare results
Validation: Use mLLMCelltype to validate annotations from reference-based methods
Hybrid approach: Use reference-based methods for well-characterized cell types and mLLMCelltype for novel or rare cell types
Ensemble method: Create a consensus between mLLMCelltype and traditional methods
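Comparing the two sets of labels can start with a simple per-cluster agreement check. The sketch below uses made-up labels and exact string matching, so spelling variants such as "T cell" and "T cells" would count as disagreements unless labels are harmonized first.

```r
# Made-up per-cluster labels from two sources, keyed by cluster ID.
llm_labels <- c("0" = "T cells", "1" = "B cells", "2" = "Monocytes")
ref_labels <- c("0" = "T cells", "1" = "B cells", "2" = "Dendritic cells")

# Compare only clusters present in both results.
ids <- intersect(names(llm_labels), names(ref_labels))
agree <- llm_labels[ids] == ref_labels[ids]

agreement_rate <- mean(agree)  # 2 of 3 clusters agree
to_review <- ids[!agree]       # "2"
```

Clusters listed in `to_review` are natural candidates for manual inspection or for the hybrid approach described above.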
While mLLMCelltype uses carefully designed prompts, advanced users can customize them:
# Create a custom annotation prompt
custom_prompt <- create_annotation_prompt(
  marker_data = your_markers,
  tissue_name = "your_tissue",
  top_gene_count = 10,
  custom_instructions = "Also consider developmental stage and activation state."
)

# Use the custom prompt directly
response <- get_model_response(
  prompt = custom_prompt,
  model = "claude-sonnet-4-5-20250929",
  api_key = your_api_key
)

Yes, you can register custom models and providers:
# Register a custom provider
register_custom_provider(
  provider_name = "my_provider",
  api_url = "https://api.my-provider.com/v1/chat/completions",
  api_key_env_var = "MY_PROVIDER_API_KEY",
  process_function = function(prompt, api_key) {
    # Custom implementation
  }
)

# Register a custom model
register_custom_model(
  model_name = "my-custom-model",
  provider = "my_provider"
)

# Use the custom model
results <- annotate_cell_types(
  input = your_markers,
  tissue_name = "your_tissue",
  model = "my-custom-model",
  api_key = your_api_key
)

We welcome contributions! Here are some ways to contribute:
Report issues: Report bugs or suggest features on our GitHub repository
Improve documentation: Help us improve documentation and examples
Add new models: Implement support for new LLM models
Share benchmarks: Share your benchmarking results with different tissues and species
Develop new features: Contribute code for new features or improvements
See our Contributing Guide for more details.
Now that you have answers to common questions, you can explore: