
Getting Started

rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq and single-cell RNA-seq.

Alternatively, you can generate datasets with AI from our web platform.

How to install

You can install rsynthbio from CRAN:

install.packages("rsynthbio")

To get the development version, install it from GitHub using the remotes package:

# Install remotes only if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")

Once installed, load the package:

library(rsynthbio)

Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)

Loading your API token for a session

# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()

You can obtain an API token by registering at Synthesize Bio.

Security Best Practices

For security reasons, remember to clear your token when you’re done:

# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)

Never hard-code your token in scripts that will be shared or committed to version control.
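One way to make sure the token is cleared even when a request fails is to tie the cleanup to on.exit(). The sketch below is a minimal illustration, not part of rsynthbio: with_cleanup() is a hypothetical helper, and the failing task stands in for an API call so the example runs without the package or a token.

```r
# Hypothetical helper (not part of rsynthbio): run a task and always
# run the cleanup function afterwards, even if the task raises an error.
with_cleanup <- function(task, cleanup) {
  on.exit(cleanup(), add = TRUE)
  task()
}

# Stand-in demonstration: the task fails, but cleanup still runs.
token_cleared <- FALSE
outcome <- tryCatch(
  with_cleanup(
    task = function() stop("simulated API failure"),
    cleanup = function() token_cleared <<- TRUE
  ),
  error = function(e) "task failed"
)

token_cleared
outcome
```

With the package loaded, the same pattern would presumably look like with_cleanup(function() predict_query(query), clear_synthesize_token), using only the functions shown above.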

Designing Queries for Models

Choosing a Modality

The modality (the type of data to generate) is set when you create a query with get_valid_query().

You can check which modalities are available programmatically:

# Check available modalities
get_valid_modalities()

You do not need to specify any internal API slugs. The library maps modalities to the appropriate model endpoints automatically.

# Create a bulk query
bulk_query <- get_valid_query(modality = "bulk")
bulk <- predict_query(bulk_query, as_counts = TRUE)

# Create a single-cell query
sc_query <- get_valid_query(modality = "single-cell")
sc <- predict_query(sc_query, as_counts = TRUE)

Creating a Query

The structure of the query required by the API is fixed for the currently supported model. You can use get_valid_query() to get a correctly structured example list.

# Get the example query structure
example_query <- get_valid_query()

# Inspect the query structure
str(example_query)

The query consists of:

  1. modality: The type of gene expression data to generate (“bulk” or “single-cell”)
  2. mode: The prediction mode that controls how expression data is generated:
    • “sample generation”: Generates realistic-looking synthetic data with measurement error (bulk only)
    • “mean estimation”: Provides stable mean estimates of expression levels (bulk and single-cell)
  3. inputs: A list of biological conditions to generate data for

Each input contains metadata (describing the biological sample) and num_samples (how many samples to generate).

See the Query Parameters section below for detailed documentation on mode and the optional query fields.
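To make the structure described above concrete, here is a hand-built list with the same three top-level fields. It is only a sketch: the metadata values are borrowed from the example later in this document, and the list returned by get_valid_query() remains the authoritative template, which may contain additional fields.

```r
# A hand-written list mirroring the documented query layout.
# Illustrative only; prefer get_valid_query() as the starting point.
manual_query <- list(
  modality = "bulk",
  mode = "sample generation",
  inputs = list(
    list(
      metadata = list(
        sex = "male",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 5
    )
  )
)

# Inspect the nesting: inputs is a list of lists,
# each holding metadata and num_samples.
str(manual_query)
```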

Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

result <- predict_query(query, as_counts = TRUE)

The result is a list of two data frames: metadata and expression.

Understanding the Async API

Behind the scenes, the API uses an asynchronous model to handle queries efficiently:

  1. Your query is submitted to the API, which returns a query ID
  2. The function automatically polls the status endpoint (default: every 2 seconds)
  3. When the query completes, results are downloaded from a signed URL
  4. Data is parsed and returned as R data frames

All of this happens automatically when you call predict_query().
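The polling step can be pictured with a generic loop like the one below. This is a sketch of the general pattern, not the package's implementation: poll_until_ready(), the status strings "ready", "running" and "failed", and the fake status function are all assumptions made so the example runs offline. Only the default interval (2 seconds) and timeout (900 seconds) come from this document.

```r
# Generic polling loop (illustrative; not rsynthbio's internal code).
# get_status is any zero-argument function returning a status string.
poll_until_ready <- function(get_status,
                             poll_interval_seconds = 2,
                             poll_timeout_seconds = 900) {
  deadline <- Sys.time() + poll_timeout_seconds
  repeat {
    status <- get_status()
    if (identical(status, "ready")) {
      return(invisible(TRUE))
    }
    if (identical(status, "failed")) {
      stop("query failed")
    }
    if (Sys.time() >= deadline) {
      stop("timed out waiting for query")
    }
    Sys.sleep(poll_interval_seconds)
  }
}

# Fake status source: reports "running" until the third call.
make_fake_status <- function(ready_after) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= ready_after) "ready" else "running"
  }
}

done <- poll_until_ready(
  make_fake_status(ready_after = 3),
  poll_interval_seconds = 0.01,
  poll_timeout_seconds = 5
)
```

A shorter interval notices completion sooner at the cost of more status requests; a longer timeout only matters for queries that take a long time to finish.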

Controlling Async Behavior

You can customize the polling behavior if needed:

# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5 # Check every 5 seconds instead of 2
)

Valid Metadata Keys

The input metadata is a list of lists. This is the full list of valid metadata keys:

Biological:

Perturbational:

Technical:

Valid Metadata Values

The following are the valid values or expected formats for selected metadata keys:

Metadata Field            Requirement / Example
cell_line_ontology_id     Requires a Cellosaurus ID.
cell_type_ontology_id     Requires a CL ID.
disease_ontology_id       Requires a MONDO ID.
perturbation_ontology_id  Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606).
tissue_ontology_id        Requires a UBERON ID.

We highly recommend using the EMBL-EBI Ontology Lookup Service to find valid IDs for your metadata.

Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
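A quick local format check can catch obvious typos before a query is sent. The sketch below is hypothetical and not part of rsynthbio: the regular expressions are assumptions about how each ID type is usually written, and passing the check only means the string looks like the right kind of ID, not that the model accepts that particular value.

```r
# Assumed ID shapes per metadata field (illustrative, deliberately loose).
id_patterns <- c(
  cell_line_ontology_id    = "^CVCL_[A-Za-z0-9]+$",
  cell_type_ontology_id    = "^CL:[0-9]+$",
  disease_ontology_id      = "^MONDO:[0-9]+$",
  tissue_ontology_id       = "^UBERON:[0-9]+$",
  perturbation_ontology_id = "^(ENSG[0-9]+|CHEBI:[0-9]+|CHEMBL[0-9]+|[0-9]+)$"
)

# Hypothetical helper: TRUE/FALSE for known fields, NA for fields
# that have no pattern in the table above.
looks_like_valid_id <- function(field, value) {
  if (!field %in% names(id_patterns)) {
    return(NA)
  }
  grepl(id_patterns[[field]], value)
}

looks_like_valid_id("tissue_ontology_id", "UBERON:0002371")
looks_like_valid_id("perturbation_ontology_id", "CHEBI:16681")
looks_like_valid_id("disease_ontology_id", "UBERON:0002371")
```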

Query Parameters

In addition to metadata, queries support several parameters that control the generation process. The mode parameter is required; the others are optional:

mode (character, required)

Controls the type of prediction the model generates. This parameter is required in all queries.

Available modes:

  • “sample generation”: The model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled to generate realistic-looking synthetic data that captures the error associated with measurements. This mode is useful when you want data that mimics real experimental measurements.

  • “mean estimation”: The model creates a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error. The mean of that distribution serves as the prediction. This mode is useful when you want a stable estimate of expected expression levels.

Note: Single-cell queries only support “mean estimation” mode. Bulk queries support both modes.

# Bulk query with sample generation (default for bulk)
bulk_query <- get_valid_query(modality = "bulk")
bulk_query$mode <- "sample generation"

# Bulk query with mean estimation
bulk_query_mean <- get_valid_query(modality = "bulk")
bulk_query_mean$mode <- "mean estimation"

# Single-cell query (must use mean estimation)
sc_query <- get_valid_query(modality = "single-cell")
sc_query$mode <- "mean estimation" # Required for single-cell
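The modality and mode rule stated in the note can also be checked locally before a query is submitted. This is a minimal sketch under the constraints described in this document; check_mode() and supported_modes are hypothetical names, and the package's own validation may already cover the same case.

```r
# Mode support per modality, as described in the note above.
supported_modes <- list(
  bulk = c("sample generation", "mean estimation"),
  `single-cell` = c("mean estimation")
)

# Hypothetical helper: stop with a clear message if the combination
# of modality and mode is not supported.
check_mode <- function(query) {
  allowed <- supported_modes[[query$modality]]
  if (is.null(allowed)) {
    stop("Unknown modality: ", query$modality, call. = FALSE)
  }
  if (!query$mode %in% allowed) {
    stop(
      sprintf(
        "Mode '%s' is not supported for modality '%s'",
        query$mode, query$modality
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Works on any list with modality and mode fields.
check_mode(list(modality = "bulk", mode = "sample generation"))
check_mode(list(modality = "single-cell", mode = "mean estimation"))
```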

total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.

# Create a query and add custom total_count
query <- get_valid_query(modality = "bulk")
query$total_count <- 5000000
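The proportional scaling follows from the definition of counts per million: once values are on the CPM scale, counts are CPM multiplied by the library size and divided by one million. The sketch below illustrates only that arithmetic with made-up CPM values; the log transform and rounding used by the API are not described in this document, so cpm_to_counts() is a hypothetical helper and not the service's actual conversion.

```r
# Hypothetical helper: convert CPM values to counts for a library size.
cpm_to_counts <- function(cpm, total_count) {
  round(cpm * total_count / 1e6)
}

# Made-up CPM values that sum to one million.
cpm <- c(gene_a = 200000, gene_b = 300000, gene_c = 500000)

counts_5m <- cpm_to_counts(cpm, total_count = 5e6)
counts_10m <- cpm_to_counts(cpm, total_count = 1e7)

# Doubling total_count doubles every count.
counts_5m
counts_10m
```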

deterministic_latents (logical, optional)

If TRUE, the model uses the mean of each latent distribution (p(z|metadata) or q(z|x)) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

  • Default: FALSE (sampling is enabled)

# Create a query and enable deterministic latents
query <- get_valid_query(modality = "bulk")
query$deterministic_latents <- TRUE

seed (integer, optional)

Random seed for reproducibility when using stochastic sampling.

# Create a query with a specific seed
query <- get_valid_query(modality = "bulk")
query$seed <- 42
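The difference between seeded sampling and deterministic latents can be shown with a toy example in base R. This is an analogy only, assuming a Gaussian latent with made-up means and standard deviations; draw_latent() is a hypothetical function and says nothing about how the model is implemented on the server.

```r
# Toy latent draw: return the mean when deterministic, otherwise sample.
draw_latent <- function(mu, sigma, deterministic = FALSE) {
  if (deterministic) {
    return(mu)
  }
  rnorm(length(mu), mean = mu, sd = sigma)
}

mu <- c(0.5, -1.2, 2.0)
sigma <- c(0.1, 0.3, 0.2)

# Same seed, same draws: stochastic but reproducible.
set.seed(42)
z_first <- draw_latent(mu, sigma)
set.seed(42)
z_second <- draw_latent(mu, sigma)

# Deterministic mode ignores the random number generator entirely.
z_deterministic <- draw_latent(mu, sigma, deterministic = TRUE)

identical(z_first, z_second)
identical(z_deterministic, mu)
```

In this toy setting a seed keeps the variability of sampling while making it repeatable, whereas the deterministic option removes that source of variability.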

You can combine multiple parameters in a single query:

# Create a query and add multiple parameters
query <- get_valid_query(modality = "bulk")
query$total_count <- 8000000
query$deterministic_latents <- TRUE
query$mode <- "mean estimation"

results <- predict_query(query)

Modifying Query Inputs

You can customize the query inputs to fit your specific research needs:

# Get a base query
query <- get_valid_query()

# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10

# Add a new condition after the existing inputs
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
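When several conditions need to be added, appending can be wrapped in a small helper so that no index has to be tracked by hand. The function below is a hypothetical convenience, not part of rsynthbio; it works on any list that follows the query layout described earlier, and the starting query here is a bare stand-in so the example runs without the package.

```r
# Hypothetical helper: append one condition to a query's inputs.
add_condition <- function(query, metadata, num_samples) {
  new_input <- list(metadata = metadata, num_samples = num_samples)
  query$inputs <- c(query$inputs, list(new_input))
  query
}

# Stand-in query with no inputs yet (use get_valid_query() in practice).
demo_query <- list(modality = "bulk", mode = "sample generation", inputs = list())

demo_query <- add_condition(
  demo_query,
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
demo_query <- add_condition(
  demo_query,
  metadata = list(sex = "male", sample_type = "primary tissue"),
  num_samples = 10
)

length(demo_query$inputs)
```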

Working with Results

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)

You may want to process the data in chunks or save it for later use:

# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
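Processing in chunks can be done with base R by splitting the rows of a data frame into groups of a fixed size. The sketch below uses a small stand-in data frame rather than a real result, so the column names and the meaning of rows are assumptions; the same split would apply to result$expression if its rows are the unit you want to batch over.

```r
# Stand-in for an expression data frame (made-up values).
expression_demo <- data.frame(gene_a = 1:10, gene_b = 11:20)

# Assign each row to a chunk of at most chunk_size rows.
chunk_size <- 4
chunk_ids <- ceiling(seq_len(nrow(expression_demo)) / chunk_size)
chunks <- split(expression_demo, chunk_ids)

# Process each chunk independently, here by summing each column.
chunk_sums <- lapply(chunks, colSums)

length(chunks)
vapply(chunks, nrow, integer(1))
```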

Custom Validation

You can validate your queries before sending them to the API:

# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)

Session info

sessionInfo()

Additional Resources
