The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Database Formats and Multi-Contig Support

library(misha)

Overview

Starting with misha 5.3.0, databases can be stored in two formats:

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).

Key Features

Database Formats

Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

Sequence data: - seq/chr1.seq, seq/chr2.seq, … - One file per chromosome

Track data: - tracks/mytrack.track/chr1.track, chr2.track, … - One file per chromosome

When to use: - Compatibility with older misha versions (<5.3.0) - Small genomes (<25 chromosomes) where performance difference is negligible

Creating Databases

New Databases (Indexed Format)

By default, new databases use the indexed format:

# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")

Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")

Checking Database Format

Use gdb.info() to check your database format:

gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"

Example output:

info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983

Converting Databases

Convert Entire Database

Convert all tracks and sequences to indexed format:

gsetroot("/path/to/mydb")
gdb.convert_to_indexed()

This will: 1. Convert sequence files (chr*.seqgenome.seq + genome.idx) 2. Convert all tracks to indexed format 3. Validate conversions 4. Remove old files after successful conversion

Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

gtrack.convert_to_indexed("mytrack")

Note that 2D tracks cannot be converted to indexed format yet.

Convert Intervals

Convert interval sets to indexed format:

# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")

Migration Guide

When to Migrate

High priority (significant benefits): - Genomes with many contigs (>50 chromosomes) - Large-scale analyses (10M+ bp regions frequently) - 2D track workflows - File descriptor limit issues

Medium priority (moderate benefits): - Repeated extraction workflows - Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits): - Small genomes (<25 chromosomes) - One-off analyses - Simple queries on small regions

Migration Workflow

Step 1: Backup (optional but recommended)

# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")

Step 2: Check current format

gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))

Step 3: Convert

gdb.convert_to_indexed()

Step 4: Verify

# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)

# After thorough testing
system("rm -rf /path/to/mydb.backup")

Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

Method 1: Export and Import

# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

Method 2: Batch Copy

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export
    gsetroot("/path/to/source_db")
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)

    # Import
    gsetroot("/path/to/target_db")
    info <- gtrack.info(track) # Get description
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

Operations Faster with Indexed Format

Operations Similar Performance

Summary

Backward Compatibility

Fully Compatible

Example: Mixed Environment

# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))

Troubleshooting

“File descriptor limit reached”

This occurs with many-contig genomes in legacy format:

Solution: Convert to indexed format

gdb.convert_to_indexed()

“Track not found after copying files”

After manually copying track directories:

Solution: Reload database

gdb.reload()

“Conversion fails with disk space error”

Conversion needs 2x track size temporarily:

Solution: Free disk space or convert tracks individually

# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.

Best Practices

For New Projects

  1. Use indexed format (default) for new databases
  2. Use gdb.create_genome() for standard genomes
  3. Use gdb.create() with multi-FASTA for custom genomes

For Existing Projects

  1. Check format with gdb.info()
  2. Migrate if beneficial (many contigs, large analyses)
  3. No rush - legacy format remains fully supported

Summary

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.