The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Database Formats and Multi-Contig Support

Overview

Starting with misha 5.3.0, databases can be stored in two formats:

Indexed format (default, recommended): Single unified files for sequences and tracks
Per-chromosome format (legacy): Separate files for each chromosome

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).

Key Features

Automatic format detection - misha automatically detects which format your database uses
Fully backward compatible - existing databases continue to work without modification
Transparent to users - same API for both formats
Migration tools - convert databases when convenient
Performance benefits - 4-14% faster for large-scale analyses

Database Formats

Indexed Format (Recommended)

The indexed format uses unified files:

Sequence data: - seq/genome.seq - All chromosome sequences concatenated - seq/genome.idx - Index mapping chromosome names to positions

Track data: - tracks/mytrack.track/track.dat - All chromosome data concatenated - tracks/mytrack.track/track.idx - Index with offset/length per chromosome

Advantages: - Fewer file descriptors (important for genomes with 100+ contigs) - Better performance for large workloads (14% faster) - Smaller disk footprint - Faster track creation and conversion

Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

Sequence data: - seq/chr1.seq, seq/chr2.seq, … - One file per chromosome

Track data: - tracks/mytrack.track/chr1.track, chr2.track, … - One file per chromosome

When to use: - Compatibility with older misha versions (<5.3.0) - Small genomes (<25 chromosomes) where performance difference is negligible

Creating Databases

New Databases (Indexed Format)

By default, new databases use the indexed format:

# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")

Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")

Checking Database Format

Use gdb.info() to check your database format:

gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"

Example output:

info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983

Converting Databases

Convert Entire Database

Convert all tracks and sequences to indexed format:

gsetroot("/path/to/mydb")
gdb.convert_to_indexed()

This will: 1. Convert sequence files (chr*.seq → genome.seq + genome.idx) 2. Convert all tracks to indexed format 3. Validate conversions 4. Remove old files after successful conversion

Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

gtrack.convert_to_indexed("mytrack")

Note that 2D tracks cannot be converted to indexed format yet.

Convert Intervals

Convert interval sets to indexed format:

# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")

Migration Guide

When to Migrate

High priority (significant benefits): - Genomes with many contigs (>50 chromosomes) - Large-scale analyses (10M+ bp regions frequently) - 2D track workflows - File descriptor limit issues

Medium priority (moderate benefits): - Repeated extraction workflows - Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits): - Small genomes (<25 chromosomes) - One-off analyses - Simple queries on small regions

Migration Workflow

Step 1: Backup (optional but recommended)

# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")

Step 2: Check current format

gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))

Step 3: Convert

gdb.convert_to_indexed()

Step 4: Verify

# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)

# After thorough testing
system("rm -rf /path/to/mydb.backup")

Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

Method 1: Export and Import

# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

Method 2: Batch Copy

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export
    gsetroot("/path/to/source_db")
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)

    # Import
    gsetroot("/path/to/target_db")
    info <- gtrack.info(track) # Get description
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

Operations Faster with Indexed Format

Very large workloads (10M+ bp): 14% faster
2D track operations: 2% faster (was 36% slower in legacy)
Repeated extractions: 14% faster
Real workflows: 8% faster on average

Operations Similar Performance

Single chromosome extraction: Within 5%
Multi-chromosome (10-22 chr): Within 1%
PWM operations: Within 3%
Small workloads (<1M bp): Within 10%

Summary

64% of operations are faster with indexed format
93% within 5% of legacy format (statistically equal)
Average: 4% faster across all benchmarks
No regressions for common use cases

Backward Compatibility

Fully Compatible

✅ Existing databases work without modification
✅ Existing scripts work without changes
✅ Same API for both formats
✅ Automatic format detection
✅ Can mix formats in same analysis

Example: Mixed Environment

# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))

Troubleshooting

“File descriptor limit reached”

This occurs with many-contig genomes in legacy format:

Solution: Convert to indexed format

gdb.convert_to_indexed()

“Track not found after copying files”

After manually copying track directories:

Solution: Reload database

gdb.reload()

“Conversion fails with disk space error”

Conversion needs 2x track size temporarily:

Solution: Free disk space or convert tracks individually

# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.

Best Practices

For New Projects

Use indexed format (default) for new databases
Use gdb.create_genome() for standard genomes
Use gdb.create() with multi-FASTA for custom genomes

For Existing Projects

Check format with gdb.info()
Migrate if beneficial (many contigs, large analyses)
No rush - legacy format remains fully supported

Summary

Indexed format is the default and recommended for new databases
Legacy format remains fully supported - no forced migration
Automatic format detection - users don’t need to think about it
Significant performance benefits for large-scale analyses
Easy migration with gdb.convert_to_indexed()
Fully backward compatible - existing code works unchanged

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.