Starting with misha 5.3.0, databases can be stored in two formats: an indexed format, which is the default for new databases, and a per-chromosome (legacy) format.
The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
The indexed format uses unified files:

Sequence data:
- seq/genome.seq: All chromosome sequences concatenated
- seq/genome.idx: Index mapping chromosome names to positions

Track data:
- tracks/mytrack.track/track.dat: All chromosome data concatenated
- tracks/mytrack.track/track.idx: Index with offset/length per chromosome

Advantages:
- Fewer file descriptors (important for genomes with 100+ contigs)
- Better performance for large workloads (14% faster)
- Smaller disk footprint
- Faster track creation and conversion
The per-chromosome format uses separate files:

Sequence data:
- seq/chr1.seq, seq/chr2.seq, …: One file per chromosome

Track data:
- tracks/mytrack.track/chr1.track, chr2.track, …: One file per chromosome

When to use:
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where performance difference is negligible
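The two layouts can be told apart by looking at the seq directory. The helper below is a minimal sketch in base R that only checks the sequence files described above; `guess_db_format()` is a hypothetical name, not a misha function, and the demo directory is a throwaway stand-in for a real database root.

```r
# Guess the database format from the files under <db_root>/seq.
# Hypothetical helper for illustration; it does not call misha.
guess_db_format <- function(db_root) {
  seq_dir <- file.path(db_root, "seq")
  has_indexed <- file.exists(file.path(seq_dir, "genome.seq")) &&
    file.exists(file.path(seq_dir, "genome.idx"))
  if (has_indexed) {
    return("indexed")
  }
  # Any other *.seq file is taken to be a per-chromosome sequence file
  per_chrom <- setdiff(list.files(seq_dir, pattern = "\\.seq$"), "genome.seq")
  if (length(per_chrom) > 0) {
    return("per-chromosome")
  }
  "unknown"
}

# Demo on an empty mock layout (file names only, no real sequence data)
demo_root <- tempfile("demo_db_")
dir.create(file.path(demo_root, "seq"), recursive = TRUE)
file.create(file.path(demo_root, "seq", c("chr1.seq", "chr2.seq")))
guess_db_format(demo_root) # "per-chromosome"
```

For an actual database, `gdb.info()` remains the supported way to query the format; the file check is only useful when misha is not loaded.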
By default, new databases use the indexed format:
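A minimal sketch of creating a database from a multi-FASTA file, assuming misha 5.3.0 or later is installed. `create_db_from_fasta()` and both paths are hypothetical; `gdb.create()` and `gsetroot()` are the misha calls doing the work.

```r
# Build a database from a multi-FASTA file and make it the active database.
# `fasta_path` and `db_root` are hypothetical paths supplied by the caller.
create_db_from_fasta <- function(fasta_path, db_root) {
  # Fail early on a missing FASTA file or an already existing target directory
  stopifnot(file.exists(fasta_path), !file.exists(db_root))
  misha::gdb.create(groot = db_root, fasta = fasta_path)
  misha::gsetroot(db_root)
  invisible(db_root)
}

# Usage (not run):
# create_db_from_fasta("genome.fa", "/path/to/mydb")
# For a standard genome, gdb.create_genome() downloads and builds it instead:
# misha::gdb.create_genome("hg38")
```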
Use gdb.info() to check your database format:
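A short sketch of the check, assuming misha 5.3.0 or later is installed. `report_db_format()` and the path in the usage line are hypothetical; the `format` field is the one read in the verification step later in this document.

```r
# Report the format of the database at `db_root` (hypothetical path).
report_db_format <- function(db_root) {
  misha::gsetroot(db_root)
  info <- misha::gdb.info()
  message("Database format: ", info$format)
  invisible(info)
}

# Usage (not run):
# info <- report_db_format("/path/to/db")
```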
Convert all tracks and sequences to indexed format:
This will:
1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
2. Convert all tracks to indexed format
3. Validate conversions
4. Remove old files after successful conversion
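The conversion can be wrapped so that the format is recorded before and after. This is a sketch under assumptions: `convert_db_to_indexed()` is a hypothetical wrapper, and `gdb.convert_to_indexed()` is called without arguments on the assumption that it then acts on the active database. Its defaults may not cover all four steps listed above, so check `?gdb.convert_to_indexed` for the arguments that control track conversion, validation, and removal of old files.

```r
# Convert the database at `db_root` (hypothetical path) and report the change.
convert_db_to_indexed <- function(db_root) {
  misha::gsetroot(db_root)
  before <- misha::gdb.info()$format
  # Assumption: with no arguments the active database is converted using the
  # function's defaults; pass explicit arguments as documented in the help page.
  misha::gdb.convert_to_indexed()
  after <- misha::gdb.info()$format
  message("Format: ", before, " -> ", after)
  invisible(after)
}

# Usage (not run):
# convert_db_to_indexed("/path/to/db")
```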
Convert specific tracks while keeping others in legacy format:
Note that 2D tracks cannot be converted to indexed format yet.
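One way to convert a chosen set of tracks while skipping 2D ones is sketched below. It assumes that the per-track converter is exposed as `gtrack.convert_to_indexed()` and that `gtrack.info()` reports a `dimensions` field; both should be checked against the installed misha version. `convert_1d_tracks()` is a hypothetical helper, and its two function arguments exist only so the loop can be tried without a database.

```r
# Convert the given tracks one by one, skipping anything that is not 1D.
# Defaults are only evaluated when used, so misha is needed only at call time.
convert_1d_tracks <- function(tracks,
                              info_fun = misha::gtrack.info,
                              convert_fun = misha::gtrack.convert_to_indexed) {
  converted <- character(0)
  for (track in tracks) {
    if (info_fun(track)$dimensions != 1) {
      message("Skipping 2D track: ", track)
      next
    }
    convert_fun(track)
    converted <- c(converted, track)
  }
  invisible(converted)
}

# Usage (not run), with an active database:
# convert_1d_tracks(c("track1", "track2"))
```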
High priority (significant benefits):
- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses (frequent work on regions of 10M+ bp)
- 2D track workflows
- File descriptor limit issues

Medium priority (moderate benefits):
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits):
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
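The numeric thresholds above can be written down as a small decision helper. This is an illustrative sketch only: `conversion_priority()` is a hypothetical function, it covers just the contig-count, region-size, and file-descriptor criteria, and the 25 to 50 contig range is not covered by the guidance, so treating it as medium is an assumption.

```r
# Rough conversion priority from the thresholds listed above.
conversion_priority <- function(n_contigs,
                                typical_region_bp = 0,
                                fd_limit_issues = FALSE) {
  if (n_contigs > 50 || typical_region_bp >= 10e6 || fd_limit_issues) {
    return("high")
  }
  if (typical_region_bp >= 1e6) {
    return("medium")
  }
  if (n_contigs < 25) {
    return("low")
  }
  "medium" # 25 to 50 contigs: assumption, not stated in the guidance
}

conversion_priority(n_contigs = 200) # "high"
conversion_priority(n_contigs = 24, typical_region_bp = 5e6) # "medium"
conversion_priority(n_contigs = 24, typical_region_bp = 1000) # "low"
```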
Step 1: Backup (optional but recommended)
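A backup can be a plain recursive copy of the database directory. The sketch below uses base R only; `backup_db()` and both paths are hypothetical, and for large databases a file-system level copy or snapshot may be preferable.

```r
# Copy everything under `db_root` into a new directory `backup_root`.
backup_db <- function(db_root, backup_root) {
  # Refuse to overwrite an existing backup
  stopifnot(dir.exists(db_root), !file.exists(backup_root))
  dir.create(backup_root, recursive = TRUE)
  entries <- list.files(db_root, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  ok <- file.copy(entries, backup_root, recursive = TRUE)
  if (!all(ok)) {
    stop("Backup incomplete: ", backup_root)
  }
  invisible(backup_root)
}

# Usage (not run):
# backup_db("/path/to/db", "/path/to/db_backup")
```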
Step 2: Check current format
Step 3: Convert
Step 4: Verify
# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))
# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)
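Removing the backup is a recursive delete, so a guard against pointing it at the live database is worthwhile. A minimal base R sketch; `remove_backup()` and both paths are hypothetical.

```r
# Delete `backup_root`, refusing if it is, or contains, the live database.
remove_backup <- function(backup_root, db_root) {
  backup_root <- normalizePath(backup_root, mustWork = TRUE)
  db_root <- normalizePath(db_root, mustWork = TRUE)
  inside <- startsWith(db_root, paste0(backup_root, .Platform$file.sep))
  if (backup_root == db_root || inside) {
    stop("Refusing to delete the live database: ", db_root)
  }
  status <- unlink(backup_root, recursive = TRUE)
  invisible(status == 0) # unlink() returns 0 on success
}

# Usage (not run):
# remove_backup("/path/to/db_backup", "/path/to/db")
```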
You can freely copy tracks between databases with different formats.
# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
  iterator = "mytrack",
  file = "/tmp/mytrack.txt"
)
# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")
for (track in tracks) {
  # Export from the source database
  gsetroot("/path/to/source_db")
  info <- gtrack.info(track) # Get description while the source database is active
  file_path <- sprintf("/tmp/%s.txt", track)
  gextract(track, gintervals.all(), iterator = track, file = file_path)
  # Import into the target database
  gsetroot("/path/to/target_db")
  gtrack.import(track, info$description, file_path, binsize = 0)
  unlink(file_path)
}

Based on comprehensive benchmarks comparing indexed vs legacy formats:
Running out of file descriptors can occur with many-contig genomes in legacy format.
Solution: Convert to indexed format
After manually copying track directories:
Solution: Reload database
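A sketch of the copy-then-reload sequence, assuming misha is installed. `copy_track_and_reload()` and the paths are hypothetical; `gdb.reload()` rescans the database so that manually added tracks are picked up.

```r
# Copy a track directory into a database by hand, then rescan the database.
copy_track_and_reload <- function(track_dir, db_root) {
  tracks_dir <- file.path(db_root, "tracks")
  stopifnot(dir.exists(track_dir), dir.exists(tracks_dir))
  ok <- file.copy(track_dir, tracks_dir, recursive = TRUE)
  if (!all(ok)) {
    stop("Copy failed: ", track_dir)
  }
  misha::gsetroot(db_root)
  misha::gdb.reload()
  invisible(file.path(tracks_dir, basename(track_dir)))
}

# Usage (not run):
# copy_track_and_reload("/path/to/source_db/tracks/mytrack.track", "/path/to/target_db")
```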
- gdb.create_genome() for standard genomes
- gdb.create() with multi-FASTA for custom genomes
- gdb.info()
- gdb.convert_to_indexed()