library(immundata)
library(duckplyr)

# Paths to the example metadata and repertoire files shipped with the package
md_path <- system.file("extdata", "metadata_samples.tsv", package = "immundata")
samples <- c(
  system.file("extdata", "sample_0_1k.tsv", package = "immundata"),
  system.file("extdata", "sample_1k_2k.tsv", package = "immundata")
)

# Load the metadata first, then the repertoires
md <- load_metadata(md_path)
imdata <- load_repertoires(samples, c("cdr3_aa", "v_call"), md)
parquet etc.
immundata
Suppose you have several files. How do you read them?
immundata keeps the parts of the loading process modular instead of relying on one big function. Hence, immundata splits the repertoire dataset loading into three steps:
1. Optionally, load the metadata via load_metadata.
2. Load the repertoire files from disk via load_repertoires and convert them into immundata files.
3. Load the ImmunData files from the converted files via load_immundata, which runs as the final step of load_repertoires.
After converting the files to the immundata format, you can load them directly with load_immundata.
This is the key concept that distinguishes immundata from DataFrame-based libraries.
???
…
…
…
…
…
Take a look at immunarch.
…
…
immundata is free for commercial use. However, corporate users will not get prioritized support for immundata-related issues, immune repertoire analysis questions, or data engineering questions related to building scalable immune repertoire and other -omics pipelines. The priority of the open-source tool immundata is open-source science.
If you are looking for prioritized support or help with setting up your data pipelines, consider contacting Vadim Nazarov for commercial consulting and support options.
Q: Why are all the function names and ImmunData fields so long? I want to write imdata$rec instead of imdata$receptors.
A: Two major reasons: longer names make the code more readable, and they encourage you to use autocomplete tools.
Q: How does immundata work under the hood, in simpler terms?
A: immundata uses the fantastic duckplyr package.
References:
Q: Why do you need to create Parquet files with receptors and annotations?
A: First of all, you can turn it off. Second, those are intermediate files, optimized for future data operations, and working with them significantly accelerates immundata. Take a look at our benchmark page to learn more: link
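To make the idea of intermediate Parquet files concrete, here is a minimal sketch of the general "convert once, query lazily" pattern using duckplyr directly. It is not immundata's internal code: the toy receptor table, its values, and the temporary file path are assumptions made up for illustration, and it assumes duckplyr 1.0 or later, where as_duckdb_tibble(), compute_parquet(), and read_parquet_duckdb() are available.

```r
library(dplyr)
library(duckplyr)

# Toy receptor table (made-up values, for illustration only)
receptors <- tibble(
  cdr3_aa = c("CASSLGTDTQYF", "CASSPGQGNTEAFF", "CASSLGTDTQYF"),
  v_call  = c("TRBV5-1", "TRBV7-9", "TRBV5-1"),
  sample  = c("s1", "s1", "s2")
)

parquet_path <- tempfile(fileext = ".parquet")

# One-time conversion: write the table to a Parquet file
as_duckdb_tibble(receptors) |>
  compute_parquet(parquet_path)

# Later sessions: read the Parquet file lazily and run the query in DuckDB;
# rows are pulled into R only when collect() is called
result <- read_parquet_duckdb(parquet_path) |>
  filter(v_call == "TRBV5-1") |>
  count(cdr3_aa) |>
  collect()

print(result)
```

The conversion cost is paid once; subsequent queries read only the columns and rows they need from the Parquet file instead of parsing the original text files again.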
Q: Why does immundata support only the AIRR standard?!
A: Because standards matter. Still, immundata allows some flexibility: you can provide column names for barcodes, etc.
Q: Why is it so complex? Why do we need to use dplyr instead of plain R?
A: The short answer is:
For the long answer, let me give you more details on each of the bullet points.
Q: How do I get to use all operations from dplyr? duckplyr doesn’t support some operations that I need.
A: Let’s consider several use cases.
Case 0. You are missing group_by from dplyr.
Case 1. Your data can fit into RAM.
Case 2. Your data won’t fit into RAM, and you really need to work on all of this data.
Case 3. Your data won’t fit into RAM, but before running intensive computations, you are open to working with smaller dataset first.
Q: You filter out non-productive receptors. How do I explore them?
A: option for saving non-productive chains to a separate file
Q: Why does immundata have its own column names for receptors and repertoires? Could you just use the AIRR format: repertoire_id, etc.?
A: The power of immundata lies in the fast re-aggregation of the data, which lets you work with whatever you define as a repertoire on the fly via ImmunData$build_repertoires(schema = ...)
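The idea of re-aggregation can be illustrated with plain dplyr. The sketch below is not how build_repertoires is implemented; aggregate_repertoires is a hypothetical helper, the toy annotation table is made up, and it assumes that a schema is simply a set of column names whose unique combinations define repertoires.

```r
library(dplyr)

# Toy annotation table: one row per receptor (made-up values)
annotations <- tibble(
  receptor_id = 1:6,
  sample = c("s1", "s1", "s2", "s2", "s3", "s3"),
  donor  = c("d1", "d1", "d1", "d1", "d2", "d2"),
  tissue = c("blood", "blood", "blood", "blood", "tumor", "tumor")
)

# Hypothetical helper: every unique combination of the schema columns
# becomes one repertoire, with a fresh repertoire_id and a receptor count
aggregate_repertoires <- function(annotations, schema) {
  annotations |>
    group_by(across(all_of(schema))) |>
    summarise(n_receptors = n(), .groups = "drop") |>
    mutate(repertoire_id = row_number(), .before = 1)
}

# The same receptors, grouped into repertoires in two different ways
by_sample <- aggregate_repertoires(annotations, "sample")
by_donor_tissue <- aggregate_repertoires(annotations, c("donor", "tissue"))

print(by_sample)
print(by_donor_tissue)
```

Because the repertoire identifier is derived from the chosen schema rather than stored as a fixed repertoire_id column, switching from per-sample to per-donor-and-tissue repertoires only requires re-running the aggregation.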
Q: What do I do with the following error: "Error in compute_parquet() at […]: ! {"exception_type":"IO","exception_message":"Failed to write […]: Failed to read file […]: schema mismatch in glob: column […] was read from the original file […], but could not be found in file […] If you are trying to read files with different schemas, try setting union_by_name=True"}"?
A: It means that your repertoire files have different schemas, i.e., different column names. You have two options.
Option 1: Check the data and fix the schema. Explore the reason why the data have different schemas. Remove wrong files. Change column names. And try again.
Option 2: If you know what you are doing, pass the argument enforce_schema = FALSE to load_repertoires. The resulting table will have NAs in place of the missing values. But don’t use it without considering the first option. A broken schema usually means that there are some issues in how the data were processed.
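For Option 1, a quick way to see which files disagree is to compare their header lines before loading anything. The sketch below uses base R only; find_schema_mismatches is a hypothetical helper, not part of immundata, and it assumes tab-separated files whose first line holds the column names. The two demo files are created on the fly for illustration.

```r
# Hypothetical helper (not part of immundata): read only the first line of
# each file and report, per file, the columns present elsewhere but missing here
find_schema_mismatches <- function(paths, sep = "\t") {
  headers <- lapply(paths, function(p) {
    strsplit(readLines(p, n = 1), sep, fixed = TRUE)[[1]]
  })
  names(headers) <- paths
  all_columns <- unique(unlist(headers))
  missing <- lapply(headers, function(h) setdiff(all_columns, h))
  # Keep only the files that lack at least one column
  missing[lengths(missing) > 0]
}

# Demo with two small files, one of which lacks the j_call column
demo_dir <- tempfile("schema_demo")
dir.create(demo_dir)
file_a <- file.path(demo_dir, "a.tsv")
file_b <- file.path(demo_dir, "b.tsv")
writeLines(c("cdr3_aa\tv_call\tj_call", "CASSLGTDTQYF\tTRBV5-1\tTRBJ2-3"), file_a)
writeLines(c("cdr3_aa\tv_call", "CASSPGQGNTEAFF\tTRBV7-9"), file_b)

mismatches <- find_schema_mismatches(c(file_a, file_b))
print(mismatches)
```

An empty result means all files share the same column names. Any file listed in the result is a candidate for fixing or removing before you retry loading.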