
Subsample table in pepr

Michal Stolarczyk & Nathan Sheffield

2023-11-21

Learn sample subannotations in pepr

This vignette will show you how and why to use the subsample table functionality of the pepr package.

Problem/Goal

The examples below show how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.

Solutions

Example 1: basic sample subannotation table

This example demonstrates how the sample subannotation functionality is used. Two samples (frog_1 and frog_2) have multiple input files that need merging, while one sample (frog_3) does not. The files for frog_1 and frog_2 are therefore listed in the subsample_table.csv file, one row per file, and replace whatever the file column of sample_table.csv holds for them. frog_3 has no rows in subsample_table.csv, so it keeps the value from sample_table.csv (here the placeholder multi).

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/example_results
  • Sample table:
    sample_name protocol file
    frog_1 anySampleType multi
    frog_2 anySampleType multi
    frog_3 anySampleType multi
  • Subsample table:
    sample_name subsample_name file
    frog_1 sub_a data/frog1a_data.txt
    frog_1 sub_b data/frog1b_data.txt
    frog_1 sub_c data/frog1c_data.txt
    frog_2 sub_a data/frog2a_data.txt
    frog_2 sub_b data/frog2b_data.txt
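
The relationship between the two tables can be sketched in base R without pepr. This is a minimal illustration, assuming the table contents listed above; collapse_column is a hypothetical helper written for this sketch, is not part of pepr, and is not meant to mirror how pepr merges the tables internally.

```r
# Sample table: one row per sample; "multi" is the placeholder from the example
sample_table <- data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  file = "multi",
  stringsAsFactors = FALSE
)

# Subsample table: one row per input file
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c(
    "data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt",
    "data/frog2a_data.txt", "data/frog2b_data.txt"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: gather the subsample values of one column per sample.
# A sample without subsample rows keeps its sample table value, or NULL when
# the sample table has no such column.
collapse_column <- function(samples, subsamples, column) {
  lapply(samples$sample_name, function(name) {
    values <- subsamples[[column]][subsamples$sample_name == name]
    if (length(values) > 0) {
      values
    } else if (column %in% names(samples)) {
      samples[[column]][samples$sample_name == name]
    } else {
      NULL
    }
  })
}

merged_files <- collapse_column(sample_table, subsample_table, "file")
merged_names <- collapse_column(sample_table, subsample_table, "subsample_name")
```

In this sketch merged_files[[3]] stays "multi" and merged_names[[3]] is NULL, because frog_3 has no rows in the subsample table.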

Let’s create the Project object and see if multiple files are present:

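library(pepr)
# Branch name of the bundled example_peps directory, inferred from the
# "example_peps-master" path in the loading message below
branch = "master"
projectConfig1 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable1",
  "project_config.yaml",
  package = "pepr"
)
p1 = Project(projectConfig1)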
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable1/project_config.yaml
# Check the files
p1Samples = sampleTable(p1)
p1Samples$file
#> [[1]]
#> [1] "data/frog1a_data.txt" "data/frog1b_data.txt" "data/frog1c_data.txt"
#> 
#> [[2]]
#> [1] "data/frog2a_data.txt" "data/frog2b_data.txt"
#> 
#> [[3]]
#> [1] "multi"
# Check the subsample names
p1Samples$subsample_name
#> [[1]]
#> [1] "sub_a" "sub_b" "sub_c"
#> 
#> [[2]]
#> [1] "sub_a" "sub_b"
#> 
#> [[3]]
#> NULL

Inspect the whole table in the p1@samples slot:

sample_name  protocol       file                                                              subsample_name
frog_1       anySampleType  data/frog1a_data.txt, data/frog1b_data.txt, data/frog1c_data.txt  sub_a, sub_b, sub_c
frog_2       anySampleType  data/frog2a_data.txt, data/frog2b_data.txt                        sub_a, sub_b
frog_3       anySampleType  multi                                                             NULL

You can also access a single subsample by calling the getSubsample method with the appropriate combination of sample_name and subsample_name. Note that this is only possible if the subsample_name column is defined in the subsample_table.csv file.

sampleName = "frog_1"
subsampleName = "sub_a"
getSubsample(p1, sampleName, subsampleName)
#>    sample_name      protocol                 file subsample_name
#> 1:      frog_1 anySampleType data/frog1a_data.txt          sub_a
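
The lookup can be pictured as a filter on the long-format join of the two tables. The sketch below uses base R only and assumes the Example 1 table contents; pick_subsample is a hypothetical helper for illustration and does not reproduce the getSubsample implementation.

```r
samples <- data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  stringsAsFactors = FALSE
)
subsamples <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c(
    "data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt",
    "data/frog2a_data.txt", "data/frog2b_data.txt"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the single row that matches a
# sample / subsample name combination
pick_subsample <- function(samples, subsamples, sample, subsample) {
  if (!"subsample_name" %in% names(subsamples)) {
    stop("the subsample table has no subsample_name column")
  }
  # One row per (sample, subsample) pair, carrying the sample-level columns
  long <- merge(samples, subsamples, by = "sample_name")
  hit <- long[long$sample_name == sample & long$subsample_name == subsample, ,
              drop = FALSE]
  if (nrow(hit) == 0) {
    stop("no such sample / subsample combination")
  }
  hit
}

res <- pick_subsample(samples, subsamples, "frog_1", "sub_a")
```

The explicit check for the subsample_name column reflects the requirement stated above: without that column there is nothing to match the second name against.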

Example 2: subannotations and derived attributes

This example uses a subsample_table.csv file together with derived attributes to point to files. This is a rather complex example. Notice that the file_id column must be included in the sample_table.csv file and left blank; it is then populated from subsample_table.csv for only some of the samples (frog_1 and frog_2) and stays empty for the samples that are not merged.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}_data.txt
  • Sample annotation table:
    sample_name protocol identifier file
    frog_1 anySampleType frog1 local_files
    frog_2 anySampleType frog2 local_files
    frog_3 anySampleType frog3 local_files_unmerged
    frog_4 anySampleType frog4 local_files_unmerged
  • Sample subannotation table:
    sample_name file_id subsample_name
    frog_1 a a
    frog_1 b b
    frog_1 c c
    frog_2 a a
    frog_2 b b
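
The idea behind the two sources can be sketched as plain string substitution in base R. This is an illustration of the template concept only, assuming the source strings and table values listed above; expand_source is a hypothetical helper, not a pepr function, and the output shown later in this example lists only the first path per merged sample, so the sketch should not be read as what pepr returns.

```r
# Hypothetical helper: fill {identifier} and {file_id} in a source template,
# producing one path per file_id value
expand_source <- function(template, identifier, file_ids = "") {
  vapply(file_ids, function(id) {
    path <- gsub("{identifier}", identifier, template, fixed = TRUE)
    gsub("{file_id}", id, path, fixed = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

local_files <- "../data/{identifier}{file_id}_data.txt"
local_files_unmerged <- "../data/{identifier}_data.txt"

# frog_1 has three file_id values in the subannotation table
frog1_paths <- expand_source(local_files, "frog1", c("a", "b", "c"))

# frog_3 is not merged, so its template has no {file_id} placeholder
frog3_paths <- expand_source(local_files_unmerged, "frog3")
```

With these inputs frog1_paths holds the three paths ../data/frog1a_data.txt, ../data/frog1b_data.txt and ../data/frog1c_data.txt, and frog3_paths holds the single path ../data/frog3_data.txt.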

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig2 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable2",
  "project_config.yaml",
  package = "pepr"
)
p2 = Project(projectConfig2)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable2/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 2 rows
#> to replace 1 rows
# Check the files
p2Samples = sampleTable(p2)
p2Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2a_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4_data.txt"

Inspect the whole table in the p2@samples slot:

sample_name  protocol       identifier  file                     file_id  subsample_name
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  a, b, c  a, b, c
frog_2       anySampleType  frog2       ../data/frog2a_data.txt  a, b     a, b
frog_3       anySampleType  frog3       ../data/frog3_data.txt   NULL     NULL
frog_4       anySampleType  frog4       ../data/frog4_data.txt   NULL     NULL

Example 3: subannotations and expansion characters

This example gives the same results as Example 2, but uses a wildcard for frog_2 instead of listing its files in the subsample_table.csv file. Since a wildcard and a subannotation cannot be combined for the same sample, this requires a second data source class (local_files_unmerged) whose path contains an asterisk (*).

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}*_data.txt
  • Sample annotation table:
    sample_name protocol identifier file file_id
    frog_1 anySampleType frog1 local_files NA
    frog_2 anySampleType frog2 local_files_unmerged NA
    frog_3 anySampleType frog3 local_files_unmerged NA
    frog_4 anySampleType frog4 local_files_unmerged NA
  • Sample subannotation table:
    sample_name file_id
    frog_1 a
    frog_1 b
    frog_1 c

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig3 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable3",
  "project_config.yaml",
  package = "pepr"
)
p3 = Project(projectConfig3)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable3/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
# Check the files
p3Samples = sampleTable(p3)
p3Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2*_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3*_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4*_data.txt"

Inspect the whole table in the p3@samples slot:

sample_name  protocol       identifier  file                     file_id
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  a, b, c
frog_2       anySampleType  frog2       ../data/frog2*_data.txt
frog_3       anySampleType  frog3       ../data/frog3*_data.txt
frog_4       anySampleType  frog4       ../data/frog4*_data.txt
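
The file values for frog_2, frog_3 and frog_4 still contain the asterisk. One way to resolve such a pattern from R is base R's Sys.glob, sketched here against a temporary directory. The directory and the files created in it are made up for this illustration, and whether a downstream pipeline expands the wildcard on its own is outside the scope of the sketch.

```r
# Build a throwaway directory with files that follow the example naming scheme
data_dir <- file.path(tempdir(), "pepr_wildcard_sketch")
dir.create(data_dir, showWarnings = FALSE)
created <- file.create(
  file.path(data_dir, c("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt"))
)

# The asterisk matches any run of characters, including none
frog2_files <- Sys.glob(file.path(data_dir, "frog2*_data.txt"))
frog3_files <- Sys.glob(file.path(data_dir, "frog3*_data.txt"))

basename(frog2_files)
basename(frog3_files)
```

Because the asterisk also matches an empty string, the same source works for a sample with several files (frog2a, frog2b) and for a sample with a single file (frog3).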

Example 4: subannotations and multiple (separate-class) inputs

Merging applies to inputs of the same class (for example, multiple files for read1). Inputs of different classes (for example, read1 and read2) are handled by separate attributes (columns). This example shows how to handle paired-end data while also merging the files within each read.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
  • Sample annotation table:
    sample_name protocol
    frog_1 anySampleType
    frog_2 anySampleType
    frog_3 anySampleType
    frog_4 anySampleType
  • Sample subannotation table:
    sample_name read1 read2
    frog_1 frog1a_data.txt frog1a_data2.txt
    frog_1 frog1b_data.txt frog1b_data2.txt
    frog_1 frog1c_data.txt frog1b_data2.txt

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig4 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable4",
  "project_config.yaml",
  package = "pepr"
)
p4 = Project(projectConfig4)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable4/project_config.yaml
# Check the read1 and read2 columns
p4Samples = sampleTable(p4)
p4Samples$read1
#> [[1]]
#> [1] "frog1a_data.txt" "frog1b_data.txt" "frog1c_data.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
p4Samples$read2
#> [[1]]
#> [1] "frog1a_data2.txt" "frog1b_data2.txt" "frog1b_data2.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL

Inspect the whole table in the p4@samples slot:

sample_name  protocol       read1                                              read2
frog_1       anySampleType  frog1a_data.txt, frog1b_data.txt, frog1c_data.txt  frog1a_data2.txt, frog1b_data2.txt, frog1b_data2.txt
frog_2       anySampleType  NULL                                               NULL
frog_3       anySampleType  NULL                                               NULL
frog_4       anySampleType  NULL                                               NULL
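
Since read1 and read2 are merged independently, a consumer of the table usually wants to match them up by position. A minimal base R sketch, assuming list columns shaped like the ones printed above (only the first two samples are reproduced); pair_reads is a hypothetical helper and not part of pepr.

```r
# List columns as printed above: frog_1 has three files per read, frog_2 none
read1 <- list(
  c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt"),
  NULL
)
read2 <- list(
  c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt"),
  NULL
)

# Hypothetical helper: pair the i-th read1 file with the i-th read2 file
pair_reads <- function(r1, r2) {
  if (length(r1) != length(r2)) {
    stop("read1 and read2 must list the same number of files")
  }
  data.frame(
    read1 = as.character(r1),
    read2 = as.character(r2),
    stringsAsFactors = FALSE
  )
}

# One data frame of pairs per sample; samples without files get zero rows
pairs <- Map(pair_reads, read1, read2)
```

The length check guards against a subannotation table where one read column has fewer entries than the other, which would otherwise pair files silently out of step.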
