An easy and quick way to build a workflow is to create two separate files. The first is a table with the commands to run; the second describes how the modules are stitched together. In the rest of this document we refer to them as flow_mat and flow_def respectively. Both files have a jobname column, which is used as an ID to connect them to each other.
## ------ load some example data
library(flowr)
ex = file.path(system.file(package = "flowr"), "pipelines")
flow_mat = as.flowmat(file.path(ex, "sleep_pipe.tsv"))
flow_def = as.flowdef(file.path(ex, "sleep_pipe.def"))
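To get a quick feel for the two tables, they can be inspected directly; this is just a sketch, assuming both objects behave like regular data frames.

```r
# Peek at the first few rows of each table; both behave like data frames
head(flow_mat)
head(flow_def)

# Confirm that the jobname column is present in both, since it links them
intersect(names(flow_mat), names(flow_def))
```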
Each row of the flow definition (flow_def) refers to one step of the pipeline. It describes the resources used by the step and also its relationship with other steps, especially the step immediately prior to it.
It is a tab-separated file, with a minimum of 4 columns:
jobname
: Name of the step.
sub_type
: Short for submission type; refers to how multiple commands of this step should be submitted. Possible values are serial or scatter.
prev_jobs
: Short for previous jobs; the jobname of the previous step. This can be NA/./none if this is an independent/initial step, where no previous step is required for it to start.
dep_type
: Short for dependency type; refers to the relationship of this job with the one defined in prev_jobs. This can take the values none, gather, serial or burst, which are explained in detail below. A minimal sketch of these four columns follows this list.
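For concreteness, here is a minimal sketch of what these four required columns might look like when assembled by hand in R. The step names and values are hypothetical, and a real flow definition would also carry the resource columns described next.

```r
# Hypothetical, hand-built flow definition showing only the four required columns.
# Real flow_def files would also include the resource columns described below.
def <- data.frame(
  jobname   = c("A", "B", "C"),
  sub_type  = c("scatter", "scatter", "serial"),
  prev_jobs = c("none", "A", "B"),
  dep_type  = c("none", "serial", "gather"),
  stringsAsFactors = FALSE
)
def
```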
Apart from the variables described above, several others defining the resource requirements of each step are also available. These give the user a great amount of flexibility in choosing the CPU, wall time, memory and queue for each step (and are passed along to the HPCC platform):
- cpu_reserved
- memory_reserved
- nodes
- walltime
- queue
.. note:: This is especially useful for genomics pipelines, since each step may use a different amount of resources. For example, in a typical setup, if one step uses 16 cores, these would be blocked and left unused while several other steps are processed, resulting in blockage and high cluster load (even when actual CPU usage may be low). Being able to tune the resources of each step makes this setup quite efficient.
Most cluster platforms accept these resource arguments. Essentially, a file like this is used as a template, and variables defined in curly braces (e.g. {{{CPU}}}) are filled in using the flow definition file.
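To illustrate the placeholder idea (this is not flowr's internal code), here is a sketch using the whisker package to fill mustache-style triple-brace variables; the two-line template fragment is hypothetical, and the values mirror a row of the example flow definition below.

```r
library(whisker)  # mustache-style templating; flowr's own rendering may differ

# Hypothetical fragment of a torque-style template with triple-brace placeholders
tmpl <- "#PBS -q {{{QUEUE}}}
#PBS -l nodes={{{NODES}}}:ppn={{{CPU}}},walltime={{{WALLTIME}}}"

# Values such as these would come from one row of the flow definition
cat(whisker.render(tmpl, list(QUEUE = "short", NODES = "1", CPU = "1", WALLTIME = "1:00")))
#> #PBS -q short
#> #PBS -l nodes=1:ppn=1,walltime=1:00
```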
.. warning:: If these resource-requirement columns are not included in the flow_def, their values should be explicitly defined in the submission template.
Here is an example of a typical flow_def file.
jobname | sub_type | prev_jobs | dep_type | queue | memory_reserved | walltime | cpu_reserved | platform | jobid |
---|---|---|---|---|---|---|---|---|---|
sleep | scatter | none | none | short | 2000 | 1:00 | 1 | torque | 1 |
create_tmp | scatter | sleep | serial | short | 2000 | 1:00 | 1 | torque | 2 |
merge | serial | create_tmp | gather | short | 2000 | 1:00 | 1 | torque | 3 |
size | serial | merge | serial | short | 2000 | 1:00 | 1 | torque | 4 |
The flow_mat is also a tab-separated table, with a minimum of three columns as defined below:
samplename
: A grouping column. The table is split using this column, and each subset is treated as an individual flow. This makes it very easy to process multiple samples using a single submission command (see the sketch after the example table below).
jobname
: This corresponds to the name of the step. It should match exactly with the jobname column of the flow_def table defined above.
cmd
: A shell command to run. One can get quite creative here; these could be multiple shell commands separated by ; or && (more on this here). To keep things clean, you may also wrap a multi-line command into a script and simply source that bash script from here.
Here is an example flow_mat.
samplename | jobname | cmd |
---|---|---|
sample1 | sleep | sleep 10 && sleep 2;echo hello |
sample1 | sleep | sleep 11 && sleep 8;echo hello |
sample1 | sleep | sleep 11 && sleep 17;echo hello |
sample1 | create_tmp | head -c 100000 /dev/urandom > sample1_tmp_1 |
sample1 | create_tmp | head -c 100000 /dev/urandom > sample1_tmp_2 |
sample1 | create_tmp | head -c 100000 /dev/urandom > sample1_tmp_3 |
sample1 | merge | cat sample1_tmp_1 sample1_tmp_2 sample1_tmp_3 > sample1_merged |
sample1 | size | du -sh sample1_merged; echo MY shell: $SHELL |
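To illustrate the grouping behaviour of samplename mentioned above (a sketch of the idea only, not flowr's internal code), splitting the table by that column yields one sub-table per sample, each of which is handled as its own flow:

```r
# Split the example flow_mat by its grouping column; each element of the
# resulting list corresponds to one sample and is treated as a separate flow.
mats_by_sample <- split(flow_mat, flow_mat$samplename)
names(mats_by_sample)         # sample names, e.g. "sample1"
lapply(mats_by_sample, nrow)  # number of commands per sample
```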
A --> B --> C --> D
Consider an example with three steps A, B and C. A has 10 commands, A1 through A10; similarly, B has 10 commands, B1 through B10; and C has a single command, C1.
Consider another step D (with commands D1-D3), which comes after C.
Submission types refer to the sub_type column in the flow definition.
scatter
: submit all commands as parallel, independent jobs.
serial
: run these commands sequentially one after the other.
Dependency types refer to the dep_type column in the flow definition.
none
: independent job.
serial
: one-to-one relationship with the previous job.
gather
: many-to-one; wait for all commands in the previous job to finish, then start the current step.
burst
: one-to-many; wait for the previous step (which has a single job), then start processing all commands of the current step.
Using the above submission and dependency types, one can create several types of relationships between earlier and later jobs. Here are a few relationships one may typically use in pipelines.
[scatter] --serial--> [scatter]
A is submitted as scatter, with jobs A1 through A10. Further, B1 requires only A1 to complete, B2 requires A2, and so on; jobs of step B need not wait for all of step A's jobs to complete. Also, B1 through B10 are independent of each other.
To set this up, A and B would both have sub_type scatter, and B would have dep_type serial. Further, since A is an initial step, its dep_type and prev_jobs would be defined as none.
[scatter] --gather--> [serial]
Since C is a single command which requires all steps of B to complete, intuitively it needs to gather the pieces of data generated by B. In this case, C's dep_type would be gather and its sub_type would be serial, since it is a single command.
[serial] --burst--> [scatter]
Further, D is a set of three commands (D1-D3) which need to wait for a single process (C1) to complete. They would be submitted as scatter after waiting on C in a burst type dependency.
In essence, an example flow_def would look as follows (additional resource-requirement columns are not shown for brevity).
ex2def = as.flowdef(file.path(ex, "abcd.def"))
ex2mat = as.flowmat(file.path(ex, "abcd.tsv"))
fobj = suppressMessages(to_flow(x = ex2mat, def = ex2def))
kable(ex2def[, 1:4]) # kable() is from the knitr package
jobname | sub_type | prev_jobs | dep_type |
---|---|---|---|
A | scatter | none | none |
B | scatter | A | serial |
C | serial | B | gather |
D | scatter | C | burst |
plot_flow(fobj)
.. note:: There is a darker, more prominent shadow to indicate scatter steps.
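Once the flow object looks right, it can be submitted; here is a minimal sketch, assuming submit_flow() performs a dry run by default and accepts an execute flag for a real submission (check the package documentation for the exact arguments).

```r
# Dry run: builds the flow structure and scripts, but does not submit any jobs
submit_flow(fobj)

# Actual submission to the cluster (assuming the execute argument works as described)
submit_flow(fobj, execute = TRUE)
```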
The resource requirement columns of flow definition are passed along to the final (cluster) submission script.
The following table provides a mapping between the flow definition columns and variables in the submission template (pipelines below).
flow_def_column | hpc_script_variable |
---|---|
nodes | NODES |
cpu_reserved | CPU |
memory_reserved | MEMORY |
walltime | WALLTIME |
extra_opts | EXTRA_OPTS |
* | JOBNAME |
* | STDOUT |
* | CWD |
* | DEPENDENCY |
* | TRIGGER |
** | CMD |
Support for several popular cluster platforms is built in. There is one template for each platform, and these templates should work out of the box. You may copy and edit them (and save them to ~/flowr/conf) in case some changes are required; templates in that folder (~/flowr/conf) override the defaults.
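For example, here is a minimal sketch of copying a built-in template into ~/flowr/conf for editing; the conf folder location and the torque.sh file name are assumptions, so check the package's installed files for the actual names.

```r
# Locate the templates shipped with the package (folder name is an assumption)
conf_src <- file.path(system.file(package = "flowr"), "conf")
list.files(conf_src)  # inspect which platform templates are available

# Copy one template (file name is an assumption) into ~/flowr/conf and edit it there;
# templates in this folder override the package defaults.
conf_dest <- path.expand("~/flowr/conf")
dir.create(conf_dest, recursive = TRUE, showWarnings = FALSE)
file.copy(file.path(conf_src, "torque.sh"), conf_dest, overwrite = FALSE)
```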
Here are a few details on adding a new platform: github.com/sahilseth/flowr/issues/7
.. note:: My HPCC is not supported; how can I make it work? Take a look at adding platforms, and send a message to sahil.seth [at] me.com
*: These are generated on the fly.
**: This is gathered from flow_mat.