This answer is similar to @Shiping's: it also uses wildcards in the output of a rule to implement multiple parameters per input file. However, this answer provides a more detailed example and avoids complex list comprehensions, regular expressions, and the glob module.
@Pereira Hugo's approach uses one job to run all parameter combinations for one input file, whereas the approach in this answer uses one job per parameter combination per input file, which makes it easier to parallelize the parameter combinations for each input file.
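Because each (sample, param1, param2) combination is a separate job, Snakemake can schedule the combinations concurrently. For example, to allow up to four jobs to run in parallel:

$ snakemake --cores 4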
Snakefile:
import os

data_dir = 'data'
sample_fns = os.listdir(data_dir)
sample_pfxes = list(map(lambda p: p[:p.rfind('.')],
                        sample_fns))

res_dir = 'results'

params1 = [1, 2]
params2 = ['a', 'b', 'c']

rule all:
    input:
        expand(os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv'),
               sample=sample_pfxes, param1=params1, param2=params2)

rule preprocess:
    input:
        csv=os.path.join(data_dir, '{sample}.csv')
    output:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')
    shell:
        "ls {input.csv} && \
        echo P1: {wildcards.param1}, P2: {wildcards.param2} > {output.csv}"
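For intuition: expand() builds the Cartesian product of all the wildcard value lists, so rule all requests 3 samples × 2 values of param1 × 3 values of param2 = 18 target files. Below is a rough plain-Python equivalent of what expand() produces here (same names as in the Snakefile above; the ordering of the list may differ from expand()'s):

import itertools
import os

res_dir = 'results'
sample_pfxes = ['sample_1', 'sample_2', 'sample_3']
params1 = [1, 2]
params2 = ['a', 'b', 'c']

# One target path per (sample, param1, param2) combination,
# mirroring the filename pattern used by rule 'all'.
targets = [os.path.join(res_dir, '{}_p1_{}_p2_{}.csv'.format(s, p1, p2))
           for s, p1, p2 in itertools.product(sample_pfxes, params1, params2)]
print(len(targets))  # 18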
Directory structure before running snakemake:
$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results
Run snakemake:
$ snakemake -p
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	18	preprocess
	19
rule preprocess:
input: data/sample_1.csv
output: results/sample_1_p1_2_p2_a.csv
jobid: 1
wildcards: param2=a, sample=sample_1, param1=2
ls data/sample_1.csv && echo P1: 2, P2: a > results/sample_1_p1_2_p2_a.csv
data/sample_1.csv
Finished job 1.
1 of 19 steps (5%) done
rule preprocess:
input: data/sample_2.csv
output: results/sample_2_p1_2_p2_a.csv
jobid: 2
wildcards: param2=a, sample=sample_2, param1=2
ls data/sample_2.csv && echo P1: 2, P2: a > results/sample_2_p1_2_p2_a.csv
data/sample_2.csv
Finished job 2.
2 of 19 steps (11%) done
...
localrule all:
input: results/sample_1_p1_1_p2_a.csv, results/sample_1_p1_2_p2_a.csv, results/sample_2_p1_1_p2_a.csv, results/sample_2_p1_2_p2_a.csv, results/sample_3_p1_1_p2_a.csv, results/sample_3_p1_2_p2_a.csv, results/sample_1_p1_1_p2_b.csv, results/sample_1_p1_2_p2_b.csv, results/sample_2_p1_1_p2_b.csv, results/sample_2_p1_2_p2_b.csv, results/sample_3_p1_1_p2_b.csv, results/sample_3_p1_2_p2_b.csv, results/sample_1_p1_1_p2_c.csv, results/sample_1_p1_2_p2_c.csv, results/sample_2_p1_1_p2_c.csv, results/sample_2_p1_2_p2_c.csv, results/sample_3_p1_1_p2_c.csv, results/sample_3_p1_2_p2_c.csv
jobid: 0
Finished job 0.
19 of 19 steps (100%) done
Directory structure after running snakemake:
$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results
    ├── sample_1_p1_1_p2_a.csv
    ├── sample_1_p1_1_p2_b.csv
    ├── sample_1_p1_1_p2_c.csv
    ├── sample_1_p1_2_p2_a.csv
    ├── sample_1_p1_2_p2_b.csv
    ├── sample_1_p1_2_p2_c.csv
    ├── sample_2_p1_1_p2_a.csv
    ├── sample_2_p1_1_p2_b.csv
    ├── sample_2_p1_1_p2_c.csv
    ├── sample_2_p1_2_p2_a.csv
    ├── sample_2_p1_2_p2_b.csv
    ├── sample_2_p1_2_p2_c.csv
    ├── sample_3_p1_1_p2_a.csv
    ├── sample_3_p1_1_p2_b.csv
    ├── sample_3_p1_1_p2_c.csv
    ├── sample_3_p1_2_p2_a.csv
    ├── sample_3_p1_2_p2_b.csv
    └── sample_3_p1_2_p2_c.csv
Sample result:
$ cat results/sample_2_p1_1_p2_a.csv
P1: 1, P2: a
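As a side note, the same wildcards are also available inside a Python run: block, in case the per-combination processing is easier to express in Python than in shell. A minimal sketch that writes the same content as the echo above (replacing the shell: section of rule preprocess):

rule preprocess:
    input:
        csv=os.path.join(data_dir, '{sample}.csv')
    output:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')
    run:
        # 'wildcards', 'input' and 'output' are available as objects here
        with open(output.csv, 'w') as out:
            out.write('P1: {}, P2: {}\n'.format(wildcards.param1,
                                                wildcards.param2))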