3
votes

I am just getting started with Snakemake and was wondering what the "correct" way is to run a set of parameters on the same file, and how this would work when chaining rules.

For example, say I want to apply multiple normalization methods, followed by a clustering rule with a varying number of clusters k. What would be the best way to do this so that all combinations are run?

I started doing this:

INFILES = ["mytable"]

rule preprocess:
    input:
        bam=expand("data/{sample}.csv", sample=INFILES, param=config["normmethod"])
    output:
        bamo=expand("results/{sample}_pp_{param}.csv", sample=INFILES, param=config["normmethod"])
    script:
        "scripts/preprocess.py"

And then invoked the script via:

snakemake --config normmethod=Median

But that doesn't really scale to further options later in the workflow. For example, how would I incorporate this set of options automatically?

normmethods= ["Median", "Quantile"]
kclusters= [1,3,5,7,10]
Comments:

The expand in your input contains param, which does not appear in the string that has to be expanded. How does it behave? (bli)

The code above runs through. If I do 'snakemake --config normmethod=Median', the "Median" method is used. If I run the workflow with 'snakemake --config normmethod=Mean', the mean is used. Accordingly, the output files carry the normmethod param in their filename. (Kam Sen)
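
For reference, expand() can be called directly to see how it behaves when a keyword is not present in the pattern. A minimal sketch, assuming an environment where snakemake is installed (expand lives in snakemake.io):

from snakemake.io import expand

# The unused keyword still takes part in the combination product; the pattern
# is simply formatted without it, so the same path shows up once per value.
print(expand("data/{sample}.csv", sample=["mytable"], param=["Median", "Quantile"]))
# -> ['data/mytable.csv', 'data/mytable.csv']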

3 Answers

8
votes

You did well to use the expand() function in your rule.

For the parameters, I recommend using a configuration file containing all of them. Snakemake works with both YAML and JSON configuration files; the Snakemake documentation has all the information about these two formats.

In your case, you just have to write this in a YAML file:

INFILES: ["mytable"]

normmethods: ["Median", "Quantile"]
or
normmethods:
  - "Median"
  - "Quantile"

kclusters: [1, 3, 5, 7, 10]
or
kclusters:
  - 1
  - 3
  - 5
  - 7
  - 10

Write your rule like this:

rule preprocess:
    input:
        bam = expand("data/{sample}.csv",
                     sample = config["INFILES"])

    params:
        kcluster = config["kclusters"]

    output:
        bamo = expand("results/{sample}_pp_{method}_{cluster}.csv",
                      sample = config["INFILES"],
                      method = config["normmethods"],
                      cluster = config["kclusters"])

    script:
        "scripts/preprocess.py"

Then you just have to launch it like this:

snakemake --configfile  path/to/config.yml
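
(As an alternative to the --configfile flag, the Snakefile itself can declare the configuration file with a configfile: "path/to/config.yml" line at the top; both ways fill the same config dictionary.)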

To run with other parameters, you only have to modify your configuration file and not your Snakefile (fewer chances for mistakes), which is also better for readability and keeps the code clean.

EDIT:

Just to correct my own mistake: you don't need expand() on the input here, since you want to run the rule on one .csv file at a time. Just put the wildcard in the input and Snakemake will do its part:

rule preprocess:
    input:
        bam = "data/{sample}.csv"
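
Since the rule uses the script: directive, the script does not receive command-line arguments; instead, Snakemake injects a snakemake object into it. A minimal, hypothetical sketch of scripts/preprocess.py (the actual preprocessing logic is not given in the question, and pandas is only an assumption here):

# scripts/preprocess.py
# Sketch only: the `snakemake` object is injected by the script: directive,
# so input, output and params are available without any argument parsing.
import pandas as pd  # assumption: the tables are handled with pandas

infile = snakemake.input.bam           # e.g. "data/mytable.csv"
kclusters = snakemake.params.kcluster  # e.g. [1, 3, 5, 7, 10]

df = pd.read_csv(infile)
# ... normalization / preprocessing would go here ...
# bamo is a list here because the rule's output uses expand()
for outfile in snakemake.output.bamo:
    df.to_csv(outfile, index=False)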

5
votes

Seems you didn't pass the params to your script. How about something like the following?

import re
import os
import glob
normmethods= ["Median", "Quantile"] # can be set from config['normmethods']    
kclusters= [1,3,5,7,10]             # can be set from config['kclusters']
INFILES = ['results/' + re.sub(r'\.csv$', '_pp_' + m + '-' + str(k) + '.csv',
                               re.sub(r'^data/', '', file))
           for file in glob.glob("data/*.csv")
           for m in normmethods
           for k in kclusters]

rule cluster:
    input: INFILES

rule preprocess:
    input:
        bam="data/{sample}.csv"
    output:
        bamo="results/{sample}_pp_{m}-{k}.csv"
    run:
        os.system("scripts/preprocess.py %s %s %s %s" % (input.bam, output.bamo, wildcards.m, wildcards.k))
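
As a small aside (not part of the original answer): inside a run block, Snakemake also provides its shell() helper, which formats {input}, {output} and {wildcards} just like a shell directive would, so the os.system call could equivalently be written as:

    run:
        shell("scripts/preprocess.py {input.bam} {output.bamo} {wildcards.m} {wildcards.k}")
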
1
vote

This answer is similar to @Shiping's answer in that it uses wildcards in the output of a rule to implement multiple parameters per input file. However, this answer provides a more detailed example and avoids using a complex list comprehension, regular expressions, or the glob module.

@Pereira Hugo's approach uses one job to run all parameter combinations for one input file, whereas the approach in this answer uses one job to run one parameter combination for one input file, which makes it easier to parallelize the execution of each parameter combination on one input file.
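
For example, invoking Snakemake with several cores (snakemake --cores 4, or the short form -j 4) allows up to four of these single-combination preprocess jobs to run at the same time.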

Snakefile:

import os

data_dir = 'data'
sample_fns = os.listdir(data_dir)
sample_pfxes = list(map(lambda p: p[:p.rfind('.')],
                        sample_fns))

res_dir = 'results'

params1 = [1, 2]
params2 = ['a', 'b', 'c']

rule all:
    input:
        expand(os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv'),
               sample=sample_pfxes, param1=params1, param2=params2)

rule preprocess:
    input:
        csv=os.path.join(data_dir, '{sample}.csv')

    output:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')

    shell:
        "ls {input.csv} && \
           echo P1: {wildcards.param1}, P2: {wildcards.param2} > {output.csv}"

Directory structure before running snakemake:

$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results

Run snakemake:

$ snakemake -p
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   all
    18  preprocess
    19

rule preprocess:
    input: data/sample_1.csv
    output: results/sample_1_p1_2_p2_a.csv
    jobid: 1
    wildcards: param2=a, sample=sample_1, param1=2

ls data/sample_1.csv &&          echo P1: 2, P2: a > results/sample_1_p1_2_p2_a.csv
data/sample_1.csv
Finished job 1.
1 of 19 steps (5%) done

rule preprocess:
    input: data/sample_2.csv
    output: results/sample_2_p1_2_p2_a.csv
    jobid: 2
    wildcards: param2=a, sample=sample_2, param1=2

ls data/sample_2.csv &&          echo P1: 2, P2: a > results/sample_2_p1_2_p2_a.csv
data/sample_2.csv
Finished job 2.
2 of 19 steps (11%) done

...

localrule all:
    input: results/sample_1_p1_1_p2_a.csv, results/sample_1_p1_2_p2_a.csv, results/sample_2_p1_1_p2_a.csv, results/sample_2_p1_2_p2_a.csv, results/sample_3_p1_1_p2_a.csv, results/sample_3_p1_2_p2_a.csv, results/sample_1_p1_1_p2_b.csv, results/sample_1_p1_2_p2_b.csv, results/sample_2_p1_1_p2_b.csv, results/sample_2_p1_2_p2_b.csv, results/sample_3_p1_1_p2_b.csv, results/sample_3_p1_2_p2_b.csv, results/sample_1_p1_1_p2_c.csv, results/sample_1_p1_2_p2_c.csv, results/sample_2_p1_1_p2_c.csv, results/sample_2_p1_2_p2_c.csv, results/sample_3_p1_1_p2_c.csv, results/sample_3_p1_2_p2_c.csv
    jobid: 0

Finished job 0.
19 of 19 steps (100%) done

Directory structure after running snakemake:

$ tree .
.
├── Snakefile
├── data
│   ├── sample_1.csv
│   ├── sample_2.csv
│   └── sample_3.csv
└── results
    ├── sample_1_p1_1_p2_a.csv
    ├── sample_1_p1_1_p2_b.csv
    ├── sample_1_p1_1_p2_c.csv
    ├── sample_1_p1_2_p2_a.csv
    ├── sample_1_p1_2_p2_b.csv
    ├── sample_1_p1_2_p2_c.csv
    ├── sample_2_p1_1_p2_a.csv
    ├── sample_2_p1_1_p2_b.csv
    ├── sample_2_p1_1_p2_c.csv
    ├── sample_2_p1_2_p2_a.csv
    ├── sample_2_p1_2_p2_b.csv
    ├── sample_2_p1_2_p2_c.csv
    ├── sample_3_p1_1_p2_a.csv
    ├── sample_3_p1_1_p2_b.csv
    ├── sample_3_p1_1_p2_c.csv
    ├── sample_3_p1_2_p2_a.csv
    ├── sample_3_p1_2_p2_b.csv
    └── sample_3_p1_2_p2_c.csv

Sample result:

$ cat results/sample_2_p1_1_p2_a.csv
P1: 1, P2: a