0
votes

New to snakemake and I've been trying to transform my shell script based pipeline into snakemake based today and run into a lot of syntax issues.. I think most of the trouble I have is around getting all the files in a particular directories and infer output names from input names since that's how I use shell script (for loop), in particular, I tried to use expand function in the output section and it always gave me an error.

After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?

Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?

log:    
    expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])

where in the config.yaml

run:
    "run123"

Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?

log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
1
Hi olala, I feel your question doesn't really stand out of text. Can you please highlight your question in a one or two sentences in the end ?Vasif
@Vasif thanks for the comment. I just edited it.olala

1 Answers

4
votes

Here is an example of small workflow:

# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"

I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".

As for the last part, the right way to put it would be the first one:

log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

But again, snakemake can deduce it from your target (the rule all, in my example).

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"

Hope this helps!