6
votes

I recently decided to start with Snakemake. I can't find anything that fits my needs either on Stack Overflow or in the Snakemake docs. I feel like I don't understand something and I may need some explanations.

I am trying to make a simple Snakemake workflow that takes as input a fastq file and a sequencing-summary file (which contains info about the reads) and filters the reads within the fastq into several files (low.fastq and high.fastq).

My input data and the Snakefile I'm trying to execute are stored like this:

.
├── data
│   ├── sequencing-summary-example.txt 
│   └── tiny-example.fastq 
├── Snakefile
└── split_fastq

And this is what I've tried so far:

# imports
rule targets:
    input:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"
    run:
        # do the thing

I expected to end up with a directory "split_fastq" containing a "low" and a "high" fastq. But instead I got this error:

Building DAG of jobs...
WildcardError in line 10 of /work/sbsuser/test/roxane/alignement-ont/Snakefile:
Wildcards in input files cannot be determined from output files:
'reads'

Even though it seems to be a very common error, I'm not sure whether I don't understand how to use wildcards or whether there is another problem. Am I using "input" and "output" correctly?

I answered a very similar question recently: stackoverflow.com/questions/57327210/… – Maarten-vd-Sande
Yes, I have been reading this thread, but I was puzzled by the fact that you had to hard-code the name of your file into the workflow. I want to use this workflow for any fastq; I didn't want to write a path or a file name, I would like something dynamic, a bit like an -f myFastq option for example. But maybe Snakemake isn't meant for that and is more "specific dataset oriented"? Maybe that's what I am not understanding correctly? – Roxane
There are (at least) two options. The first is that the output has the same wildcards as the input. Then you can run something like snakemake split_fastq/low-example.fastq. The other is that you use the globbing approach in the link, which will then find the relevant samples and decide which output needs to be generated. – Maarten-vd-Sande
Alright, so applied to my example, I would write in the output section "split_fastq/{reads}-low.fastq", "split_fastq/{reads}-high.fastq", is that correct? – Roxane
Kinda, you also need a wildcard for the summary.txt, and each output needs both wildcards (reads and seqsum). Take another look at the docs for how wildcards work. I unfortunately do not have time to type out an answer. – Maarten-vd-Sande

1 Answer

6
votes

The problem is that you have wildcards in the input, but not in the output. Any wildcard used in the input must also appear in the output. Think about it this way: by putting the wildcard in the input, you're creating a rule that you intend to run individually on many different fastq files. But the output files for that rule would be exactly the same files for each of those different fastq files. They'd overwrite each other! You want to incorporate the wildcard into your output files so you get a unique output for each possible input, for example:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        # do the thing

Now with tiny-example.fastq as your input, you'll get tiny-example.low.fastq and tiny-example.high.fastq as output. And if you add a second fastq file, you'll get different high and low output files for that one. But this rule still won't work, because the "seqsum" wildcard is also not part of the output. What you'll probably want to do in this case is have the sequencing-summary file incorporate the name of the fastq file, for example call it sequencing-summary-tiny-example.txt. Now you can make your rule like this:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/sequence-summary-{reads}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        # do the thing

And now if you add an other-example.fastq and a sequencing-summary-other-example.txt, your Snakemake pipeline should be able to create other-example.low.fastq and other-example.high.fastq.
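For example (assuming those files exist under data/), you can ask Snakemake for one of the outputs directly on the command line, and it will work out that it needs other-example.fastq and sequencing-summary-other-example.txt to build it:

    snakemake split_fastq/other-example.low.fastq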

Snakemake always thinks backwards from how we tend to think. We first think about the input, and then about what output it creates. But Snakemake starts from the file it needs to make and tries to figure out what input it needs to make it. So in your original rule, it knew it needed to make low.fastq, and it saw that the split_fastq rule could make that, but then it couldn't determine what the wildcard "reads" in the input should be. Now, in the new rule, it knows it needs to make tiny-example.low.fastq and sees that split_fastq can create output files matching the template {reads}.low.fastq, so it says "Hey, if I set reads = tiny-example, then I can use this rule!" Then it looks at the input and says "OK, for input I need {reads}.fastq, and since reads = tiny-example that means I need tiny-example.fastq, and I have that!"
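If you also want the workflow to pick up any fastq that appears in data/ automatically (the "dynamic" behaviour you asked about in the comments), the usual approach is to glob the data directory in your targets rule. Here's a minimal sketch, assuming every data/<name>.fastq has a matching data/sequencing-summary-<name>.txt:

    # collect the <name> part of every data/<name>.fastq
    READS, = glob_wildcards("data/{reads}.fastq")

    rule targets:
        input:
            # request low/high outputs for every fastq found above
            expand("split_fastq/{reads}.low.fastq", reads=READS),
            expand("split_fastq/{reads}.high.fastq", reads=READS)

Running plain snakemake then requests all of those outputs, and the split_fastq rule above fills in the reads wildcard for each one.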