Specify input and output files in Snakefile

Question

I'm new to Snakemake and I want to make a pipeline that takes a given input text file and concatenates its content to a given output file. However, I want to be able to specify the names of both the input and output files at run time, so that neither file name is hardcoded in the Snakefile. Right now all I can come up with is:

rule all:
        input:
                "{input}.txt",
                "{output}.txt"

rule output_files:
        input:
                "{input}.txt"
        output:
                "{output}.txt"
        shell:
                "cat {input}.txt > {output}.txt"

I tried running this with "snakemake input1.txt output.txt" but I got the error:

Building DAG of jobs...
WildcardError in line 6 of Snakefile:
Wildcards in input files cannot be determined from output files:
'input'

Any suggestions would be greatly appreciated.

Dmitry Kuzminov · Accepted Answer · 2020-06-26T04:36:42

In your example you actually copy a single input file to an output file with the cat shell command. If the intention is to concatenate several inputs into one output, the rule could look like this:

rule concatenate:
    input:
        "input1.txt",
        "input2.txt"
    output:
        "output.txt"
    shell:
        "cat {input} > {output}"

Another way to read the question ("takes a given input text file and concatenates its content to a given output file") is that you want to append an input file to the end of an existing output file. That is more challenging: Snakemake "thinks" in terms of goals, where each goal is a distinct file. How would Snakemake know whether the output file is still the raw one or already the concatenated version? One way is to use a "flag" file: its presence means that the goal has been achieved and no concatenation is needed. One more problem: Snakemake removes a rule's output files before running it. That means you need to declare the file you append to as an input:

rule append:
    input:
        # "in" is a reserved Python keyword, so the inputs are named src and dest
        src = "input.txt",
        dest = "output.txt"
    output:
        flag = "flag"
    shell:
        "cat {input.src} >> {input.dest} && touch {output.flag}"

Now back to your question about the error and how to specify the filenames at run time. You get this error because every wildcard must be fully inferable from the output section, and both of your rules are ill-formed. Let's start with the rule all.

You need to tell Snakemake what goal you are building. There must be no wildcards in its input; everything should be fully resolved:

def getInput():
    # form the actual goal (you may query a database or a service, hardcode it, etc.)
    return []  # placeholder: no targets yet

rule all:
    input: getInput()

Let's say you decided that the goal should be 3 files: ["output1.txt", "output2.txt", "output3.txt"]:

def getInput():
    magic_numbers_from_oracle = ["1", "2", "3"]
    return magic_numbers_from_oracle

rule all:
    input: expand("output{number}.txt", number=getInput())
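For intuition, here is a plain-Python sketch of what that expand call evaluates to. The helper expand_sketch is hypothetical and only mimics the single-wildcard case; Snakemake's real expand does more than this.

```python
def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for Snakemake's expand(), single-wildcard case only."""
    # Exactly one keyword is expected, e.g. number=["1", "2", "3"].
    (name, values), = wildcards.items()
    # Fill the pattern once per value.
    return [pattern.format(**{name: value}) for value in values]


targets = expand_sketch("output{number}.txt", number=["1", "2", "3"])
print(targets)
```

The resulting list contains only concrete filenames, which is what the rule all needs: no wildcard is left unresolved.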

Ok, now Snakemake knows the goal. The next step is to write a rule that says how to create a single output{number}.txt file. For simplicity I'm taking your initial approach with cat/copying:

rule cat_copy:
    input:
        "input{n}.txt"
    output:
        "output{n}.txt"
    shell:
        "cat {input} > {output}"

That's it. As long as you have files input1.txt, input2.txt, input3.txt you would get the corresponding outputs.
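To see why this rule works while the original one raised a WildcardError, it may help to sketch the matching step in plain Python. This is only an illustration, not Snakemake's actual implementation: the helper infer_wildcards is hypothetical, and it assumes every wildcard matches the regex ".+".

```python
import re


def infer_wildcards(output_pattern, target):
    """Hypothetical helper: match a requested file against an output pattern."""
    # Splitting on "{name}" puts the wildcard names at the odd positions.
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Wildcards become named groups; literal text is escaped.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


# The requested target fixes the value of n ...
wildcards = infer_wildcards("output{n}.txt", "output2.txt")
# ... and the same value is then substituted into the input pattern.
input_file = "input{n}.txt".format(**wildcards)
print(wildcards, input_file)
```

With the rule from the question, the output pattern "{output}.txt" only yields a value for the wildcard output, so nothing is available to fill "{input}" in the input pattern. That is the situation the WildcardError reports.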
