I want to use Snakemake to run a single instance of a first rule that takes one input file and creates multiple output files. I then want to use each of those output files as input for a second rule. The first rule should run only once, since a single run is enough to create all of its outputs and any further runs are unnecessary duplication.

Here's an oversimplified example:

Say I have an input file, samplenames.txt containing the following:

sample1
sample2

I want to read the names from this file and create a file with that name for each one. I then want to make a copy of each, giving the following final output files:

sample1_copy
sample2_copy

My Snakefile contains the following:

SAMPLES = [1,2]

rule all:
    input:
        expand(
            "sample{sample}_copy",
            sample=SAMPLES
        )

rule fetch_filenames:
    input:
        "samplenames.txt"
    output:
        "sample{sample}"
    shell:
        "while IFS= read -r line; do touch $line; done < {input}"

rule copy_files:
    input:
        expand(
            "sample{sample}", 
            sample=SAMPLES
        )
    output:
        expand(
            "sample{sample}_copy", 
            sample=SAMPLES
        )
    shell:
        "touch {output}"

This does the job, but two instances of the first rule are run when only one is needed. When I apply this to many more files in a more complex workflow, it results in many unnecessary instances. Is there a way of running only one instance of the first rule?

I have tried the following for the first rule:

rule fetch_filenames:
    input:
        "samplenames.txt"
    output:
        "sample1"
    shell:
        "while IFS= read -r line; do touch $line; done < {input}"

However this results in the following error: "Missing input files for rule copy_files: sample2"

I am sad. Any help would make me very happy.

1 Answer

If you want fetch_filenames to produce all of the output files in a single execution, you should list every required output file in the output directive, so that the rule contains no unresolved wildcard. E.g.:

rule fetch_filenames:
    input:
        "samplenames.txt"
    output:
        expand("sample{sample}", sample=SAMPLES),
    shell:
        ...

Conversely, if you want copy_files to be executed once for each input/output pair, remove the expand functions so that the {sample} wildcard is resolved separately for each requested file:

rule copy_files:
    input:
        "sample{sample}",
    output:
        "sample{sample}_copy",
    shell:
        ...
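
Putting both suggestions together, a complete Snakefile might look like the sketch below. It is not tested against your real workflow and makes a few assumptions: that samplenames.txt lists exactly the names sample1 and sample2 (matching SAMPLES), that quoting "$line" in the loop is acceptable, and that cp is what you want in copy_files in place of the touch from your question.

```snakemake
SAMPLES = [1, 2]

rule all:
    input:
        expand("sample{sample}_copy", sample=SAMPLES)

# Every file this rule creates is declared in output and no wildcard
# is left unresolved, so Snakemake schedules it as a single job.
rule fetch_filenames:
    input:
        "samplenames.txt"
    output:
        expand("sample{sample}", sample=SAMPLES)
    shell:
        # Assumes the names in samplenames.txt match the outputs above.
        'while IFS= read -r line; do touch "$line"; done < {input}'

# The {sample} wildcard is kept here, so one job runs per requested file.
rule copy_files:
    input:
        "sample{sample}"
    output:
        "sample{sample}_copy"
    shell:
        # Assumption: cp replaces the touch used in the question.
        "cp {input} {output}"
```

With this layout, a dry run (snakemake -n) should list one fetch_filenames job and one copy_files job per sample.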