1
votes

Inexperienced, self-taught "coder" here, so please be understanding :]

I am trying to learn and use Snakemake to construct a pipeline for my analysis. Unfortunately, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use that option. I looked for an answer for hours, but either there is none, or I am not knowledgeable enough to understand it. So: is there a way to run multiple instances of a single job/rule simultaneously?

If you would like a concrete example:

Let's say I want to analyze a set of 4 .fastq.gz files using the fastqc tool. So I input a command:

time snakemake -j 32

and thus run my code, which is:

SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule Raw_Fastqc:
    input:
            expand("{x}.fastq.gz", x=SAMPLES)
    output:
            expand("./{x}_fastqc.zip", x=SAMPLES),
            expand("./{x}_fastqc.html", x=SAMPLES)
    shell:
            "fastqc {input}"

I would expect Snakemake to run as many instances of fastqc as possible on 32 threads (so easily all 4 of my input files at once). In reality, this command takes about 12 minutes to finish. Meanwhile, using GNU parallel from inside Snakemake

shell:
    "parallel fastqc ::: {input}"

I get results in 3 minutes. Clearly there is some untapped potential here.

Thanks!

2
Similar issue here: stackoverflow.com/q/50828233/1878788. This seems to be a common pitfall. - bli
Yes, I saw that topic, but I incorrectly thought my problem was different because I do not use a computing cluster. Hence the duplicate question. Cheers! - AdrianS85

2 Answers

3
votes

If I am not wrong, fastqc works on each fastq file separately, so your implementation doesn't take advantage of Snakemake's parallelization. You can fix this by listing the per-sample targets in a rule all, as shown below.

# glob_wildcards already strips the ".fastq.gz" suffix, so the sample
# names can be unpacked directly from its result
SAMPLES, = glob_wildcards("{x}.fastq.gz")

# rule all lists every expected output file; Snakemake then spawns one
# Raw_Fastqc job per sample and schedules them in parallel up to the -j limit
rule all:
    input:
        expand("./{sample_name}_fastqc.{ext}",
               sample_name=SAMPLES, ext=['zip', 'html'])

# the rule itself now handles a single sample via the {x} wildcard
rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    shell:
        "fastqc {input}"
1
votes

To add to JeeYem's answer above, you can also define the number of threads to reserve for each job using the 'threads' directive of a rule, like so:

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    threads: 4
    shell:
        "fastqc --threads {threads} {input}"

Because fastqc itself can use multiple threads per task, you might even get additional speedups over the parallel implementation.

Snakemake will then automatically run as many jobs concurrently as fit within the total number of cores given by the top-level -j option:

snakemake -j 32, for example, would execute up to 8 instances of the Raw_Fastqc rule at a time (32 cores / 4 threads per job).
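As a rough illustration of that arithmetic (assuming the 4 samples and the threads: 4 setting from above):

# 32 cores / 4 threads = up to 8 slots, so all 4 fastqc jobs start immediately
snakemake -j 32

# only 2 slots of 4 threads, so 2 jobs run at a time and the rest wait for free cores
snakemake -j 8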