1
votes

Inexperienced, self-taught "coder" here, so please be understanding :]

I am trying to learn and use Snakemake to construct a pipeline for my analysis. Unfortunately, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use that option. I looked for an answer for hours, but either there is none, or I am not knowledgeable enough to understand it. So: is there a way to run multiple instances of a single job/rule simultaneously?

If you would like a concrete example:

Let's say I want to analyze a set of 4 .fastq.gz files using the fastqc tool. So I input a command:

time snakemake -j 32

and thus run my code, which is:

SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule Raw_Fastqc:
    input:
            expand("{x}.fastq.gz", x=SAMPLES)
    output:
            expand("./{x}_fastqc.zip", x=SAMPLES),
            expand("./{x}_fastqc.html", x=SAMPLES)
    shell:
            "fastqc {input}"

I would expect Snakemake to run as many instances of fastqc as possible on 32 threads (so easily all 4 of my input files at once). In reality, this command takes about 12 minutes to finish. Meanwhile, using GNU parallel from inside Snakemake

shell:
    "parallel fastqc ::: {input}"

I get results in 3 minutes. Clearly there is some untapped potential here.

Thanks!

2
Similar issue here: stackoverflow.com/q/50828233/1878788. This seems to be a common pitfall. - bli
Yes, I saw that topic, but I incorrectly thought my problem was different because I do not use a computing cluster. Hence the duplicate question. Cheers! - AdrianS85

2 Answers

3
votes

If I am not wrong, fastqc works on each fastq file separately, so your implementation doesn't take advantage of Snakemake's parallelization. You can fix this by listing the per-sample targets in a rule all, as shown below.

# glob_wildcards already strips the ".fastq.gz" suffix, so the sample
# names can be unpacked directly from its result
SAMPLES, = glob_wildcards("{x}.fastq.gz")

# rule all lists every expected output file; Snakemake then spawns one
# Raw_Fastqc job per sample and schedules them in parallel up to the -j limit
rule all:
    input:
        expand("./{sample_name}_fastqc.{ext}",
               sample_name=SAMPLES, ext=['zip', 'html'])

# the rule itself now handles a single sample via the {x} wildcard
rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    shell:
        "fastqc {input}"
1
votes

To add to JeeYem's answer above, you can also define the number of threads to reserve for each job using the 'threads' directive of a rule, like so:

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    threads: 4
    shell:
        "fastqc --threads {threads} {input}"

Because fastqc itself can use multiple threads per task, you might even get additional speedups over the parallel implementation.

Snakemake will then automatically run as many jobs concurrently as fit within the total number of cores given by the top-level -j option:

snakemake -j 32, for example, would execute up to 8 instances of the Raw_Fastqc rule at a time (32 cores / 4 threads per job).
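As a rough illustration of that arithmetic (assuming the 4 samples and the threads: 4 setting from above):

# 32 cores / 4 threads = up to 8 slots, so all 4 fastqc jobs start immediately
snakemake -j 32

# only 2 slots of 4 threads, so 2 jobs run at a time and the rest wait for free cores
snakemake -j 8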