Snakemake show number of files per job

Question

I'm relatively new to Snakemake, and I'm having some trouble figuring out how to count the number of jobs per rule. The Snakefile I am using is below:

rule test:
    input:
        files = expand("{file}", file=glob.glob("/home/MyData/input/*.csv"))
    output:
        out = expand("{file}", file=glob.glob("/home/MyData/output/*.csv"))
    run:
        with open(output.out, 'r') as input_stream:
            for file in input_stream:
                print(file)

The job counts show the following (when run with snakemake -j 4 test -n):

Job counts:
    count   jobs
    1   test
    1

However, in a Snakemake tutorial I found online, the Snakefile looks like this:

configfile: "config.yaml"

rule all:
    input:
        "plots/quals.svg",
        "calls/all.vcf",
        "mapped/",
        "mapped/"

rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        pipe("mapped/{sample}.bam")
    conda:
        "envs/mapping.yaml"
    shell:
        "bwa mem {input} | samtools view -Sb > {output}"

rule sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    conda:
        "envs/mapping.yaml"
    shell:
        "samtools sort -o {output} {input}"

rule call:
    input:
        genome="data/genome.fa",
        bam=expand("mapped/{sample}.sorted.bam", sample=config["samples"])
    output:
        "calls/all.vcf"
    conda:
        "envs/calling.yaml"
    shell:
        "samtools mpileup -g -f {input.genome} {input.bam} | "
        "bcftools call -mv - > {output}"

rule plot_qual:
    input:
        "calls/all.vcf"
    output:
        svg=report("plots/quals.svg", caption="report/plot-quals.rst")
    conda:
        "envs/stats.yaml"
    script:
        "scripts/plot-quals.py"

And the job counts look like this (when run with snakemake -j 4 all -n):

Job counts:
    count   jobs
    1   all
    1   call
    3   map_reads
    1   plot_qual
    3   sort
    9

With the config.yaml file looking like:

samples:
  - A
  - B
  - C

How can I get my Job counts to show the number of input files run per rule?

One job corresponds to one rule instance. Here, your test rule has only one "instance" because it deals with multiple files by itself. If you want several "instances" of a rule to happen, you should make a rule that deals with one file, and another one that wants as input multiple files. Then snakemake will figure out that it needs to run the first one multiple times in order to produce the input that the other wants, and make as many jobs. — bli
I'm surprised that open(output.out, 'r') works, given that output.out is generated by expand and is therefore a list. I wrote some explanations about expand here: stackoverflow.com/a/50216057/1878788 You may also be interested in other examples here stackoverflow.com/a/50837428/1878788 and here stackoverflow.com/a/44945591/1878788 — bli
@bli Ahh this makes much more sense, I hadn't realized that instances of rules were being created. Thank you for this! I will check out the expand examples as well. — user12906021

Dmitry Kuzminov · Accepted Answer · 2020-06-22T18:40:19

What Snakemake shows in the Job counts table is the number of times each rule is invoked. In your case there is a single rule, test, and it is invoked once (so the total number of jobs is 1):

Job counts:
    count   jobs
    1   test
    1

In the other example you have 9 rule invocations in total, with the following breakdown:

Job counts:
    count   jobs
    1   all
    1   call
    3   map_reads
    1   plot_qual
    3   sort
    9
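
The arithmetic behind that table can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not how Snakemake computes it internally: the split into per-sample rules and single-instance rules is read off the tutorial Snakefile in the question (rules whose output contains the {sample} wildcard versus rules with fixed output paths), and the variable names are made up for the example.

```python
from collections import Counter

# Sample names taken from the config.yaml shown in the question.
samples = ["A", "B", "C"]

# Rules whose output contains the {sample} wildcard need one job per sample.
per_sample_rules = ["map_reads", "sort"]
# Rules with fixed output paths need exactly one job each.
single_rules = ["all", "call", "plot_qual"]

job_counts = Counter()
for rule in per_sample_rules:
    job_counts[rule] += len(samples)
for rule in single_rules:
    job_counts[rule] += 1

# Print a table similar in spirit to Snakemake's "Job counts" summary.
for rule in sorted(job_counts):
    print(job_counts[rule], rule)
print(sum(job_counts.values()))
```

Three samples times two per-sample rules gives 6 jobs, and the three single-instance rules add 3 more, which matches the total of 9.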

As for showing the number of files per job, you have to run the whole pipeline, because these numbers may differ from job to job and are evaluated at runtime. The Snakemake log shows the input and output of each job as it is executed.
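
If the goal is instead to get one job per file, as bli suggests in the comments, the test rule can be rewritten to handle a single file, with a second rule requesting all the outputs. The following is a minimal sketch under several assumptions: the wildcard name {name}, the rule all, and the cp command are placeholders that are not in the original Snakefile, and each output file is assumed to share its base name with the corresponding input file.

```snakemake
# Assumption: every CSV in the input directory should yield an output
# file with the same base name. glob_wildcards extracts those base names.
NAMES, = glob_wildcards("/home/MyData/input/{name}.csv")

# Aggregating rule: requests one output per input file.
rule all:
    input:
        expand("/home/MyData/output/{name}.csv", name=NAMES)

# Per-file rule: Snakemake creates one job for each requested {name}.
rule test:
    input:
        "/home/MyData/input/{name}.csv"
    output:
        "/home/MyData/output/{name}.csv"
    shell:
        # Placeholder command; replace with the real processing step.
        "cp {input} {output}"
```

With this layout, and provided the output files do not exist yet, a dry run such as snakemake -j 4 all -n should report one test job per CSV file plus a single all job.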
