Single output file per group for grouped samples using Snakemake

Question

I'm trying to write a snakemake pipeline to process samples that are divided into groups which requires a single summary file per group. For examples, samples are divided as follows:

Group 1:
    Sample 1
    Sample 2
Group 2: 
    Sample 3 
    Sample 4

Each sample is process using bedtools to generate an output file per sample. Next, I need to summarize at the group level for each Group's collection of samples. An abbreviated snakemake file looks like this:

 rule intersect:
     input:
         bam = join('{group}','{sample}.bam'),
         reg_bed = join('{group}', 'region.bed')
     output:
         reg_intersect = join('{group}', '{sample}.intersect.bed')
    shell:
        'bedtools intersect -wa -wb -a {input.reg_bed} -b {input.bam} > {output.reg_intersect}'

rule aggregate:
    input:
        rules.interesect.output
    output:
        join('{group}','summary_stats.csv')
    shell:
        touch(join('{group}','summary_stats.csv'))
        #this would be a call to a python function that operates on all the output files to generate a single file

I get complaints that wildcards not agreeing between input and output (input contains {group} and {sample} while output only has {group}. I've tried using expand() but I have to include values for both group and sample, and sample isn't possible because it is dependent upon group.

Any suggestions are welcome.

Rule aggregate uses as input the output of rule intersect. Rule intersect has two wildcards (group and sample), so rule aggregate expects those two as well. However you only specify group as wildcard. You should make an input function that uses the wildcard group to determine which samples are required. — Maarten-vd-Sande

Eric C. Eric C. · Accepted Answer · 2019-08-14T15:10:28

Like @Maarten-vd-Sande said, the best is to use an input function to get all samples for a group in your aggregate rule.

samplesInGroups = {"group1":["sample1","sample2"],"group2":["sample3","sample4"]}

def getSamplesInGroup(wildcards):
    return [join(wildcards.group,s+".intersect.bed") for s in samplesInGroups[wildcards.group]]

rule all:
     input:  expand("{group}/summary_stats.csv",group=list(samplesInGroups.keys()))

rule intersect:
     input:
         bam = join('{group}','{sample}.bam'),
         reg_bed = join('{group}', 'region.bed')
     output:
         reg_intersect = join('{group}', '{sample}.intersect.bed')
     shell:
        'bedtools intersect -wa -wb -a {input.reg_bed} -b {input.bam} > {output.reg_intersect}'

rule aggregate:
     input:
        getSamplesInGroup
     output:
        join('{group}','summary_stats.csv')
     shell:
        "python ... {input}"

Single output file per group for grouped samples using Snakemake

1 Answers