1
votes

I'm using a snakemake pipeline to run the GATK MarkDuplicate command with multiple input bam files from different read groups.

rule mark_duplicates:
    input:
        get_dedup_input
    output:
        bam=temp("bams/{patient}.{sample_type}.markdups.bam"),
        md5=temp("bams/{patient}.{sample_type}.markdups.bam.md5"),
        metrics="qc/gatk/{patient}_{sample_type}_dup_metrics.txt"
    conda:
        "../envs/gatk.yml"
    shell:
        """
        gatk MarkDuplicates -I {input} -O {output.bam} -M {output.metrics} \
            --CREATE_MD5_FILE true --ASSUME_SORT_ORDER "queryname"
        """

get_dedup_input returns a list of input bam files. MarkDuplicates requires each input file is specified with the -I flag. If I only had one bam file, I could simply write -I {input}, but this fails because this specifies -I file1.bam file2.bam it needs to be -I file1.bam -I file2.bam. What's the best way to format the input so that each input file is specified as -I [input file]?

Here are two scenarios below to clarify what the input, output, and shell commands would look like if I were to run the command manually. I've omitted some of the non-essential MarkDuplicate flags for brevity:

One Read Group

Inputs: patient101.normal.rg1.bam
Output: patient101.normal.markdups.bam
Shell: 
gatk MarkDuplicates -I patient101.normal.rg1.bam \
-O patient101.normal.markdups.bam \
-M metrics.txt

Two Read Groups

Inputs: patient101.normal.rg1.bam, patient101.normal.rg2.bam
Output: patient101.normal.markdups.bam
Shell: 
gatk MarkDuplicates -I patient101.normal.rg1.bam \
-I patient101.normal.rg2.bam \
-O patient101.normal.markdups.bam \
-M metrics.txt
1
Can you give us an example of which input and which output you expect from this rule?Maarten-vd-Sande
Post updated to show an example with the input and output files and the way the command should be run!Tomas Bencomo
I'm in the train right now, but when I arrive I'll post an answer(if someone else didn't already :) )Maarten-vd-Sande

1 Answers

1
votes

Thanks for updating your answer.

So probably the best thing to do is to is make a parameter that becomes the string we need, like so:

rule mark_duplicates:
    input:
        get_dedup_input
    output:
        bam=temp("bams/{patient}.{sample_type}.markdups.bam"),
        md5=temp("bams/{patient}.{sample_type}.markdups.bam.md5"),
        metrics="qc/gatk/{patient}_{sample_type}_dup_metrics.txt"
    conda:
        "../envs/gatk.yml"
    params:
        input=lambda wildcards, input: " -I ".join(input)
    shell:
        """
        gatk MarkDuplicates -I {params.input} -O {output.bam} -M {output.metrics} \
            --CREATE_MD5_FILE true --ASSUME_SORT_ORDER "queryname"
        """

If our input is a single bam, params.input will just be patient101.normal.rg1.bam and the -I is added in front as normal.

If we have two input bams, our lambda function puts the -I in between them and the shell command adds the -I in front.