I'm using a snakemake
pipeline to run the GATK MarkDuplicate
command with multiple input bam files from different read groups.
rule mark_duplicates:
input:
get_dedup_input
output:
bam=temp("bams/{patient}.{sample_type}.markdups.bam"),
md5=temp("bams/{patient}.{sample_type}.markdups.bam.md5"),
metrics="qc/gatk/{patient}_{sample_type}_dup_metrics.txt"
conda:
"../envs/gatk.yml"
shell:
"""
gatk MarkDuplicates -I {input} -O {output.bam} -M {output.metrics} \
--CREATE_MD5_FILE true --ASSUME_SORT_ORDER "queryname"
"""
get_dedup_input
returns a list of input bam files. MarkDuplicates
requires each input file is specified with the -I
flag. If I only had one bam file, I could simply write -I {input}
,
but this fails because this specifies -I file1.bam file2.bam
it needs to be -I file1.bam -I file2.bam
. What's the best way to format the input so that each input file is specified as -I [input file]
?
Here are two scenarios below to clarify what the input, output, and shell commands would look like if I were to run the command manually. I've omitted some of the non-essential MarkDuplicate
flags for brevity:
One Read Group
Inputs: patient101.normal.rg1.bam
Output: patient101.normal.markdups.bam
Shell:
gatk MarkDuplicates -I patient101.normal.rg1.bam \
-O patient101.normal.markdups.bam \
-M metrics.txt
Two Read Groups
Inputs: patient101.normal.rg1.bam, patient101.normal.rg2.bam
Output: patient101.normal.markdups.bam
Shell:
gatk MarkDuplicates -I patient101.normal.rg1.bam \
-I patient101.normal.rg2.bam \
-O patient101.normal.markdups.bam \
-M metrics.txt