I have a set of BAM files generated with BWA-MEM and further processed with GATK IndelRealigner and related tools. I preprocess the BAM files in smaller chunks to speed things up. However, I must merge these chunks into a single BAM file per sample before variant calling, and this has been a major problem in my Snakemake pipeline.
My input files follow this naming convention:
# Sample 1 BAM files
OVCA-1-FRESH-1_S16_L001_realigned.bam
OVCA-1-FRESH-1_S16_L002_realigned.bam
OVCA-1-FRESH-1_S16_L003_realigned.bam
OVCA-1-FRESH-1_S16_L004_realigned.bam
# Sample 2 BAM files
OVCA-2-FRESH-1_S4_L001_realigned.bam
OVCA-2-FRESH-1_S4_L002_realigned.bam
OVCA-2-FRESH-1_S4_L003_realigned.bam
OVCA-2-FRESH-1_S4_L004_realigned.bam
The problematic pipeline looks something like this:
# Collect the run IDs and lane numbers of the per-lane input files.
RUN_ID, LINES = glob_wildcards('{run_id}_L{line}_realigned.bam')

rule all:
    input:
        expand('{run_id}_realigned.bam', run_id=RUN_ID)

# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
    files = expand('{run_id}_L{line}_realigned.bam', run_id=RUN_ID, line=LINES)
    return files

# Perform BAM merging. samtools merge expects the output file first,
# and -h takes a single file to copy the header from.
rule samtools_merge:
    input:
        samtools_merge_inputs
    output:
        '{run_id}_realigned.bam'
    shell:
        'samtools merge -h {input[0]} {output} {input}'
I have tried to write an input function that collects all input files matching the current value of the run_id wildcard. When I do a dry run of the pipeline, I can see that samtools_merge_inputs does not work as intended, because it collects every available BAM file and repeats each one several times:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16
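The repetition is consistent with two documented behaviours: glob_wildcards returns one list entry per matched file, so the same run_id appears once for every lane, and expand combines its arguments as a Cartesian product by default. The plain-Python sketch below imitates that combination with itertools.product; the sample names and the lowercase variable names are made up for illustration, and the snippet is a stand-in rather than Snakemake's own code.

```python
from itertools import product

# Hypothetical stand-ins for what glob_wildcards would return for two
# samples with two lanes each: one entry per matched file, so values repeat.
run_ids = ["A_S1", "A_S1", "B_S2", "B_S2"]
lanes = ["001", "002", "001", "002"]

pattern = "{run_id}_L{line}_realigned.bam"

# Every run_id is paired with every lane; the wildcards of the current
# job play no role, so all samples end up in the same list.
files = [pattern.format(run_id=r, line=l) for r, l in product(run_ids, lanes)]

print(len(files))       # 16 paths generated
print(len(set(files)))  # only 4 distinct paths
```

Because the function never looks at its wildcards argument, every merge job receives the same oversized list regardless of which run_id it is building.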
The dry-run output should look like this:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16
How should I edit the samtools_merge_inputs function so that it collects only the desired input files? I realize that I could skip the input function and simply list the four input files for samtools_merge using wildcards, but I would really like to learn how to use input functions in cases like this, as I am facing similar problems in my other pipelines as well. I looked for help in other posts, but so far I have not found an answer that fits my case.
Thank you for your help!
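For reference, a minimal sketch of the behaviour being asked for, written as plain Python so it can be tried outside Snakemake. It assumes that the lists returned by glob_wildcards are index-aligned (entry i of each list comes from the same file) and uses types.SimpleNamespace as a hypothetical stand-in for the wildcards object Snakemake passes to an input function. The hard-coded RUN_ID and LINES values are illustrative, not real glob results.

```python
from types import SimpleNamespace

# Illustrative stand-ins for the lists glob_wildcards would return.
RUN_ID = ["OVCA-1-FRESH-1_S16"] * 4 + ["OVCA-2-FRESH-1_S4"] * 4
LINES = ["001", "002", "003", "004"] * 2


def samtools_merge_inputs(wildcards):
    # Keep only the lanes that were found for this job's own run_id.
    lanes = sorted({line for run_id, line in zip(RUN_ID, LINES)
                    if run_id == wildcards.run_id})
    # Build the per-lane file names from the current wildcard value.
    return [f"{wildcards.run_id}_L{line}_realigned.bam" for line in lanes]


# Simulate the wildcards of the job that builds OVCA-1-FRESH-1_S16.
job = SimpleNamespace(run_id="OVCA-1-FRESH-1_S16")
print(samtools_merge_inputs(job))
```

The key difference from the function in the question is that the file list is derived from wildcards.run_id instead of from the full RUN_ID list, so each merge job only sees its own lanes.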