2
votes

I have a set of BAM files that are generated with BWA-MEM and further processed with GATK IndelRealigner etc. I'm preprocessing my BAM files in smaller chunks to speed up the processing. However, I must merge these individual files to one BAM file prior to variant calling which has been a major problem for my Snakemake pipeline.

My input files follow this kind of naming convention

# Sample 1 BAM files
OVCA-1-FRESH-1_S16_L001_realigned.bam
OVCA-1-FRESH-1_S16_L002_realigned.bam
OVCA-1-FRESH-1_S16_L003_realigned.bam
OVCA-1-FRESH-1_S16_L004_realigned.bam

# Sample 2 BAM files
OVCA-2-FRESH-1_S16_L001_realigned.bam
OVCA-2-FRESH-1_S16_L002_realigned.bam
OVCA-2-FRESH-1_S16_L003_realigned.bam
OVCA-2-FRESH-1_S16_L004_realigned.bam

And problematic pipeline is something like this:

# Map start input files
RUN_ID, LINE = glob_wildcards('{run_id}_L{line}_realigned.bam')

rule all:
   input:
      expand('{run_id}_realigned.bam', run_id=RUN_ID)

# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
   files = expand('{run_id}_L{line}_realigned.bam', run_id=RUN_ID, line=LINES)
   return files

# Perform BAM merging.
rule samtools_merge:
   input:
      samtools_merge_inputs
   output:
      '{run_id}_realigned.bam
   shell:
      'samtools merge -h {input} {output}'

I have tried to build a input function that collects all available input files that match the currently processed wildcard. When I perform a dryrun for my pipeline I can see that function samtools_merge_inputs is not working properly as it collects all available BAM files and repeats them multiple times:

rule samtools_merge:
   input:
      OVCA-1-FRESH-1_S16_L001_realigned.bam,
      OVCA-1-FRESH-1_S16_L002_realigned.bam,
      OVCA-1-FRESH-1_S16_L003_realigned.bam,
      OVCA-1-FRESH-1_S16_L004_realigned.bam,
      OVCA-1-FRESH-1_S16_L001_realigned.bam,
      OVCA-1-FRESH-1_S16_L002_realigned.bam,
      OVCA-1-FRESH-1_S16_L003_realigned.bam,
      OVCA-1-FRESH-1_S16_L004_realigned.bam,
      OVCA-1-FRESH-1_S16_L001_realigned.bam,
      OVCA-1-FRESH-1_S16_L002_realigned.bam,
      OVCA-1-FRESH-1_S16_L003_realigned.bam,
      OVCA-1-FRESH-1_S16_L004_realigned.bam,
      OVCA-1-FRESH-1_S16_L001_realigned.bam,
      OVCA-1-FRESH-1_S16_L002_realigned.bam,
      OVCA-1-FRESH-1_S16_L003_realigned.bam,
      OVCA-1-FRESH-1_S16_L004_realigned.bam,
      OVCA-2-FRESH-1_S4_L001_realigned.bam,
      OVCA-2-FRESH-1_S4_L002_realigned.bam,
      OVCA-2-FRESH-1_S4_L003_realigned.bam,
      OVCA-2-FRESH-1_S4_L004_realigned.bam,
      OVCA-2-FRESH-1_S4_L001_realigned.bam,
      OVCA-2-FRESH-1_S4_L002_realigned.bam,
      OVCA-2-FRESH-1_S4_L003_realigned.bam,
      OVCA-2-FRESH-1_S4_L004_realigned.bam,
      OVCA-2-FRESH-1_S4_L001_realigned.bam,
      OVCA-2-FRESH-1_S4_L002_realigned.bam,
      OVCA-2-FRESH-1_S4_L003_realigned.bam,
      OVCA-2-FRESH-1_S4_L004_realigned.bam,
      OVCA-2-FRESH-1_S4_L001_realigned.bam,
      OVCA-2-FRESH-1_S4_L002_realigned.bam,
      OVCA-2-FRESH-1_S4_L003_realigned.bam,
      OVCA-2-FRESH-1_S4_L004_realigned.bam
   output:
      OVCA-1-FRESH-1_S16_realigned.bam
   jobid:
      18
   wildcards:
      run_id=OVCA-1-FRESH-1_S16

It should look like this:

rule samtools_merge:
   input:
      OVCA-1-FRESH-1_S16_L001_realigned.bam,
      OVCA-1-FRESH-1_S16_L002_realigned.bam,
      OVCA-1-FRESH-1_S16_L003_realigned.bam,
      OVCA-1-FRESH-1_S16_L004_realigned.bam
   output:
      OVCA-1-FRESH-1_S16_realigned.bam
   jobid:
      18
   wildcards:
      run_id=OVCA-1-FRESH-1_S16

How should I edit the samtools_merge_inputs function to collect desired input files? I do realize that I could simply forget input function and just type four input files to samtools_merge with wildcards, but I would really like to learn how to use input functions in this kind of cases as I am facing similar kind of problems in my other pipelines as well. I tried to find help from other posts but so far I have not found answers that would suit to my purposes.

Thank you for your help!

1

1 Answers

3
votes

You function does not use the wildcards here. It should be something like this:

def samtools_merge_inputs(wildcards):
    files = expand(wildcards.run_id+'_L{line}_realigned.bam', line=LINES)
    return files

if of course you have all samples on all lanes. When you invoke a function, all wildcards are passed as an object in the wildcards parameter of your function.

You can also do:

files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES)  

You have multiple things that will not work in your snakefile.
First, you're missing a " ' " in your samtools merge rule:

rule samtools_merge:
    input:
        samtools_merge_inputs
    output:
        '{run_id}_realigned.bam'<-----
    shell:
        'samtools merge -h {input} {output}'

and watch out for the variable names (LINE vs LINES)

Second, the function glob_wildcards() will return a list of all values found which means that your two variables will be as follow:

RUN_ID, LINES = glob_wildcards('{run_id}_L{line}_realigned.bam')

print(RUN_ID)
['OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16']

print(LINES)
['002', '001', '001', '002', '004', '003', '003', '004']

which I'm sure is not what you want. The solution is to have the right structure to describe your samples. For example (again if all samples are on all lanes):

RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["1","2","3","4"]

One last thing: your inputs and outputs cannot be distinguished by the wildcards, meaning that you'll end up with an error Cyclic dependency on rule samtools_merge or RecursionError: maximum recursion depth exceeded in comparison. I suggest you choose a different name for you output. All put together:

# Map start input files
RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["001","002","003","004"]

rule all:
   input:
      expand('{run_id}_realignedFinal.bam', run_id=RUN_ID)

# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
   files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES)
   return files

# Perform BAM merging.
rule samtools_merge:
   input:
      samtools_merge_inputs
   output:
      '{run_id}_realignedFinal.bam'
   shell:
      'samtools merge -h {input} {output}'

Haven't check your shell command but my doc says:
Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]