0
votes

I want to connect the realigner process with the indel reallignement. This is the rules:

rule gatk_IndelRealigner:
    input:
        tumor="mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        normal="mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
        id="mapped_reads/merged_samples/operation/{tumor}_{normal}.realign.intervals"

    output:
        "mapped_reads/merged_sample/CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "mapped_reads/merged_sample/CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} "
        "-I {input.tumor} -I {input.normal}  -known {params.ph1_indels} -known {params.mills} -nWayOut  .cleaned.bam --maxReadsInMemory 500000 --noOriginalAligmentTags --targetIntervals {input.id} >& {log}  "

This is the error:

Not all output files of rule gatk_IndelRealigner contain the same wildcards.

I suppose I need to use also the {tumor}_{normal} but I can't use. Snakemake:

rule all:
      input:expand("mapped_reads/merged_samples/CoClean/{sample}.sorted.dup.reca.cleaned.bam",sample=config['samples']),
           expand("mapped_reads/merged_samples/operation/{sample[1][tumor]}_{sample[1][normal]}.realign.intervals", sample=read_table(config["conditions"], ",").iterrows())

config.yml

conditions: "conditions.csv"

conditions.csv

tumor,normal
411,412

Here you can see an example of the code (for testing purpose) gave the same error:

directory

$ tree  prova/
prova/
├── condition.csv
├── config.yaml
├── output
│   ├── ABC.bam
│   ├── pippa.bam
│   ├── Pippo.bam
│   ├── TimBorn.bam
│   ├── TimNorm.bam
│   ├── TimTum.bam
│   └── XYZ.bam
└── Snakefile

this is snakemake

$ cat prova/Snakefile 
from pandas import read_table

configfile: "config.yaml"

rule all:
    input:
         expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows()),
         expand("CoClean/{sample[1][tumor]}.bam",  sample=read_table(config["sampleFILE"], " ").iterrows()),
         expand("CoClean/{sample[1][normal]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows())

rule gatk_RealignerTargetCreator:
     input:
         "{pathGRTC}/{normal}.bam",
         "{pathGRTC}/{tumor}.bam",
     output:
         "{pathGRTC}/{tumor}_{normal}.bam"
 #    wildcard_constraints:
 #        tumor = '[^_|-|\/][0-9a-zA-Z]*',
 #        normal = '[^_|-|\/][0-9a-zA-Z]*'
     run:
         call('touch ' + str(wildcard.tumor) + '_' + str(wildcard.normal) + '.bam', shell=True)



rule gatk_IndelRealigner:
    input:
        t1="output/{tumor}.bam",
        n1="output/{normal}.bam",

    output:
        "CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} -I {input.t1} -I {input.n1}  & {log}  "

conditions.csv

$ more condition.csv 
tumor normal
TimTum TimBorn
XYZ ABC
Pippo pippa

Thanks for any suggestion

1

1 Answers

1
votes

I'm not convinced you have to include two input files to the GATK IndelRealigner. Building from that assumption, you can alter the rule to become indifferent to the "type (tumor vs normal)" of file it is process. I read the specs here. Please, if I am wrong, stop reading and correct me.

rule gatk_IndelRealigner:
    input:
        inputBAM="output/{sampleGATKIR}.bam",

    output:
        "CoClean/{sampleGATKIR}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{sampleGATKIR}.indel_realign_2.log"
    params:
        genome="**DONT FORGET TO ADD THIS""
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome}  "
        "-nt {threads} -I {input.inputBAM}   & {log}  "

By changing the rule to be bam-type agnostic (made up word) you gain two advantages, and there is one main disadvantage.

Advantages:

  1. Now we only have a single wild-card

  2. We can run the alignment of each .bam file independently, which with a devoted CPU should hopefully make things faster.

Disadvantage:

  1. We are now likely putting two copies of the genome onto memory somewhere, since the threads are now being run as separate processes, no more memory sharing of the genome file. (In my previous position, hardware availability wasn't typically an issue, so I heavily am biased towards splitting everything up)

The reason I think that the GATK documentation has it setup to accept multiple 'bam' files is because if you are just using it as a 1-off call you want to list all the files at the same time. We are not needing that since we are automating the call process. We're indifferent to 1 call or 100 calls.