Syntax for merging technical replicates in Snakefile and config file

Question

How can I specify inputs for technical replicates in the snakemake config file, where the goal is aligning paired end reads to a reference genome, and then merging alignment files of replicates into a single file for each individual? My files are named according to the pattern raw-fastq/sampleX_groupX_RX.fq.gz where each sample has multiple groups, and each group has R1 and R2. It sounds like the solution might be to write an input function that creates the list of proper inputs based on the config file, but it doesn't seem like there is a standard way of doing this, and being quite new to python I am not sure how to approach this. In my case there are exactly two groups for each sample, so I tried to use this simple config.yaml:

samples:
  - sample1
  - sample2

groups:
  - group1
  - group2

Here is the workflow I would like to use, or something similar:

configfile: "config.yaml"

rule all:
    input: 
        expand("merged/{sample}.bam", sample=config["samples"])

rule cutadapt:
    input:
        expand(['raw-fastq/{sample}_{group}_R1.fq.gz', 'raw-fastq/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
    output:
        read1 = temp("trimmed_reads/{sample}_{group}_R1.fq.gz"),
        read2 = temp("trimmed_reads/{sample}_{group}_R2.fq.gz")
    shell:
        "cutadapt -e 0.2 -O 5 -a AAGTCGGX -A AAGTCGGX -o {output.read1} -p {output.read2} {input.read1} {input.read2}"

rule bwa_mem:
    input:
        expand(['trimmed_reads/{sample}_{group}_R1.fq.gz', 'trimmed_reads/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
    output:
        temp("mapped/{sample}_{group}.unsorted.sam")
    params:
        genome = "/path-to-genome"
    log:
        "mapped/log/{sample}_{group}_bwa_mem.log"
    benchmark:
        "benchmarks/bwa_mem/{sample}_{group}.txt"
    threads: 8
    shell:
        "bwa mem -R '@RG\tID:{wildcards.sample}_{wildcards.group}\tSM:{wildcards.sample}' -t {threads} {params.genome} {input} 2> {log} > {output}"

rule samtools_sort:
    input: 
        expand("mapped/{sample}_{group}.unsorted.bam", sample=config["samples"], group=config["groups"]
    output:
        "mapped/{sample}_{group}.sorted.bam"
    shell:
        "samtools sort {input} > {output}"

rule samtools_merge:
    input: 
        expand(['mapped/{sample}_group1.sorted.bam', 'mapped/{sample}_group2.sorted.bam'], sample=config["samples"])
    output:
        protected("merged/{sample}.merged.bam")
    shell:
        "samtools merge {output} {input}"

I would like these rules to generate, for example, the single file merged/sample1.bam from the four input files

raw-fastq/sample1_group1_R1.fq
raw-fastq/sample1_group1_R2.fq
raw-fastq/sample1_group2_R1.fq
raw-fastq/sample1_group2_R2.fq

and likewise for each sample.

Running snakemake -np doesn't throw any errors, but I can tell there is a problem with the way I'm trying to use the expand function: the input is taken to a list of all the possible combinations, rather than iterating through the list of samples and groups. For example, the DAG looks like:

so I can tell there is a problem with the way I'm trying to specify the inputs , but I don't know the proper way of doing this. I would like to know (1) what the proper syntax is for specifying the paired file input in which there are multiple wildcards per filename, in both the config.yaml file and Snakefile, and (b) how to deal with the situation of merging multiple inputs from replicates into a single output file during the samtools_merge rule.

What is the expected solution? What are the expected output filenames for the samples and groups in your config file? I would remove parts of the question that are not relevant anymore. I am choosing to not re-read your question, as I'm not sure which of your original question is still relevant. — Manavalan Gajapathy
@JeeYem sorry if it's unclear. I wrote in the question that I would like to generate a single bam file for each sample, e.g. merged/sample1.bam which is generated by merging mapped/sample1_group1.bam and mapped/sample1_group2.bam. Thanks — Jeff Groh
Please see the updated answer. I forget if stackoverflow notifies for updated answers. — Manavalan Gajapathy

Manavalan Gajapathy Manavalan Gajapathy · Accepted Answer · 2019-03-01T05:12:09

It appears that your yaml config file is not in proper lists format. They need to be hyphenated. Otherwise, based on your example, it looks like it gets read as a string.

samples:
    - sample1
    - sample2

groups:
    - group1
    - group2

In your rule all, wildcard {sample} is not defined. What is it supposed to be? Also, to further clarify your question, can you provide an example - what are the expected output filenames for the samples and groups in your config file?

Updated answer:

Problem here is because each rule was using more files than were needed. For example, rule cutadapt needs to act on only one sample and one group at a time; not all samples and groups. I would recommend walking through snakemake tutorial example for better understanding.

Here is modified code. I haven't tried executing it, but I think this would work.

configfile: "config.yaml"

rule all:
    input: 
        expand("merged/{sample}.bam", sample=config["samples"])

rule cutadapt:
    input:
        expand('raw-fastq/{{sample}}_{{group}}_R{read}.fq.gz', 
                read=[1,2])
    output:
        read1 = temp("trimmed_reads/{sample}_{group}_R1.fq.gz"),
        read2 = temp("trimmed_reads/{sample}_{group}_R2.fq.gz")
    message:
        "Trimming adapters - {wildcards.sample}, {wildcards.group}"
    shell:
        "cutadapt -e 0.2 -O 5 -a AAGTCGGX -A AAGTCGGX -o {output.read1} -p {output.read2} {input.read1} {input.read2}"


rule bwa_mem:
    input:
        expand(['trimmed_reads/{{sample}}_{{group}}_R{read}.fq.gz', 
                read=[1,2])
    output:
        temp("mapped/{sample}_{group}.unsorted.sam")
    params:
        genome = "/path-to-genome"
    log:
        "mapped/log/{sample}_{group}_bwa_mem.log"
    benchmark:
        "benchmarks/bwa_mem/{sample}_{group}.txt"
    threads: 8
    shell:
        "bwa mem -R '@RG\tID:{wildcards.sample}_{wildcards.group}\tSM:{wildcards.sample}' -t {threads} {params.genome} {input} 2> {log} > {output}"


rule samtools_sort:
    input: 
        "mapped/{sample}_{group}.unsorted.bam"
    output:
        "mapped/{sample}_{group}.sorted.bam"
    shell:
        "samtools sort {input} > {output}"

rule samtools_merge:
    input: 
        expand('mapped/{{sample}}_{group}.sorted.bam', group=config['groups'])
    output:
        protected("merged/{sample}.merged.bam")
    shell:
        "samtools merge {output} {input}"

Syntax for merging technical replicates in Snakefile and config file

1 Answers