Snakemake: How to use config file efficiently

Question

I'm using the following config file format in Snakemake for some sequencing analysis practice (I have loads of samples, each with 2 fastq files):

samples:
    Sample1_XY:
        - fastq_files/SRR4356728_1.fastq.gz
        - fastq_files/SRR4356728_2.fastq.gz
    Sample2_AB:
        - fastq_files/SRR6257171_1.fastq.gz
        - fastq_files/SRR6257171_2.fastq.gz

I'm using the following rules at the start of my pipeline to run FastQC and to align the fastq files:

import os
# read config info into this namespace
configfile: "config.yaml"

rule all:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_fastqc.html",
        zip="FastQC/{sample}_fastqc.zip"
    params: ""
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
         sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
         "bam_files/{sample}.bam"
    log:
         "logs/bowtie2/{sample}.txt"
    params:
         index=config["index"],  # prefix of reference genome index (built with bowtie2-build)
         extra=""
    threads: 8
    wrapper:
         "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
         expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
    output:
         "FastQC/fastq_multiqc.html"
    params:
    log:
         "logs/multiqc.log"
    wrapper:
         "0.21.0/bio/multiqc"

My issue is with the fastqc rule.

Currently both the fastqc rule and the bowtie2 rule create one output file from two inputs, SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.

I need the fastqc rule to generate two files, a separate one for each of the fastq.gz files, but I'm unsure how to index the config file correctly from the fastqc rule's input statement, or how to combine expand and wildcards to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but I can't get both to run individually/separately.

I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.

Any tips would be greatly appreciated.

How about using fastq_files/SRR instead of fastq_files/SRR_[1|2].fastq.gz in config file, and be explicit about 1 and 2 in rules' input and output? — Manavalan Gajapathy
Hi JeeYem, thanks for your suggestion. I tried this but it doesn't work as my file names are a bit more complex than what my original post said, and I have about 50 fastq files in total with unique fastq names and unique samples names. I have amended this now. I also tried adding [0|1] at the end of my fastqc input statement but it only ran the SRRXXXXXXX_2 files. If I use [0] or [1] separately I get each individual file, but I need a way for both to run in a separate instance of the rule. Hopefully the amended question makes the issue clear now. — Darren

Manavalan Gajapathy · Accepted Answer · 2018-05-03T16:21:35

It appears that each sample has two fastq files, named in the format ***_1.fastq.gz and ***_2.fastq.gz. In that case, the config and code below should work.

config.yaml:

samples:
    Sample_A: fastq_files/SRR4356728
    Sample_B: fastq_files/SRR6257171
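
With around 50 files, the prefix-style mapping above does not have to be typed by hand. The plain-Python sketch below derives it from the question's list-style layout. It assumes every pair really ends in _1.fastq.gz and _2.fastq.gz; the helper name to_prefix is hypothetical, and the sample names are the ones from the question rather than Sample_A/Sample_B.

```python
import re

# List-style config as posted in the question (paths copied from the post).
original = {
    "Sample1_XY": [
        "fastq_files/SRR4356728_1.fastq.gz",
        "fastq_files/SRR4356728_2.fastq.gz",
    ],
    "Sample2_AB": [
        "fastq_files/SRR6257171_1.fastq.gz",
        "fastq_files/SRR6257171_2.fastq.gz",
    ],
}

def to_prefix(paths):
    """Return the prefix shared by a _1/_2 pair; raise if the names do not fit."""
    prefixes = {re.sub(r"_[12]\.fastq\.gz$", "", p) for p in paths}
    # A path left unchanged by re.sub did not match the expected suffix.
    if len(prefixes) != 1 or any(p in prefixes for p in paths):
        raise ValueError(f"not a _1/_2 pair: {paths}")
    return prefixes.pop()

prefix_config = {sample: to_prefix(paths) for sample, paths in original.items()}
print(prefix_config)
```

If any pair does not follow the naming pattern, the helper raises instead of guessing, which makes the irregular file names easy to spot.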

Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])  # debug: show the sample-to-prefix mapping

rule all:
    input:
        expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"

rule fastqc:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_{num}_fastqc.html",
        zip="FastQC/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.21.0/bio/fastqc"

rule bowtie2:
    input:
         sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
         "bam_files/{sample}.bam"
    wrapper:
         "0.21.0/bio/bowtie2/align"

rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.21.0/bio/multiqc"
