I'm using the following config file format in snakemake for a some sequencing analysis practice (I have loads of samples each containing 2 fastq files:
samples:
Sample1_XY:
- fastq_files/SRR4356728_1.fastq.gz
- fastq_files/SRR4356728_2.fastq.gz
Sample2_AB:
- fastq_files/SRR6257171_1.fastq.gz
- fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and for alignment of the fastqc files:
import os
# read config info into this namespace
configfile: "config.yaml"
rule all:
input:
expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_fastqc.html",
zip="FastQC/{sample}_fastqc.zip"
params: ""
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
output:
"FastQC/fastq_multiqc.html"
params:
log:
"logs/multiqc.log"
wrapper:
"0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated using two inputs SRRXXXXXXX_1.fastq.gz
and SRRXXXXXXX_2.fastq.gz
.
I need the fastq rule to generate two files, a separate one for each of the fastq.gz
files but I'm unsure how to index the config file correctly from the fastqc rule input statement, or how to combine the the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0]
or [1]
to the end of the input statement, but not both run individually/separately.
I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np
to generate a job list.
Any tips would be greatly appreciated.
fastq_files/SRR
instead offastq_files/SRR_[1|2].fastq.gz
in config file, and be explicit about1
and2
in rules' input and output? – Manavalan Gajapathy[0|1]
at the end of my fastqc input statement but it only ran theSRRXXXXXXX_2 files
. If I use[0]
or[1]
separately I get each individual file, but I need a way for both to run in a separate instance of the rule. Hopefully the amended question makes the issue clear now. – Darren