I'm trying out a Snakemake pipeline and I'm stuck on an error I really don't understand.
I've got a directory (raw_data) in which I have the input files:
ll /home/nico/labo/etudes/Optimal/data/raw_data
total 41M
drwxrwxr-x 2 nico nico 4,0K mars 6 16:09 ./
drwxrwxr-x 11 nico nico 4,0K mars 6 16:14 ../
-rw-rw-r-- 1 nico nico 15M févr. 27 12:21 sampleA_R1.fastq.gz
-rw-rw-r-- 1 nico nico 19M févr. 27 12:22 sampleA_R2.fastq.gz
-rw-rw-r-- 1 nico nico 3,4M févr. 27 12:21 sampleB_R1.fastq.gz
-rw-rw-r-- 1 nico nico 4,3M févr. 27 12:22 sampleB_R2.fastq.gz
This directory contains 4 files for 2 samples.
I created a JSON config file for the Snakemake pipeline, named config_snakemake_Optimal_mapping_BaL.json:
{
    "fastqExtension": "fastq.gz",
    "fastqDir": "/home/nico/labo/etudes/Optimal/data/raw_data",
    "outputDir": "/home/nico/labo/etudes/Optimal/data/mapping_BaL",
    "logDir": "logs",
    "reference": {
        "fasta": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta",
        "index": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta.bwt"
    }
}
And finally the Snakemake file, snakefile_bwa_samtools.py:
import subprocess
from os.path import join

### Globals ---------------------------------------------------------------------
# A Snakemake regular expression matching fastq files.
SAMPLES, = glob_wildcards(join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]))
print(SAMPLES)

### Rules -----------------------------------------------------------------------
# Pipeline output files
rule all:
    input: expand(join(config["outputDir"], "{sample}.bam.bai"), sample=SAMPLES)

# Reads alignment on reference genome and BAM file creation
rule bwa_mem_to_bam:
    input:
        index = config["reference"]["index"],
        fasta = config["reference"]["fasta"],
        fq1_ID = "{sample}_R1."+config["fastqExtension"],
        fq2_ID = "{sample}_R2."+config["fastqExtension"],
        fq1 = join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]),
        fq2 = join(config["fastqDir"], "{sample}_R2."+config["fastqExtension"])
    output:
        temp(join(config["outputDir"], "{sample}.bamUnsorted"))
    version:
        subprocess.getoutput(
            "man bwa | tail -n 1 | cut -d ' ' -f 1 | cut -d '-' -f 2"
        )
    log:
        join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
    message:
        "Alignment of {input.fq1_ID} and {input.fq2_ID} on {input.fasta} with BWA version {version}."
    shell:
        "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"

# Sorting the BAM files on genomic positions
rule bam_sort:
    input:
        join(config["outputDir"], "{sample}.bamUnsorted")
    output:
        join(config["outputDir"], "{sample}.bam")
    log:
        join(config["outputDir"], config["logDir"], "{sample}.samtools_sort.log")
    version:
        subprocess.getoutput(
            "samtools --version | "
            "head -1 | "
            "cut -d' ' -f2"
        )
    message:
        "Genomic sorting of {input} with samtools version {version}."
    shell:
        "samtools sort -f {input} {output} 2> {log}"

# Indexing the BAM files
rule bam_index:
    input:
        join(config["outputDir"], "{sample}.bam")
    output:
        join(config["outputDir"], "{sample}.bam.bai")
    message:
        "Indexing {input}."
    shell:
        "samtools index {input}"
I run this pipeline:
snakemake --cores 3 --snakefile /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py --configfile /home/nico/labo/etudes/Optimal/data/snakemake_config_files/config_snakemake_Optimal_mapping_BaL.json
and I get the following error output:
['sampleB', 'sampleA']
MissingInputException in line 18 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
Missing input files for rule bwa_mem_to_bam:
sampleB_R1.fastq.gz
sampleB_R2.fastq.gz
or, depending on the run:
['sampleB', 'sampleA']
PeriodicWildcardError in line 40 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
The value _unsorted in wildcard sample is periodically repeated (sampleB_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted). This would lead to an infinite recursion. To avoid this, e.g. restrict the wildcards in this rule to certain values.
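To illustrate what this second message describes, here is a rough sketch in plain Python. It is not Snakemake's actual resolver, only a simplified model. It assumes a variant of the Snakefile in which the sorting rule reads "{sample}_unsorted.bam" and writes "{sample}.bam" (the "_unsorted" in the message suggests that, but the code posted above uses ".bamUnsorted", so this is a guess). The helper names wildcard_regex and requested_chain are made up for the illustration, and the "[^_]+" constraint is just one example of restricting the wildcard.

```python
import re

def wildcard_regex(pattern, constraint=".+"):
    """Turn a single-wildcard pattern such as '{sample}.bam' into a regex.

    An unconstrained Snakemake wildcard matches '.+', which is the default here.
    """
    prefix, suffix = pattern.split("{sample}")
    return re.compile(
        re.escape(prefix) + f"(?P<sample>{constraint})" + re.escape(suffix) + "$"
    )

def requested_chain(target, output_pattern, input_pattern, constraint=".+", depth=4):
    """List the files a self-matching rule would ask for, up to `depth` steps."""
    regex = wildcard_regex(output_pattern, constraint)
    chain = [target]
    for _ in range(depth):
        match = regex.match(chain[-1])
        if match is None:
            # The rule's output pattern no longer matches: the recursion stops.
            break
        chain.append(input_pattern.format(sample=match.group("sample")))
    return chain

# Hypothetical rule: output "{sample}.bam", input "{sample}_unsorted.bam".
unconstrained = requested_chain("sampleB.bam", "{sample}.bam", "{sample}_unsorted.bam")
constrained = requested_chain(
    "sampleB.bam", "{sample}.bam", "{sample}_unsorted.bam", constraint="[^_]+"
)

print(unconstrained)
print(constrained)
```

In this model the unconstrained wildcard keeps swallowing the "_unsorted" suffix, so each step asks for a longer file name, while the constrained one stops after the first step. In a real Snakefile the equivalent restriction would presumably go through a wildcard_constraints directive, or through file names that the sorting rule's output pattern cannot match.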
The samples are correctly detected, since they appear in the list printed on the first line of both outputs, so I'm surely messing up the wildcards in the rule bwa_mem_to_bam, but I really don't get why.
Any clue?
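One thing that may be worth checking is how the four fastq entries of bwa_mem_to_bam expand for a given sample. The sketch below is plain Python, not Snakemake: it only rebuilds the strings from the config values shown above, and the helper name input_paths is made up for the illustration. It uses posixpath so that it behaves the same on any OS (on Linux, os.path is the same module).

```python
from posixpath import isabs, join

# Values copied from the config file in the question.
config = {
    "fastqExtension": "fastq.gz",
    "fastqDir": "/home/nico/labo/etudes/Optimal/data/raw_data",
}

def input_paths(sample):
    """Expand the four fastq input patterns of bwa_mem_to_bam for one sample."""
    ext = config["fastqExtension"]
    return {
        # Bare file names: no directory component at all.
        "fq1_ID": f"{sample}_R1.{ext}",
        "fq2_ID": f"{sample}_R2.{ext}",
        # Full paths built from fastqDir.
        "fq1": join(config["fastqDir"], f"{sample}_R1.{ext}"),
        "fq2": join(config["fastqDir"], f"{sample}_R2.{ext}"),
    }

paths = input_paths("sampleB")
for name, path in paths.items():
    kind = "absolute" if isabs(path) else "relative to the working directory"
    print(f"{name}: {path} ({kind})")
```

The fq1_ID and fq2_ID entries come out as bare file names, which are exactly the two names listed in the MissingInputException. Since everything under input: is treated as a file that must exist, and relative paths are looked up from the working directory, those two entries would be reported missing unless the pipeline is run from inside raw_data. A possible way around this, which I have not tested here, would be to declare the bare names under params: instead of input: and refer to them as {params.fq1_ID} and {params.fq2_ID} in the message.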