1
votes

I'm trying a SnakeMake pipeline and I'm stucked on an error I really don't understand.

I've got a directory (raw_data) in which I have the input files :

ll /home/nico/labo/etudes/Optimal/data/raw_data
total 41M
drwxrwxr-x  2 nico nico 4,0K mars   6 16:09 ./
drwxrwxr-x 11 nico nico 4,0K mars   6 16:14 ../
-rw-rw-r--  1 nico nico  15M févr. 27 12:21 sampleA_R1.fastq.gz
-rw-rw-r--  1 nico nico  19M févr. 27 12:22 sampleA_R2.fastq.gz
-rw-rw-r--  1 nico nico 3,4M févr. 27 12:21 sampleB_R1.fastq.gz
-rw-rw-r--  1 nico nico 4,3M févr. 27 12:22 sampleB_R2.fastq.gz

This directory contains 4 files for 2 samples. I created a config json file for the SnakeMake pipeline named config_snakemake_Optimal_mapping_BaL.json:

{
    "fastqExtension": "fastq.gz",
    "fastqDir": "/home/nico/labo/etudes/Optimal/data/raw_data",
    "outputDir": "/home/nico/labo/etudes/Optimal/data/mapping_BaL",
    "logDir": "logs",
    "reference": {
        "fasta": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta",
        "index": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta.bwt"
    }
}

And finally the SnakeMake file snakefile_bwa_samtools.py:

import subprocess
from os.path import join

### Globals ---------------------------------------------------------------------

# A Snakemake regular expression matching fastq files.

SAMPLES, = glob_wildcards(join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]))
print(SAMPLES)

### Rules -----------------------------------------------------------------------

# Pipeline output files
rule all:
    input: expand(join(config["outputDir"], "{sample}.bam.bai"), sample=SAMPLES)

# Reads alignment on reference genome and BAM file creation
rule bwa_mem_to_bam:
    input:
        index = config["reference"]["index"],
        fasta = config["reference"]["fasta"],
        fq1_ID = "{sample}_R1."+config["fastqExtension"],
        fq2_ID = "{sample}_R2."+config["fastqExtension"],
        fq1 = join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]),
        fq2 = join(config["fastqDir"], "{sample}_R2."+config["fastqExtension"])
    output:
        temp(join(config["outputDir"], "{sample}.bamUnsorted"))
    version:
        subprocess.getoutput(
        "man bwa | tail -n 1 | cut -d ' ' -f 1 | cut -d '-' -f 2"
        )
    log:
        join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
    message:
        "Alignment of {input.fq1_ID} and {input.fq2_ID} on {input.fasta} with BWA version {version}."
    shell:
        "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"

# Sorting the BAM files on genomic positions
rule bam_sort:
    input:
        join(config["outputDir"], "{sample}.bamUnsorted")
    output:
        join(config["outputDir"], "{sample}.bam")
    log:
        join(config["outputDir"], config["logDir"],  "{sample}.samtools_sort.log")
    version:
        subprocess.getoutput(
            "samtools --version | "
            "head -1 | "
            "cut -d' ' -f2"
        )
    message:
        "Genomic sorting of {input} with samtools version {version}."
    shell:
        "samtools sort -f {input} {output} 2> {log}"

# Indexing the BAM files
rule bam_index:
    input:
        join(config["outputDir"], "{sample}.bam")
    output:
        join(config["outputDir"], "{sample}.bam.bai")
    message:
        "Indexing {input}."
    shell:
        "samtools index {input}"

I run this pipeline:

snakemake --cores 3 --snakefile /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py --configfile /home/nico/labo/etudes/Optimal/data/snakemake_config_files/config_snakemake_Optimal_mapping_BaL.json

and I've got the following error outputs:

['sampleB', 'sampleA']
MissingInputException in line 18 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
Missing input files for rule bwa_mem_to_bam:
sampleB_R1.fastq.gz
sampleB_R2.fastq.gz

or depending the moment:

['sampleB', 'sampleA']
PeriodicWildcardError in line 40 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
The value _unsorted in wildcard sample is periodically repeated (sampleB_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted). This would lead to an infinite recursion. To avoid this, e.g. restrict the wildcards in this rule to certain values.

The samples are correctly detected as they appear in the list (first line of kind of outputs) and I'm surely messing around with the wildcards in the rule bwa_mem_to_bam, but I really don't get why.. Any clue?

3

3 Answers

1
votes

I quickly looked your code.

Why didn't the first one work out?

Look when you declare fq1_ID and fq1, same for sample 2. You didn't assign the same string. For fq1 you add a repertory for the file witch is not present for fq1_ID so snakemake is searching it in the workdir (current directory if -d option is not set) a file name with your string. Beacuse these variables are in input section.

So by removing the two fq1/2_ID, it will erase all files searching problems.

Hugo

0
votes

Finally, I succed with the pipeline removing the fq1_ID and fq2_ID variables in the rule bwa_mem_to_bam and replacing in the message of the rule input.fq1_ID and input.fq2_ID by input.fq1 and input.fq2.

The message is less elegant, but the pipeline is running correctly. Still doesn't understand exactly where was the mistake, if someone can explain, I'm still listening!

The correct code for rule bwa_mem_to_bam:

rule bwa_mem_to_bam:
input:
    index = config["reference"]["index"],
    fasta = config["reference"]["fasta"],
    fq1 = join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]),
    fq2 = join(config["fastqDir"], "{sample}_R2."+config["fastqExtension"])
output:
    temp(join(config["outputDir"], "{sample}.bamUnsorted"))
version:
    subprocess.getoutput(
    "man bwa | tail -n 1 | cut -d ' ' -f 1 | cut -d '-' -f 2"
    )
log:
    join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
message:
    "Alignment of {input.fq1} and {input.fq2} on {input.fasta} with BWA version {version}."
shell:
    "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"
0
votes

Thanks Hugo for checking my code and your explanation, it makes sense! I finally get a flash idea waking up this morning (the best ones), and realized that I neglected the params part of the rule, fq1_ID and fq2_ID are not inputs but params.. I changed the code to that:

rule bwa_mem_to_bam:
input:
    index = config["reference"]["index"],
    fasta = config["reference"]["fasta"],
    fq1 = join(config["fastqDir"], "{sample}_R1.fastq.gz"),
    fq2 = join(config["fastqDir"], "{sample}_R2.fastq.gz")
output:
    temp(join(config["outputDir"],"{sample}_unsorted.bam"))
params:
    fq1_ID = "{sample}_R1.fastq.gz",
    fq2_ID = "{sample}_R2.fastq.gz",
    ref_ID = os.path.basename(config["reference"]["fasta"])
version:
    subprocess.getoutput(
    "man bwa | tail -n 1 | cut -d ' ' -f 1 | cut -d '-' -f 2"
    )
log:
    join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
message:
    "Alignment of {params.fq1_ID} and {params.fq2_ID} on {params.ref_ID} with BWA version {version}."
shell:
    "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"

And it works just fine!

snakemake --cores 3 --snakefile /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py --configfile /home/nico/labo/etudes/Optimal/data/snakemake_config_files/config_snakemake_Optimal_mapping_BaL.json 
Provided cores: 3
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   all
    2   bam_index
    2   bam_sort
    2   bwa_mem_to_bam
    7
Alignment of sampleB_R1.fastq.gz and sampleB_R2.fastq.gz on BaL_AY713409.fasta with BWA version 0.7.12.

Alignment of sampleA_R1.fastq.gz and sampleA_R2.fastq.gz on BaL_AY713409.fasta with BWA version 0.7.12.

1 of 7 steps (14%) done
Genomic sorting of sampleB_unsorted.bam with samtools version 1.2.

Removing temporary output file /home/nico/labo/etudes/Optimal/data/mapping_BaL/sampleB_unsorted.bam.
2 of 7 steps (29%) done
Indexing sampleB.bam.

3 of 7 steps (43%) done
4 of 7 steps (57%) done
Genomic sorting of sampleA_unsorted.bam with samtools version 1.2.

Removing temporary output file /home/nico/labo/etudes/Optimal/data/mapping_BaL/sampleA_unsorted.bam.
5 of 7 steps (71%) done
Indexing sampleA.bam.

6 of 7 steps (86%) done
localrule all:
    input: /home/nico/labo/etudes/Optimal/data/mapping_BaL/sampleB.bam.bai, /home/nico/labo/etudes/Optimal/data/mapping_BaL/sampleA.bam.bai

7 of 7 steps (100%) done

And finally get my correct messages:

  • Alignment of sampleB_R1.fastq.gz and sampleB_R2.fastq.gz on BaL_AY713409.fasta with BWA version 0.7.12.
  • Alignment of sampleA_R1.fastq.gz and sampleA_R2.fastq.gz on BaL_AY713409.fasta with BWA version 0.7.12.