Snakemake: MissingInputException in snakemake pipeline

Question

I'm trying out a Snakemake pipeline and I'm stuck on an error I really don't understand.

I have a directory (raw_data) containing the input files:

ll /home/nico/labo/etudes/Optimal/data/raw_data
total 41M
drwxrwxr-x  2 nico nico 4,0K mars   6 16:09 ./
drwxrwxr-x 11 nico nico 4,0K mars   6 16:14 ../
-rw-rw-r--  1 nico nico  15M févr. 27 12:21 sampleA_R1.fastq.gz
-rw-rw-r--  1 nico nico  19M févr. 27 12:22 sampleA_R2.fastq.gz
-rw-rw-r--  1 nico nico 3,4M févr. 27 12:21 sampleB_R1.fastq.gz
-rw-rw-r--  1 nico nico 4,3M févr. 27 12:22 sampleB_R2.fastq.gz

This directory contains 4 files for 2 samples. I created a JSON config file for the Snakemake pipeline, named config_snakemake_Optimal_mapping_BaL.json:

{
    "fastqExtension": "fastq.gz",
    "fastqDir": "/home/nico/labo/etudes/Optimal/data/raw_data",
    "outputDir": "/home/nico/labo/etudes/Optimal/data/mapping_BaL",
    "logDir": "logs",
    "reference": {
        "fasta": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta",
        "index": "/home/nico/labo/references/genomes/HIV1/BaL_AY713409/BaL_AY713409.fasta.bwt"
    }
}

And finally the Snakemake file, snakefile_bwa_samtools.py:

import subprocess
from os.path import join

### Globals ---------------------------------------------------------------------

# A Snakemake regular expression matching fastq files.

SAMPLES, = glob_wildcards(join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]))
print(SAMPLES)

### Rules -----------------------------------------------------------------------

# Pipeline output files
rule all:
    input: expand(join(config["outputDir"], "{sample}.bam.bai"), sample=SAMPLES)

# Reads alignment on reference genome and BAM file creation
rule bwa_mem_to_bam:
    input:
        index = config["reference"]["index"],
        fasta = config["reference"]["fasta"],
        fq1_ID = "{sample}_R1."+config["fastqExtension"],
        fq2_ID = "{sample}_R2."+config["fastqExtension"],
        fq1 = join(config["fastqDir"], "{sample}_R1."+config["fastqExtension"]),
        fq2 = join(config["fastqDir"], "{sample}_R2."+config["fastqExtension"])
    output:
        temp(join(config["outputDir"], "{sample}.bamUnsorted"))
    version:
        subprocess.getoutput(
        "man bwa | tail -n 1 | cut -d ' ' -f 1 | cut -d '-' -f 2"
        )
    log:
        join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
    message:
        "Alignment of {input.fq1_ID} and {input.fq2_ID} on {input.fasta} with BWA version {version}."
    shell:
        "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"

# Sorting the BAM files on genomic positions
rule bam_sort:
    input:
        join(config["outputDir"], "{sample}.bamUnsorted")
    output:
        join(config["outputDir"], "{sample}.bam")
    log:
        join(config["outputDir"], config["logDir"],  "{sample}.samtools_sort.log")
    version:
        subprocess.getoutput(
            "samtools --version | "
            "head -1 | "
            "cut -d' ' -f2"
        )
    message:
        "Genomic sorting of {input} with samtools version {version}."
    shell:
        "samtools sort -f {input} {output} 2> {log}"

# Indexing the BAM files
rule bam_index:
    input:
        join(config["outputDir"], "{sample}.bam")
    output:
        join(config["outputDir"], "{sample}.bam.bai")
    message:
        "Indexing {input}."
    shell:
        "samtools index {input}"

I run this pipeline:

snakemake --cores 3 --snakefile /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py --configfile /home/nico/labo/etudes/Optimal/data/snakemake_config_files/config_snakemake_Optimal_mapping_BaL.json

and I get the following error output:

['sampleB', 'sampleA']
MissingInputException in line 18 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
Missing input files for rule bwa_mem_to_bam:
sampleB_R1.fastq.gz
sampleB_R2.fastq.gz

or, depending on the run:

['sampleB', 'sampleA']
PeriodicWildcardError in line 40 of /home/nico/labo/scripts/pipeline_illumina/snakefile_bwa_samtools.py:
The value _unsorted in wildcard sample is periodically repeated (sampleB_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted_unsorted). This would lead to an infinite recursion. To avoid this, e.g. restrict the wildcards in this rule to certain values.

The samples are correctly detected, since they appear in the list (the first line of each output). I'm surely getting something wrong with the wildcards in the rule bwa_mem_to_bam, but I really don't see what. Any clue?

Pereira Hugo · Accepted Answer · 2017-03-07T04:49:01

I had a quick look at your code.

Why didn't the first one work out?

Look at how you declare fq1_ID and fq1 (and likewise fq2_ID and fq2). They are not assigned the same string: fq1 includes the directory that holds the file, but fq1_ID does not. Because both variables are declared in the input section, Snakemake treats fq1_ID as an input file too and searches for a file with that name in the working directory (the current directory if the -d option is not set). No such file exists there, hence the MissingInputException.
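To make the difference concrete, here is a minimal plain-Python sketch (not Snakemake itself) of how the two patterns resolve for sampleB. The helper names resolve_inputs and demo, and the temporary raw_data and workdir directories, are hypothetical stand-ins for the paths in the question; the lookup logic is a simplification of what Snakemake does.

```python
import os
import tempfile
from os.path import join


def resolve_inputs(sample, fastq_dir, work_dir, ext="fastq.gz"):
    """Return {input name: (path, exists)} for the two kinds of pattern."""
    patterns = {
        "fq1_ID": "{sample}_R1." + ext,                # no directory: relative path
        "fq1": join(fastq_dir, "{sample}_R1." + ext),  # full path to raw_data
    }
    resolved = {}
    for name, pattern in patterns.items():
        path = pattern.format(sample=sample)
        # A relative path is looked up from the working directory.
        full = path if os.path.isabs(path) else join(work_dir, path)
        resolved[name] = (path, os.path.exists(full))
    return resolved


def demo():
    """Build a throwaway layout: the fastq lives in raw_data, not in workdir."""
    with tempfile.TemporaryDirectory() as root:
        fastq_dir = join(root, "raw_data")
        work_dir = join(root, "workdir")
        os.makedirs(fastq_dir)
        os.makedirs(work_dir)
        open(join(fastq_dir, "sampleB_R1.fastq.gz"), "w").close()
        return resolve_inputs("sampleB", fastq_dir, work_dir)


if __name__ == "__main__":
    for name, (path, exists) in demo().items():
        print(name, path, "found" if exists else "MISSING")
```

In this layout fq1 is found, while fq1_ID resolves to the bare name sampleB_R1.fastq.gz, which does not exist in the working directory. That bare name is exactly what the MissingInputException lists.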

So removing fq1_ID and fq2_ID from the input section gets rid of the missing-file problem. The message directive refers to them as well, so it has to be updated accordingly.
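As a rough sketch of that change, the rule could look like the fragment below. It is not a standalone script: it assumes the rest of the Snakefile from the question (the config values and the join import). Using {wildcards.sample} in the message is my substitution for the removed fq1_ID and fq2_ID entries, and the version directive is left out only to keep the sketch short.

```snakemake
# Reads alignment on reference genome and BAM file creation
rule bwa_mem_to_bam:
    input:
        index = config["reference"]["index"],
        fasta = config["reference"]["fasta"],
        # Only paths that really exist on disk belong in the input section.
        fq1 = join(config["fastqDir"], "{sample}_R1." + config["fastqExtension"]),
        fq2 = join(config["fastqDir"], "{sample}_R2." + config["fastqExtension"])
    output:
        temp(join(config["outputDir"], "{sample}.bamUnsorted"))
    log:
        join(config["outputDir"], config["logDir"], "{sample}.bwa_mem.log")
    message:
        # The sample name comes from the wildcard, not from a fake input file.
        "Alignment of {wildcards.sample} reads on {input.fasta} with BWA."
    shell:
        "bwa mem {input.fasta} {input.fq1} {input.fq2} 2> {log} | samtools view -Sbh - > {output}"
```

The point of the design is that every entry under input is a file Snakemake must find or build, so purely descriptive values such as a sample label are better taken from wildcards or params.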

Hugo
