I have a snakemake workflow, which fails, because the last job creates either two output files, or none. I tried solving it with checkpoint, but I think I am stuck with the wildcards when trying to collate the outfiles in the aggregate function.
The workflow (1) creates a fasta file from a biom community profile. Then runs an in silico PCR (2) on the fasta file, which creates a txt file as output.
The last step is the parser (3), which outputs a csv and a fasta file. However if there are no matches in the txt file (aka the insilico PCR yielded no results), then it doesn't create a csv or fasta file.
SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()
TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta", sample = SAMPLES, id = ID)
rule all:
input:
TARGETS
rule getgenome:
input:
"input/metaphlan/{sample}.biom"
output:
csv="output/metaphlan/fasta_dump/{sample}.csv",
fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
conda:
"envs/synth_genome.yaml"
shell:
"python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"
rule PCR:
input:
"output/metaphlan/fasta_dump/{sample}_dump.fasta"
output:
"output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
params:
id = "{id}"
shell:
"software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"
rule parse:
input:
"output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
output:
"output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
"output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
shell:
"python scripts/iPCRess_parser_v2.py {input} {output}"
The dry run is fine - no errors. But if I do the proper run, snakemake aborts it saying a job execution failed:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 31 of snakeflow/Snakefile:
Missing files after 5 seconds:
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.csv
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
I know I could either change the parser script to just create two empty files, but I don't want to create unnecessary files. I looked into dynamic, but that doesn't work with two potential output files, so I had a look at checkpoint. As I understand it, that should help me solve the problem.
Here is my attempt at using checkpoints:
SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()
TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt", sample = SAMPLES, id = ID)
print(TARGETS)
rule all:
input:
TARGETS
rule getgenome:
input:
"input/metaphlan/{sample}.biom"
output:
csv="output/metaphlan/fasta_dump/{sample}.csv",
fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
conda:
"envs/synth_genome.yaml"
shell:
"python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"
rule PCR:
input:
"output/metaphlan/fasta_dump/{sample}_dump.fasta"
output:
"output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
params:
id = "{id}"
shell:
"software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"
checkpoint parse:
input:
"output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
output:
"output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
"output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
shell:
"python scripts/iPCRess_parser_v2.py {input} {output}"
def aggregate_input(wildcards):
checkpoint_output = checkpoints.parse.get(**wildcards).output[0,1]
return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv','output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', sample = wildcards.SAMPLES, id=wildcards.ID)
rule collect:
input:
aggregate_input
output:
"output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt"
shell:
"cat {input} >> {output}"
and the error
<function aggregate_input at 0x7f63eade2158>
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
File "snakeflow/Snakefile", line 52, in <module>
I believe this is because there is something wrong with how I am using wildcards in the aggregate function, but I can't figure it out. I have tried various versions that I have found in the checkpoint tutorials to no avail.
Any help much appreciated, thanks!