0
votes

I have a snakemake workflow, which fails, because the last job creates either two output files, or none. I tried solving it with checkpoint, but I think I am stuck with the wildcards when trying to collate the outfiles in the aggregate function.

The workflow (1) creates a fasta file from a biom community profile. Then runs an in silico PCR (2) on the fasta file, which creates a txt file as output.

The last step is the parser (3), which outputs a csv and a fasta file. However if there are no matches in the txt file (aka the insilico PCR yielded no results), then it doesn't create a csv or fasta file.

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta", sample = SAMPLES, id = ID)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

rule parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

The dry run is fine - no errors. But if I do the proper run, snakemake aborts it saying a job execution failed:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 31 of snakeflow/Snakefile:
Missing files after 5 seconds:
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.csv
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

I know I could either change the parser script to just create two empty files, but I don't want to create unnecessary files. I looked into dynamic, but that doesn't work with two potential output files, so I had a look at checkpoint. As I understand it, that should help me solve the problem.

Here is my attempt at using checkpoints:

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt", sample = SAMPLES, id = ID)
print(TARGETS)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

checkpoint parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.parse.get(**wildcards).output[0,1]
    return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv','output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', sample = wildcards.SAMPLES, id=wildcards.ID)

rule collect:
    input:
        aggregate_input
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt"
    shell:
        "cat {input} >> {output}"

and the error


<function aggregate_input at 0x7f63eade2158>
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
  File "snakeflow/Snakefile", line 52, in <module>

I believe this is because there is something wrong with how I am using wildcards in the aggregate function, but I can't figure it out. I have tried various versions that I have found in the checkpoint tutorials to no avail.

Any help much appreciated, thanks!

1

1 Answers

1
votes

I think the immediate error has to do with your expand:

return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv',
   'output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', 
   sample=wildcards.SAMPLES, id=wildcards.ID)

SAMPLES and ID should be lower case to match the wildcard names.

You will still run into the missing output exception because you are specifying output files that may not exist after running your script. You have to change the output of the checkpoint to a directory (specific for each sample, id) which will have 0 or 2 files. Then you can glob the contents of that directory in the input function to see which files exist.

For me, I would go with the empty file route. You get to avoid checkpoints and it keeps the rules cleaner. With empty files you can differentiate something that hasn't been run from something that had no results. Notice that if you use checkpoints, you end up with empty directories instead so you aren't avoiding the emtpy file problem entirely.

If you are worried about inodes or something else in your system, mark the outputs as temp and snakemake will clean them up after aggregation.