Snakemake: Avoid removing output files before executing the shell command

Question

Is there a possibility to avoid that the output files defined in a snakemake rule are deleted before executing the shell command? I found a description of this behaviour here: http://snakemake.readthedocs.io/en/stable/project_info/faq.html#can-the-output-of-a-rule-be-a-symlink

What I am trying to do is definining a rule for a list of input and a list of output files (N:M relation). This rule should be triggered if one of the input files has changed. The python script which is called in the shell command then creates only those output which do not exist or whose content has changed in comparison to the already existing files (i.e. a change detection is implemented inside the python script). I expected that something like the following rule should solve this, but as the output.jsons are deleted before running the python script, all output.jsons will be created with a new timestamp instead of only those which have changed.

rule jsons:
"Create transformation files out of landmark correspondences."
input:
    matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
output:
    jsons = ["transformation/{section}_transformation.json".format(section=s) for s in SECTIONS]
shell:
    "python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {output.jsons}"

If there is no possibility to avoid the deletion of output files in Snakemake, does anybody has another idea how to map this workflow into a snakemake rule without updating all output files?

Update:

I tried to solve this problem by changing the Snakemake source code. I removed the line self.remove_existing_output() in jobs.py to avoid removing output files before executing a rule. Furthermore, I added the parameter no_touch=True when self.dag.check_and_touch_output() is called in executors.handle_job_success. This worked great as the output files now were neither removed before nor touched after the rule is executed. But following rules with json files as input are still triggered for each json file (even if it did not change) as Snakemake recognizes that the json file was defined as an output before and theremore must have been changed. So I think avoiding the deletion of output files does not solve my problem, maybe a workaround - if existing - is the only way...

Update 2:

I also tried to find a workaround without changing the Snakemake source code by changing the output path of the above defined jsons rule to transformation/tmp/... and adding the following rule:

def cmp_jsons(wildcards):
    section = int(wildcards.section)
    # compare json for given section in transformation/ with json in transformation/tmp/
    # return [] if json did not change
    # return path to tmp json filename if json has changed
rule copy:
    input:
        json_tmp = cmp_jsons
    output:
        jsonfile = "transformation/B21_{section,\d+}_affine_transformation.json"
    shell:
        "cp {input.json_tmp} {output.jsonfile}"

But as the input function is evaluated before the workflow starts, the tmp-jsons are either not yet existing or not yet updated by the jsons rule and therefore the comparison won't be correct.

Foldager Foldager · Accepted Answer · 2018-03-19T19:25:30

This is a bit more involved, but I think it would work seamlessly for you.

The solution involves calling snakemake twice, but you can wrap it up in a shell script. In the first call you use snakemake in --dryrun to figure out which jsons will be updated, and in the second call this info is used to make the DAG. I use --config to switch between the two modes. Here is the Snakefile.

def get_match_files(wildcards):
    """Used by jsons_fake to figure which match files each json file depend on"""
    section = wildcards.section

    ### Do stuff to figure out what matching files this json depend on
    # YOUR CODE GOES HERE
    idx = SECTIONS.index(int(section)) # I have no idea if this is what you need
    matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[idx], SECTIONS[idx + 1])]

    return matchfiles

def get_json_output_files(fn):
    """Used by jsons. Read which json files will be updated from fn"""
    try:
        json_files = []
        with open(fn, 'r') as fh:
            for line in fh:
                if not line:
                    continue  # skip empty lines
                split_line = line.split(maxsplit=1)
                if split_line[0] == "output:":
                    json_files.append(split_line[1])  # Assumes there is only 1 output file pr line. If more, modify.
    except FileNotFoundError:
        print(f"Warning, could not find {fn}. Updating all json files.")
        json_files = expand("transformation/{section}_transformation.json", section=SECTIONS)

    return json_files


if "configuration_run" in config:
    rule jsons_fake:
        "Fake rule used for figuring out which json files will be created."
        input:
            get_match_files
        output:
            jsons = "transformation/{section}_transformation.json"
        run:
            raise NotImplementedError("This rule is not meant to be executed")

    rule jsons_all:
        input: expand("transformation/{s}_transformation.json", s=SECTIONS]

else:
    rule jsons:
        "Create transformation files out of landmark correspondences."
        input:
            matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
        output:
            jsons = get_json_output_files('json_dryrun') # This is called at rule creation
        params:
            jsons=expand("transformation/{s}_transformation.json", s=SECTIONS]
        run:
            shell("python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}")

To avoid calling Snakemake twice you can wrap it in a shell script, mysnakemake

#!/usr/bin/env bash

snakemake jsons_all --dryrun --config configuration_run=yes | grep -A 2 'jsons_fake:' > json_dryrun
snakemake $@

And call the script like you would normally call snakemake, eg: mysnakemake all -j 2. Does this work for you? I haven't tested all parts of the code, so take it with a grain of salt.

Snakemake: Avoid removing output files before executing the shell command

2 Answers