Snakemake deletes all output files that are marked as temporary, but it does nothing when the temporary output is a directory, as shown below:

rule all:
    input:
        'final.txt',

checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e {input} {output}/part_'

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'

def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp

rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'

When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.

$ wc -l final.txt 
  61177 final.txt
$ wc -l bigfile.txt 
  61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
  part_00  part_01  part_02  part_03  part_04  
  part_05  part_06  part_07  part_08  part_09  
  part_10  part_11  part_12

What I would like to see:

  1. The copy_files directory should also be deleted (but apparently, since Snakemake cannot tell whether that directory contains other files unrelated to Snakemake, it does not delete directories by default).
  2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
1 Answer

I cannot reproduce it with this example:

rule all:
    input:
        "a.txt"

rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"

rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"

Which version of Snakemake are you using? Is output_dir perhaps listed under rule all? Snakemake treats the inputs of the first rule (probably rule all) as the output you want to keep, so it won't delete those files. Removing output_dir from rule all would solve this issue.

However, I am just guessing, since you didn't provide a minimal reproducible example.

edit

Hmm... That should work! Here are two non-ideal solutions I could come up with:

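We can trick Snakemake into re-evaluating the DAG and then deleting the folder, like this. However, I am not sure whether the files get deleted early enough for you (they might be very large): since the folder is now an input of the last rule, it can only be removed once that rule has finished.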

rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'

Or just delete each file after copying it. However, you will end up with an empty folder:

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'

You can of course combine both solutions and have the best of both worlds. Unfortunately, it is not very pretty to look at :(
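If a leftover empty folder bothers you, one possible cleanup is a small Python helper, since a Snakefile can contain ordinary Python. This is only a sketch: the name remove_leftover_dir and its part_ prefix default are my own assumptions for illustration, not something Snakemake provides. It deletes the part_* files and then removes the folder only if nothing else is left in it, so unrelated files are never touched.

```python
from pathlib import Path


def remove_leftover_dir(directory, prefix="part_"):
    """Delete files named <prefix>* in `directory`, then remove it if empty.

    Hypothetical helper (not part of Snakemake). Returns True if the
    directory no longer exists afterwards, False if it was kept.
    """
    path = Path(directory)
    if not path.is_dir():
        return True  # nothing to clean up

    # Remove only the files produced by the split/copy rules.
    for child in path.glob(prefix + "*"):
        if child.is_file():
            child.unlink()

    try:
        path.rmdir()  # succeeds only if the directory is empty
    except OSError:
        return False  # unrelated files remain, so leave the directory alone
    return True
```

I have not tested this inside your workflow, but I would expect calling remove_leftover_dir('split_files') and remove_leftover_dir('copy_files') from an onsuccess: handler at the end of the Snakefile to get rid of both folders once the run has finished.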