Snakemake deletes all output files that are marked as temporary, but it does nothing when the temporary output is a directory, as shown below:

    rule all:
        input:
            'final.txt',

    checkpoint split_big_file:
        input: 'bigfile.txt'
        output: temp(directory('split_files'))
        shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'

    rule copy_small_files:
        input: 'split_files/part_{num}'
        output: temp('copy_files/part_{num}.txt')
        shell: 'cp -f {input} {output}'

    def aggregate_input(wildcards):
        '''
        aggregate the file names of the random number of files
        generated at the scatter step
        '''
        checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
        print(checkpoint_output)
        agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
        print(agg_inp)
        return agg_inp

    rule merge_small_files:
        input: aggregate_input
        output: 'final.txt'
        shell: 'cat {input} > {output}'

When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine, but the split_files directory is not empty:

    $ wc -l final.txt
    61177 final.txt
    $ wc -l bigfile.txt
    61177 bigfile.txt
    $ ls copy_files/
    $ ls split_files/
    part_00 part_01 part_02 part_03 part_04
    part_05 part_06 part_07 part_08 part_09
    part_10 part_11 part_12

What I would like to see:

1. The copy_files directory should also be deleted (but apparently, since Snakemake cannot figure out whether that directory contains other files unrelated to Snakemake, it does not delete directories by default).
2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
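
To make the desired end state concrete, the cleanup I am after could be done by hand roughly as follows. This is only a minimal sketch, not something Snakemake does for me: the helper name `remove_dirs` is hypothetical, and it assumes the listed directories contain nothing besides the files produced by the workflow.

```python
import shutil
from pathlib import Path


def remove_dirs(*dirs):
    """Delete each given directory tree if it exists.

    Hypothetical helper (not part of Snakemake). Returns the list of
    directories that were actually removed.
    """
    removed = []
    for d in dirs:
        path = Path(d)
        # Skip anything that is missing or is not a directory.
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed
```

Presumably such a function could be called as `remove_dirs('split_files', 'copy_files')` from the Snakefile's `onsuccess:` handler, but I would prefer that `temp(directory(...))` took care of this on its own.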