In my workflow graph, I merge one or more input files based on a suffix. In the cases where there is only one file for a given suffix, the merge operation is trivial, and can be done locally. In the case where there are multiple such files to merge, the merge operation needs to run on the cluster and takes hours.
Is there any way to specify/select that a particular rule should be used, if I know of all the input files ahead of time? In other words, is there a way to express rule precedence based on the number of inputs to a rule?
def _all_files_for_the_sample(wildcards):
    # lookup all known files and return list
    # of files matching wildcards.sample
    ...
# these two rules effectively have the same structure.
# I am omitting the implementation
rule super_fast_local_merge:
    input: _all_files_for_the_sample
    output: "merged_{sample}.txt"
    ...
rule super_slow_merge:
    input: _all_files_for_the_sample
    output: "merged_{sample}.txt"
    ...
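To make the selection criterion concrete, here is a minimal sketch in plain Python of splitting the samples by input count. FILES_BY_SAMPLE, constraint_for and partition_by_input_count are hypothetical names standing in for my real lookup, and the sample names are made up. The idea I have not verified in this workflow is that the two resulting regexes could be used as per-rule wildcard_constraints, so that only one rule can match a given sample.

```python
import re

# Hypothetical lookup table: sample name -> all known input files.
# In the real workflow this would come from the same source as
# _all_files_for_the_sample.
FILES_BY_SAMPLE = {
    "sampleA": ["sampleA_part1.txt"],
    "sampleB": ["sampleB_part1.txt", "sampleB_part2.txt", "sampleB_part3.txt"],
    "sampleC": ["sampleC_part1.txt"],
}

# Empty negative lookahead: a regex that never matches anything.
# Used when one of the two groups has no samples at all.
NEVER_MATCH = r"(?!)"


def constraint_for(samples):
    """Build a regex alternation matching exactly the given sample names."""
    if not samples:
        return NEVER_MATCH
    return "|".join(re.escape(name) for name in sorted(samples))


def partition_by_input_count(files_by_sample):
    """Split samples into (trivial, slow) by their number of input files."""
    trivial = [s for s, files in files_by_sample.items() if len(files) <= 1]
    slow = [s for s, files in files_by_sample.items() if len(files) > 1]
    return trivial, slow


TRIVIAL_SAMPLES, SLOW_SAMPLES = partition_by_input_count(FILES_BY_SAMPLE)
TRIVIAL_CONSTRAINT = constraint_for(TRIVIAL_SAMPLES)
SLOW_CONSTRAINT = constraint_for(SLOW_SAMPLES)
```

Under that assumption, super_fast_local_merge would carry wildcard_constraints: sample=TRIVIAL_CONSTRAINT and super_slow_merge would carry wildcard_constraints: sample=SLOW_CONSTRAINT, while both keep the same merged_{sample}.txt output.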
Now, I also have rules which perform computation on the outputs of either of the rules above. The manual mentions that when linking multiple chains of rules, it is more efficient to refer to the symbols from the rules global directly (e.g. stating rules.super_slow_merge.output as opposed to duplicating merged_{sample}.txt in a different rule). I was led to believe that by aliasing a particular rule's output, I would be able to influence the shape of the graph:
def _choose_merged_file(wildcards):
    # pass the wildcards object itself, as _all_files_for_the_sample expects
    all_inputs = _all_files_for_the_sample(wildcards)
    if len(all_inputs) <= 1:
        # use trivial merge
        return rules.super_fast_local_merge.output
    else:
        # fallback to slow merge
        return rules.super_slow_merge.output
rule work_on_merged_data:
    input: _choose_merged_file,
    output: "final_result_{sample}.txt"
    ...
If I run something like the above, Snakemake complains that the rules are ambiguous. Is there any way to modify the _choose_merged_file input function to overcome this limitation? Is there a different way to alias the rule I want directly?
Note: I've managed to get something working by making each implementation return a different filename (e.g. merged_slow_{sample}.txt and merged_trivial_{sample}.txt), but doing so essentially taints every rule that works on merged data with tedious input functions.
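For reference, the workaround looks roughly like the sketch below. This is plain Python so it can be run on its own: FILES_BY_SAMPLE and merged_file_for are hypothetical names, the sample data is made up, and SimpleNamespace only stands in for the wildcards object Snakemake would pass to an input function.

```python
from types import SimpleNamespace

# Hypothetical lookup table: sample name -> all known input files.
FILES_BY_SAMPLE = {
    "sampleA": ["sampleA_part1.txt"],
    "sampleB": ["sampleB_part1.txt", "sampleB_part2.txt"],
}


def merged_file_for(wildcards):
    """Return the merged filename produced by whichever merge rule applies.

    Samples with at most one input map to the trivial merge output,
    all others map to the slow merge output.
    """
    n_inputs = len(FILES_BY_SAMPLE.get(wildcards.sample, []))
    if n_inputs <= 1:
        return f"merged_trivial_{wildcards.sample}.txt"
    return f"merged_slow_{wildcards.sample}.txt"


# Stand-ins for the wildcards objects Snakemake would supply.
single = SimpleNamespace(sample="sampleA")
multiple = SimpleNamespace(sample="sampleB")

print(merged_file_for(single))
print(merged_file_for(multiple))
```

Every downstream rule then needs input: merged_file_for (or a variant of it) instead of a plain filename pattern, which is the part I would like to avoid.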
If anyone can provide a recipe for changing the workflow graph dynamically, that would be great.