
In my workflow graph, I merge one or more input files based on a suffix. When there is only one file for a given suffix, the merge operation is trivial and can be done locally. When there are multiple files to merge, the merge operation needs to run on the cluster and takes hours.

Is there any way to specify or select which rule should be used, given that I know all the input files ahead of time? In other words, is there a way to express rule precedence based on the number of inputs to a rule?

def _all_files_for_the_sample(wildcards):
    # lookup all known files and return list
    # of files matching wildcards.sample
    ...

# these two rules effectively have the same structure.
# I am omitting the implementation
rule super_fast_local_merge:
    input: _all_files_for_the_sample
    output: "merged_{sample}.txt"
    ...

rule super_slow_merge:
    input: _all_files_for_the_sample
    output: "merged_{sample}.txt"
    ...

Now, I also have rules that perform computation on the output of either of the rules above. The manual mentions that when chaining rules together, it is better to refer to a rule's output via the global rules object (e.g. writing rules.super_slow_merge.output instead of duplicating "merged_{sample}.txt" in another rule). This led me to believe that by aliasing a particular rule's output, I could influence the shape of the graph:

def _choose_merged_file(wildcards):
    all_inputs = _all_files_for_the_sample(wildcards)
    if len(all_inputs) <= 1:
        # use the trivial merge
        return rules.super_fast_local_merge.output
    else:
        # fall back to the slow merge
        return rules.super_slow_merge.output

rule work_on_merged_data:
    input: _choose_merged_file
    output: "final_result_{sample}.txt"
    ...

If I run something like the above, Snakemake complains that the rules are ambiguous. Is there any way to modify the _choose_merged_file input function to overcome this limitation? Or is there a different way to alias the rule I want directly?

Note: I've managed to get something working by making each implementation produce a different filename (e.g. merged_slow_{sample}.txt and merged_trivial_{sample}.txt), but doing so essentially taints every rule that works on merged data with tedious input functions, as sketched below.
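For reference, the workaround looks roughly like this: super_fast_local_merge declares output "merged_trivial_{sample}.txt", super_slow_merge declares "merged_slow_{sample}.txt", and then every downstream rule needs an input function along these lines (just a sketch, with the helper name _choose_merged_file_workaround made up for illustration and the implementations still omitted):

def _choose_merged_file_workaround(wildcards):
    # pick the concrete filename produced by whichever merge rule applies
    all_inputs = _all_files_for_the_sample(wildcards)
    if len(all_inputs) <= 1:
        return "merged_trivial_{sample}.txt".format(sample=wildcards.sample)
    return "merged_slow_{sample}.txt".format(sample=wildcards.sample)

rule work_on_merged_data:
    input: _choose_merged_file_workaround
    output: "final_result_{sample}.txt"
    ...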

If anyone can provide a recipe for changing the workflow graph dynamically, that would be great.


1 Answer


I am sorry to say that there is currently no reasonable way to achieve this. However, there will be a solution in the near future: the job grouping feature. It will allow you, based on the wildcards, input files and so on, to decide whether connected jobs shall be submitted together (to the same node on the cluster). This way you could group the fast merge job together with a longer-running job.
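To give a rough idea of what that could look like (the directive name and exact syntax below are an assumption, since the feature is not released yet), you would tag the connected rules with a common group, and jobs from the same group and sample would then be submitted to the cluster as one unit:

rule super_fast_local_merge:
    input: _all_files_for_the_sample
    output: "merged_{sample}.txt"
    group: "merge_and_process"  # hypothetical group tag, shared with the downstream rule
    ...

rule work_on_merged_data:
    input: rules.super_fast_local_merge.output
    output: "final_result_{sample}.txt"
    group: "merge_and_process"  # same group: submitted together with the merge job
    ...

The fast merge would then effectively run on the same node as the longer job instead of being scheduled as a separate cluster job.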