I'm converting a bioinformatics pipeline to Snakemake and have a script which loops over M files (M=22, one per non-sex chromosome).

Each file essentially contains N label columns that I want as individual files. The Python script does this reliably. My issue is that if I give the Snakefile wildcards for the output (both chromosomes and labels), it runs the script MxN times, whereas I only want it to run M times.

I can circumvent the problem by only looking for one label file per chromosome, but this isn't in keeping with the spirit of Snakemake, and the next step in the pipeline requires input from all label files.

I've already tried using the checkpoint feature (which, as I understand it, re-evaluates the DAG after each rule is executed) to check the output, recognise that N files have been generated, and skip N jobs, but this crashes with an error. Because I know my labels beforehand, I shouldn't need checkpoint/dynamic as far as I understand; I just don't know exactly what I do need.

Is it possible to disable a wildcard from generating jobs and just check that the output is returned?

LABELS = ['A', 'B', 'C', 'D']
CHROMOSOMES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] 

rule all:
    input:
        "out/final.txt"

rule split_files: 
    '''
    Splits the chromosome files by label.
    '''
    input:
        "per_chromosome/myfile.{chromosome}.txt"
    output:
        "per_label/myfile.{label}.{chromosome}.txt"
    script:
        "scripts/split_files_snake.py"

rule make_out_file:
    '''
    Makes the final output file by checking each of the label.chromosome files one by one.
    '''
    input:
        expand("per_label/myfile.{label}.{chromosome}.txt",
            label=LABELS,
            chromosome=CHROMOSOMES)
    output:
        "out/final.txt"
    script:
        "scripts/make_out_file_snake.py"

1 Answer

If you wish to avoid the script being run N times per chromosome, you can list all the label files explicitly in the output, so that {chromosome} is the only wildcard left in the rule:

    output:
        "per_label/myfile.A.{chromosome}.txt",
        "per_label/myfile.B.{chromosome}.txt",
        "per_label/myfile.C.{chromosome}.txt",
        "per_label/myfile.D.{chromosome}.txt"
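
To see why this changes the number of jobs, here is a rough plain-Python sketch. It is a simplification rather than Snakemake's actual scheduling code, and it only relies on the fact that a rule is run once per distinct combination of its wildcard values. The names `targets`, `jobs_two_wildcards` and `jobs_one_wildcard` are made up for the illustration.

```python
from itertools import product

LABELS = ["A", "B", "C", "D"]
CHROMOSOMES = list(range(1, 23))

# Every file requested by make_out_file, as (label, chromosome) pairs.
targets = list(product(LABELS, CHROMOSOMES))

# Original rule: {label} and {chromosome} are both wildcards, so each
# distinct pair becomes its own job.
jobs_two_wildcards = {(label, chrom) for label, chrom in targets}

# Rewritten rule: only {chromosome} is a wildcard, and a single job
# produces all label files for that chromosome.
jobs_one_wildcard = {chrom for _, chrom in targets}

print(len(jobs_two_wildcards))  # 88, i.e. M x N
print(len(jobs_one_wildcard))   # 22, i.e. M
```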

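To make the code more generic you can use the expand function, but pay special attention to the braces in the format string: {label} is filled in by expand, while the doubled braces around chromosome are escaped so that {chromosome} remains a wildcard: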

    output:
        expand("per_label/myfile.{label}.{{chromosome}}.txt", label=LABELS)
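
If the doubled braces look odd, the effect can be reproduced outside Snakemake. The sketch below uses plain `str.format` as a stand-in for expand, on the assumption that the two treat `{{` and `}}` the same way; it is not Snakemake's own implementation, and `pattern` and `outputs` are names chosen for the illustration.

```python
LABELS = ["A", "B", "C", "D"]

# Doubled braces are literal braces to str.format, so {chromosome}
# survives as a wildcard while {label} is filled in.
pattern = "per_label/myfile.{label}.{{chromosome}}.txt"
outputs = [pattern.format(label=label) for label in LABELS]

for path in outputs:
    print(path)
# per_label/myfile.A.{chromosome}.txt
# per_label/myfile.B.{chromosome}.txt
# per_label/myfile.C.{chromosome}.txt
# per_label/myfile.D.{chromosome}.txt
```

The result is the same four patterns as in the hand-written output list, each still carrying the {chromosome} wildcard.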