Snakemake is super-confusing to me. I have files of the form:

indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out

where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:

outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out

How do I write a snakemake rule to do this? I first tried the following:

rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"

# example command: snakemake --cores 1 outdir/type/name/name_1-1.out

This worked, but it doesn't save me any effort: I have to know the output files ahead of time, so I'd basically have to pass every output file as an argument to snakemake, which takes a bit of shell trickery to build the list.

So then I tried using directory() (and gave up on preserving runno):

rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        for d in {input}/run_*; do
          i=0
          for f in ${{d}}/*processed.out; do
            cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
          done
          let ++i
        done
        """

This gave me the error "Wildcards in input files cannot be determined from output files: 'nameno'". I get it: {nameno} doesn't exist in the output. But I don't want it in the directory name, only in the names of the files that get copied.

Also, if I delete {nameno}, then it complains because it can't find the right input file.

What are the best practices here for what I'm trying to do? Also, how does one wrap their head around the fact that in snakemake, you specify outputs, not inputs? I think this latter fact is what is so confusing.

1 Answer

I guess what you need is the expand function:

rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
                  type=TYPES,
                  name=NAMES,
                  nameno=NAME_NUMBERS,
                  runno=RUN_NUMBERS)

TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are lists of all possible values for these wildcards. You either need to hardcode them or use the glob_wildcards function to collect them from the existing input files:

TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
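
To make the shape of that result concrete, here is a rough plain-Python sketch of the idea, not Snakemake's actual implementation. It assumes the four paths from the question, and the regex and the helper name collect_wildcards are hand-written stand-ins made up for illustration. It shows that you get four parallel lists with one entry per matched file, so values repeat.

```python
import re

# Paths copied from the question; a real workflow would find them on disk.
paths = [
    "indir/type/name_1/run_1/name_1_processed.out",
    "indir/type/name_1/run_2/name_1_processed.out",
    "indir/type/name_2/run_1/name_2_processed.out",
    "indir/type/name_2/run_2/name_2_processed.out",
]

# Hand-written regex standing in for the wildcard pattern (an assumption for
# this sketch). The backreferences force the repeated {name} and {nameno}
# parts of the path to agree with their first occurrence.
PATTERN = re.compile(
    r"indir/(?P<type>[^/]+)/(?P<name>[^/]+)_(?P<nameno>\d+)"
    r"/run_(?P<runno>\d+)/(?P=name)_(?P=nameno)_processed\.out"
)


def collect_wildcards(paths):
    """Return four parallel lists, one entry per path that matches."""
    types, names, namenos, runnos = [], [], [], []
    for path in paths:
        match = PATTERN.fullmatch(path)
        if match is None:
            continue  # skip anything that does not fit the layout
        types.append(match["type"])
        names.append(match["name"])
        namenos.append(match["nameno"])
        runnos.append(match["runno"])
    return types, names, namenos, runnos


TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = collect_wildcards(paths)
print(TYPES)         # one "type" entry per file
print(NAME_NUMBERS)  # name numbers repeat, once per run
print(RUN_NUMBERS)
```

Position i in each list belongs to the same file, which is worth keeping in mind before de-duplicating the lists independently.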

This, however, gives you duplicates, because each list contains one entry per matched file. If that is not desirable, remove the duplicates:

TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
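
One caveat: by default expand builds every combination of the values it is given, so de-duplicated lists can produce targets that have no matching input file. The plain-Python sketch below illustrates the difference between combining and pairing. It assumes a hypothetical fifth sample ("other", name number 1, run 1) that is not in the question, and the function names are made up for illustration. In a Snakefile, pairing corresponds to passing zip as the second argument, e.g. expand(pattern, zip, type=TYPES, ...), using the lists without the set() step.

```python
from itertools import product

# Parallel lists as collected from disk, one entry per input file. The last
# entry ("other", "1", "1") is a hypothetical extra sample added for this sketch.
TYPES = ["type", "type", "type", "type", "type"]
NAMES = ["name", "name", "name", "name", "other"]
NAME_NUMBERS = ["1", "1", "2", "2", "1"]
RUN_NUMBERS = ["1", "2", "1", "2", "1"]

TEMPLATE = "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"


def targets_product(types, names, namenos, runnos):
    """Every combination of the de-duplicated values (the default behaviour)."""
    combos = product(
        sorted(set(types)), sorted(set(names)),
        sorted(set(namenos)), sorted(set(runnos)),
    )
    return [
        TEMPLATE.format(type=t, name=n, nameno=k, runno=r)
        for t, n, k, r in combos
    ]


def targets_zip(types, names, namenos, runnos):
    """One target per observed input file (what pairing with zip gives)."""
    return [
        TEMPLATE.format(type=t, name=n, nameno=k, runno=r)
        for t, n, k, r in zip(types, names, namenos, runnos)
    ]


all_combos = targets_product(TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS)
observed = targets_zip(TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS)

# Targets requested by the product that have no corresponding input file.
missing = sorted(set(all_combos) - set(observed))
print(len(all_combos), len(observed))
print(missing)
```

If every type/name/run combination really exists on disk, both approaches give the same targets. Otherwise the combinations without an input file would make Snakemake report missing input files, and pairing avoids that.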