I want to loop snakemake above two different wildcards, which - I think - are somehow independent from each other.
In case that there is already a solved threat for this case, I would be happy for a hint. But so far I'm not sure what the correct terms are to look for what I want to do.
Let's assume my pipeline has three steps. I have a set of samples which I process in each of those three steps. Put in the second step I deploy an extra parameter to every sample. In the third step now I have to iterate through the samples and the associated parameter. Because of this structure, I think it's not possible to solve this with a dictionary structure.
To picture the case I made this simplification of my files and rules: The config file:
config.yaml
samples:
- a
- b
- c
- d
threshhold:
- 0.5
- 0.1
- 0.2
A scheme of my snakefile. It pictures the exact structure and naming of snakemake operations. At least the naming is simplified. (I added the tools I actually use in brackets but I think that's not essential for understanding.)
rule all:
input:
expand("{sample}.bam", sample=config["samples"]),
expand("{sample}_{param}.bed", sample=config["samples"], param=config["threshhold"])
rule first: # (samtools view)
input:
"{sample}.sam"
output:
"{sample}.bam"
shell:
"<somecommand> {input} {output}"
rule second: # ( macs2 callpeaks; Of course, there are multiple outputs but they only vary in their suffix))
input:
"{sample}.bam"
output:
"{sample}_{param}.bed"
params:
out_name="{sample}",
threshhold="{param}"
shell:
"<somecommand> {input} -n {params.names} -q {params.threshhold}"
So now I have a list of files looking like this:
- a_0.5.bed
- a_0.1.bed
- a_0.2.bed
- b_0.5.bed
- b_0.1.bed
- b_0.2.bed
- ...
In my third rule I want to do intersections of different samples with the same param. Like: a_0.5.bed x b_0.5.bed and c_0.5.bed x d_0.5.bed and get the output like ab_0.5.bed, ab_0.1.bed, cd_0.5.bed...
My first attempt was this:
rule all:
input:
expand("ab_{param}.bed", param=config["threshhold"])
rule intersect_2: # (bedtools intersect)
input:
a=expand("{sample_a}_{param}_peaks.narrowPeak", sample_a=config["samples"][0], param=config["threshhold"]),
b=expand("{sample_b}_{param}_peaks.narrowPeak", sample_b=config["samples"][1], param=config["threshhold"])
output:
ab="intersect/ab_{param}.bed"
params:
threshhold="{param}"
shell:
"bedtools intersect -u -a {input.a} -b {input.b} > {output.ab}"
Well this doesn't work because now the input are all the different parameters files at once.
I think I need more and a different loop structure here. Maybe even some extra python loops around the rule or something? But since I have no programming experience at all and I'm just getting started to go into those things step by step, I couldn't figure out by now where to start or which loop is required for this.
Summary: With the given config file I'd like to archive a folder wich is filled with different combinations of the samples with the same parameter. So to end up with a list which looks like this:
- ab_0.5.bed
- ba_0.5.bed
- cb_0.5.bed
- ca_0.5.bed
- abc_0.5.bed
- bca_0.5.bed
- cba_0.5.bed
And those combinations too, for all the other parameters.
I would really appreciate any help and every hint that helps to understand what I exactly want to do there and how I can build this.
EDIT: Would maybe a completely restructured config file would help? Where the samples are already pre-combined? Maybe like this: (let's assume that s1, s2 and so on are standing for the real (and long) sample name)
config.yaml
samples_combinations:
- s1 : s2
- s3 : s2
- s3 : s1
I still have to rename it... But I don't really like the idea. My aim is to build something which is easily applicable and straight forward without as much manual refinement, Especially because you can have way more than just three samples in this case which have to be combined in multiple ways.