I have a large number of input files organized like this:
data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
│   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
│   :
│   └── asdfX_R2.fq.gz
:
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz
If you are familiar with bioinformatics: these are FASTQ files from paired-end sequencing runs. I am trying to write a Snakemake workflow that reads all of them, and I am already failing at the first rule. This is my attempt:
configfile: "config.yaml"

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])

rule clip_and_trim_reads:
    input:
        read1=expand("{data}/{set}/{{sample}}_R1.fq.gz", data=config["raw_data"], set=config["sets"]),
        read2=expand("{data}/{set}/{{sample}}_R2.fq.gz", data=config["raw_data"], set=config["sets"])
    output:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])
    threads: 16
    shell:
        """
        someTool -o {output.read1} -p {output.read2} \
        {input.read1} {input.read2}
        """
I cannot specify clip_and_trim_reads as a target, because "Target rules may not contain wildcards." I tried adding the all rule, but this gives me another error:
$ snakemake -np
Building DAG of jobs...
WildcardError in line 3 of /work/project/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'
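As far as I understand it, the doubled braces are an escape: once expand() has filled in {output}, a literal {sample} placeholder is left over, and rule all has no output file from which to infer it. Here is a minimal sketch of that substitution in plain Python, using str.format to mimic the behaviour instead of calling Snakemake. The config values are made up for illustration and are not my real config.yaml.

```python
from itertools import product

# Hypothetical stand-ins for the values in config.yaml (assumptions for illustration).
config = {
    "output": "results",
    "raw_data": "data/raw_data",
    "sets": ["set1", "another_set"],
}

# Doubled braces are an escape: only {output} is filled in,
# and {sample} survives as an unresolved placeholder.
target = "{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz".format(
    output=config["output"]
)
print(target)

# Expanding over a list of sets yields one path per set for the same sample,
# so the rule would ask for that sample in every set at once.
pattern = "{data}/{set}/{{sample}}_R1.fq.gz"
inputs = [
    pattern.format(data=data, set=set_name)
    for data, set_name in product([config["raw_data"]], config["sets"])
]
print(inputs)
```

If that reading is right, it would also explain why every set directory shows up in the missing-input list for a single job.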
I also tried using the dynamic() function for the all rule, which weirdly did find the files, but still gave me this error:
$ snakemake -np
Dynamic output is deprecated in favor of checkpoints (see docs). It will be removed in Snakemake 6.0.
Building DAG of jobs...
MissingInputException in line 7 of /work/project/ladsie_002/analyses/scripts/2019-08-02_splice_leader_HiC/Snakefile:
Missing input files for rule clip_and_trim_reads:
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
[...]
I have over a hundred different files, so I would very much like to avoid creating a list with all filenames. Any ideas how to achieve that?
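To make concrete what I am after, here is a rough sketch in plain Python of collecting the (set, sample) pairs from the directory layout instead of typing them out. The helper name discover_samples and the regular expression are my own placeholders, and I am assuming that every sample has an _R1.fq.gz file exactly one directory below the data root. Snakemake's glob_wildcards() might do the same kind of pattern matching inside a Snakefile; the sketch below uses only the standard library.

```python
import re
from pathlib import Path

# Matches "<sample>_R1.fq.gz"; the R2 file is assumed to sit next to it.
R1_PATTERN = re.compile(r"^(?P<sample>.+)_R1\.fq\.gz$")


def discover_samples(data_root):
    """Return sorted (set, sample) pairs found one directory below data_root."""
    pairs = []
    for path in Path(data_root).glob("*/*_R1.fq.gz"):
        match = R1_PATTERN.match(path.name)
        if match:
            # The parent directory name is the set, the captured group is the sample.
            pairs.append((path.parent.name, match.group("sample")))
    return sorted(pairs)
```

Calling discover_samples("data") on the layout above should give pairs such as ("set1", "file1") and ("another_set", "asdf1"), which is the list I would like the workflow to build for me.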