This question follows on from a question I asked previously and it regards understanding how to access config files correctly using Snakemake. I have a specific problem I need to address which I'll ask first and a general problem understanding how indexing works which I'll ask second.
I'm using snakemake to run and ATAC-seq pipeline from Alignment/QC through to motif analysis.
A: Specific Question
I'm trying to add a rule called trim_galore_pe
to trim adapters from my fastq files before alignment and an error statement is thrown from snakemake as the names of the output files generated by trim galore
do not match what is expected by snakemake. This is because I cannot work out how to write the output file statement correctly in my snakemake file to make the names match.
An example of the names generated by TRIM GALORE
contain SRA numbers, for example:
trimmed_fastq_files/SRR2920475_1_val_1.fq.gz
Whereas the file expected by snakemake contain sample references and should read:
trimmed_fastq_files/Corces2016_4983.7B_Mono_1_val_1.fq.gz
This also affects the subsequent rules after the trim_galore_pe
rule. I need to work out a way to use the info in my config file to generate the output files required.
For all rules after those shown in the Snakefile I need files to be named by sample name i.e. Corces2016_4983.7A_Mono
. It would also be useful for all the FAST_QC
and MULTIQC
rules shown in the Snakefile below to have the sample names in the output file name structure, which they all already do in the current Snakefile.
However, inputs for Bowtie2, the FASTQC rules and input and output of the trim_galore_pe
rules need to contain the SRA numbers. The problem starts at the output of trim_galore
and influences all downstream rules.
Although I have extracted SRA numbers in previous rules, I'm not sure how to do this when not using the fastq_files
folder which is explicitly stated in the config file. By introducing the trim_galore_pe
rule I have effectively moved a new set of SRA files into the new trimmed_fastq_files
folder. How to extract only the SRA number from the list of SRA files config file containing the old folder names whilst referencing the new trimmed_fastq_files
folder in the Snakefile is the crux of my issue.
I hope this is clear.
Here is my config file:
samples:
Corces2016_4983.7A_Mono: fastq_files/SRR2920475
Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
Mono:
- Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
Here is my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/PRETRIM/fastq_multiqc.html",
"FastQC/POSTTRIM/fastq_multiqc.html"
rule fastqc_pretrim:
input:
sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_pretrim:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/PRETRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule trim_galore_pe:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz",
"trimmed_fastq_files/{sample}_1.fastq.gz_trimming_report.txt",
"trimmed_fastq_files/{sample}_2_val_2.fq.gz",
"trimmed_fastq_files/{sample}_2.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
rule fastqc_posttrim:
input:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_posttrim:
input:
expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/POSTTRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule bowtie2:
input:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.23.1/bio/bowtie2/align"
This currently runs, and give a full job list using snakemake -np
, but throws the error mentioned above.
B: General question
Is there an online resource that explains succinctly how to reference a config file using python, particularly with reference to snakemake? The online docs are pretty insufficient and assume prior knowledge of python.
My programming experience is mainly in bash and R but I like Snakemake and do generally understand how dictionaries and lists work in python and how to reference items stored within them. However I find the complex use of bracketing, wildcards and inverted commas in some of the Snakemake rules above confusing so tend to struggle when trying to reference different parts of file names in the config file. I want to understand fully how to utilise these elements.
For example, in a rule such as this from the Snakefile posted above:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
What is actually happening in this command? My understanding is that we are accessing the config file using config['samples']
and we are using the [wildcards.sample]
part to explicitly access the fastq_files/SRR2920475
part of the config file. The expand allows us to iterate through each item in the config file that fit the parameters in the command, i.e all the SRA files, and the lambda wildcards is needed to use the wildcards
call in the command. What I'm uncertain about is:
- What does the
f
do just after the expand and why is it needed? - Why does
config['samples']
contain inverted commas within the square brackets but but inverted commas are not needed around[wildcards.sample]
? - Why are the single and double curly brackets used?
- Looking at the Snakefile above, a few of the rules contain parts assigning a sequence of numbers to
num
, but again these numbers are sometimes enclosed around inverted commas and sometimes not...why?
Any advice, tips, pointers would be greatly appreciated.
C: Clarification on suggestions made below by @bli
I have edited my config file as you suggested in your comment and omitted the folder names leaving only the SRA numbers. This make sense to me, but I have a couple of other issues preventing me getting this Snakefile running.
New config file:
samples:
Corces2016_4983.7A_Mono: SRR2920475
Corces2016_4983.7B_Mono: SRR2920476
cell_types:
Mono:
- Corces2016_4983.7A
index: /home/c1477909/genomes_and_index_files/hg19
New Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/PRETRIM/fastq_multiqc.html",
"FastQC/POSTTRIM/fastq_multiqc.html",
rule fastqc_pretrim:
input:
lambda wildcards: f"fastq_files/{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_pretrim:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/PRETRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule trim_galore_pe:
input:
lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"trimmed_fastq_files/{wildcards.sample}_1_val_1.fq.gz",
"trimmed_fastq_files/{wildcards.sample}_1.fastq.gz_trimming_report.txt",
"trimmed_fastq_files/{wildcards.sample}_2_val_2.fq.gz",
"trimmed_fastq_files/{wildcards.sample}_2.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
rule fastqc_posttrim:
input:
lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_posttrim:
input:
expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/POSTTRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule bowtie2:
input:
lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.23.1/bio/bowtie2/align"
Using these new files everything worked fine initially, a partial job list was created by snakemake -np
. However, this is because half of the complete job list had already been run; that is the trimmed_fastq_files
folder was generated and the correctly named trimmed fastq files were in place within it. When I deleted all the previously created files to see if the entire new version of the Snakefile would work properly, snakemake -np
failed, stating that there were missing input files for the rules downstream of the trim_galore_pe
rule.
As you can see I'm trying to call the {wildcard.sample}
variable set in the input section of the trim_galore_pe
rule in the output section, but snakemake doesn't like this. Is is possible to do this?
I also tried this using the tips from the answers below but this didn't work either:
rule trim_galore_pe:
input:
sample=lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),
expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz_trimming_report.txt", num=[1,2])
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
The error then stated wildcards not defined
. So, logically I tried putting lambda wildcards:
in front of the two expand statements of the output section in an attempt to define the wildcards, but this threw a syntax error, Only input files can be specified as functions
. I also tried using some of the indexing suggestions below but couldn't get the right combination.
This is probably caused by another thing I'm unsure about regarding Snakefiles and that is how scoping works.
- If I define a variable in
rule all
can all other rules access it? - If I define a variable in the input section of a rule, is it available to all other sections of that rule (i.e. output, shell command etc.), but only that rule?
- If yes, why can't I access the
{wildcard.sample}
variable if I defined it in the input section? Is that because that variable is enclosed within a 'closed' scope lambda function?
Any (further) suggestions would be greatly appreciated.
output
section, you don't use{wildcards.sample}
, you use directly{sample}
. Snakemake already assumes you are talking about wildcards attributes. Actually, the wildcards attributes that exist in the scope of the rule are determined by parsing theoutput
section. In case this can be useful, I tried to write some explanations of my current understanding on the subject: bitbucket.org/blaiseli/snakemake/src/… – bli