3
votes

I have a list of samples going through Snakemake. When I arrive at my fastqc step, I suddenly have two files per sample (an R1 and R2 file). Consider the following rule:

rule fastqc:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
        os.path.join(fastq_dir, '{sample}_R2_fastq.html')
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''

This does not work. I also tried the following:

rule fastqc:
    input:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2']
    output:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_fastq.html')], read=['1', '2']
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''

Which also does not work, I get:

No values given for wildcard 'sample'.

Then I tried:

rule fastqc:
    input:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2'], sample=samples['samples'])
    output:
        expand([os.path.join(fastqc_dir, '{sample}_R{read}_fastqc.html')], read=['1', '2'], sample=samples['samples'])
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''

But this feeds all fastq files there are into one shell script, it seems.

How should I properly "loop" over multiple inputs for 1 sample?

Highest regards.

Edit:

My rule all looks like this, probably I should also change that, right (see the last 2 lines for fastqc)?

# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
    input:
        expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
                os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
                os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
                os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
                os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
                os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
                os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.junction.xls'),
                os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
                os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
                os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.GC.xls'),
                os.path.join(rseqc_dir, '{sample}.NVC.xls'),
                os.path.join(rseqc_dir, '{sample}.qual.r'),
                os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
                os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
                os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
                os.path.join(fastq_dir, '{sample}_R2_fastq.html')],
                sample=samples['samples'])
1
Didn't know that existed, I added bioinformatics as a tag to all my posts, can you use that to port?Freek
Unfortunately no. Let’s leave it here for now, that’s fine too. There’s simply a higher chance of being seen by somebody knowledgeable on the other site.Konrad Rudolph
Ok, I'll post there from now on. As long as @JeeYem visits it, I should be fine ;)Freek

1 Answers

2
votes

Yes, this one I figured out by "myself". The magic is in the "rule all" part.

This combination of rules works:

reads = ['1', '2']

# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
    input:
        expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
                os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
                os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
                os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
                os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
                os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
                os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.junction.xls'),
                os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
                os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
                os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.GC.xls'),
                os.path.join(rseqc_dir, '{sample}.NVC.xls'),
                os.path.join(rseqc_dir, '{sample}.qual.r'),
                os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
                os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
                os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')],
                sample=samples['samples'], read=reads)

Notice the simple addition of {read} to the otherwise the same fastqc part and the definition or "reads" at the top (samples is a standard samples list).

The I use this fastqc rule:

rule fastqc:
    input:
       os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')
    output:
        os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''

It has the same line as the "rule all" (as usual). This works, thanx for the upvotes, Freek out.