snakemake: optional input for rules

Question

I was wondering if there is a way to have optional inputs in rules. An example case is excluding unpaired reads for alignment (or having only unpaired reads). A pseudo rule example:

rule hisat2_align:
    input:
        rU: lambda wildcards: ('-U '+ read_files[wildcards.reads]['unpaired']) if wildcards.read_type=='trimmed' else '',
        r1: lambda wildcards: '-1 '+ read_files[wildcards.reads]['R1'],
        r2: lambda wildcards: '-2 '+ read_files[wildcards.reads]['R2']
    output:
        'aligned.sam'
    params:
        idx: 'index_prefix',
        extra: ''
    shell:
        'hisat2 {params.extra} -x {params.idx} {input.rU} {input.r1} {input.r2}'

Here, not having trimmed reads (rU='') would result in missing input file error. I can go around this through a duplicate rule with adjusted input/shell statement or handling the input through params (i'm sure there are other ways). I'm trying to handle this neatly so that this step can be run through a snakemake wrapper (currently a custom one).

The closest example I've seen is on https://groups.google.com/d/msg/snakemake/qX7RfXDTDe4/XTMOoJpMAAAJ and Johannes' answer. But there we have a conditional assignment (eg. input: 'a' if condition else 'b') not an optional one.

Any help/guidance will be appreciated.

ps. optional input can help with varying number of hisat2 indexes as well (as noted here: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/hisat2.html).

EDIT

To clarify the potential inputs:

1) Use single-end reads alone and declare them in rU. Reads files for the sample might be

sample1_single_1.fastq.gz
sample1_single_2.fastq.gz

In this case r1 and r2 maybe empty lists or not declared at all in the rule.

2) Use paired-end reads and declared them in r1 and r2. Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz

In this case `rU`` maybe empty list or not declared at all in the rule.

3) Use paired and single-end reads together (e.g. output from trimmomatic where some pairs are broken). Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz
sample1_unpaired_1_R1.fastq.gz
sample1_unpaired_1_R2.fastq.gz
sample1_unpaired_2_R1.fastq.gz
sample1_unpaired_2_R2.fastq.gz

As a solution. I ended up using @timofeyprodanov approach. I didn't realize an empty list can be used for this. Thanks for all the answers and comments!

merv merv · Accepted Answer · 2018-07-20T04:42:20

One solution would be to pass the endedness info through the output filename or path. Something like the following should work with the existing wrapper:

def get_fastq_reads(wcs):
    if wcs.endedness == 'PE':  # Paired-end
        return ["reads/{wcs.sample}.1.fastq.gz", "reads/{wcs.sample}.2.fastq.gz"]

    if wcs.endedness == 'SE':  # Single-end
        return ["reads/{wcs.sample}.fastq.gz"]

    raise(ValueError("Unrecognized wildcard value for 'endedness': %s" % wcs.endedness))

rule hisat2:
    input:
      reads=get_fastq_reads
    output:
      "mapped/{sample}.{endedness}.bam"
    log:                                # optional
      "logs/hisat2/{sample}.{endedness}.log"
    params:                             # idx is required, extra is optional
      idx="genome",
      extra="--min-intronlen 1000"
    wildcard_constraints:
        endedness="(SE|PE)"
    threads: 8                          # optional, defaults to 1
    wrapper:
      "0.27.1/bio/hisat2"

With this single rule, one could then map reads/tardigrade.fastq.gz with

> snakemake mapped/tardigrade.SE.bam

or reads/tardigrade.{1,2}.fastq.gz with

> snakemake mapped/tardigrade.PE.bam

Note on the Index Note

I'm confused by the note on the index files, and think it may be wrong. HISAT2 doesn't accept files for that argument, but instead a single prefix that all index files have in common, so there should be only ever one argument for that. The example, idx="genome.fa", in the documentation is misleading. The index that results from building the toy reference (22_20-21M.fa) that comes with HISAT2 is 22_20-21M_snp.{1..8}.ht2, in which case one would use idx="22_20-21M_snp".

snakemake: optional input for rules

EDIT

3 Answers

Note on the Index Note