3
votes

I was wondering if there is a way to have optional inputs in rules. An example case is excluding unpaired reads for alignment (or having only unpaired reads). A pseudo rule example:

rule hisat2_align:
    input:
        rU: lambda wildcards: ('-U '+ read_files[wildcards.reads]['unpaired']) if wildcards.read_type=='trimmed' else '',
        r1: lambda wildcards: '-1 '+ read_files[wildcards.reads]['R1'],
        r2: lambda wildcards: '-2 '+ read_files[wildcards.reads]['R2']
    output:
        'aligned.sam'
    params:
        idx: 'index_prefix',
        extra: ''
    shell:
        'hisat2 {params.extra} -x {params.idx} {input.rU} {input.r1} {input.r2}'

Here, not having trimmed reads (rU='') would result in missing input file error. I can go around this through a duplicate rule with adjusted input/shell statement or handling the input through params (i'm sure there are other ways). I'm trying to handle this neatly so that this step can be run through a snakemake wrapper (currently a custom one).

The closest example I've seen is on https://groups.google.com/d/msg/snakemake/qX7RfXDTDe4/XTMOoJpMAAAJ and Johannes' answer. But there we have a conditional assignment (eg. input: 'a' if condition else 'b') not an optional one.

Any help/guidance will be appreciated.

ps. optional input can help with varying number of hisat2 indexes as well (as noted here: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/hisat2.html).

EDIT

To clarify the potential inputs:

1) Use single-end reads alone and declare them in rU. Reads files for the sample might be

sample1_single_1.fastq.gz
sample1_single_2.fastq.gz

In this case r1 and r2 maybe empty lists or not declared at all in the rule.

2) Use paired-end reads and declared them in r1 and r2. Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz

In this case `rU`` maybe empty list or not declared at all in the rule.

3) Use paired and single-end reads together (e.g. output from trimmomatic where some pairs are broken). Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz
sample1_unpaired_1_R1.fastq.gz
sample1_unpaired_1_R2.fastq.gz
sample1_unpaired_2_R1.fastq.gz
sample1_unpaired_2_R2.fastq.gz

As a solution. I ended up using @timofeyprodanov approach. I didn't realize an empty list can be used for this. Thanks for all the answers and comments!

3

3 Answers

2
votes

One solution would be to pass the endedness info through the output filename or path. Something like the following should work with the existing wrapper:

def get_fastq_reads(wcs):
    if wcs.endedness == 'PE':  # Paired-end
        return ["reads/{wcs.sample}.1.fastq.gz", "reads/{wcs.sample}.2.fastq.gz"]

    if wcs.endedness == 'SE':  # Single-end
        return ["reads/{wcs.sample}.fastq.gz"]

    raise(ValueError("Unrecognized wildcard value for 'endedness': %s" % wcs.endedness))

rule hisat2:
    input:
      reads=get_fastq_reads
    output:
      "mapped/{sample}.{endedness}.bam"
    log:                                # optional
      "logs/hisat2/{sample}.{endedness}.log"
    params:                             # idx is required, extra is optional
      idx="genome",
      extra="--min-intronlen 1000"
    wildcard_constraints:
        endedness="(SE|PE)"
    threads: 8                          # optional, defaults to 1
    wrapper:
      "0.27.1/bio/hisat2"

With this single rule, one could then map reads/tardigrade.fastq.gz with

> snakemake mapped/tardigrade.SE.bam

or reads/tardigrade.{1,2}.fastq.gz with

> snakemake mapped/tardigrade.PE.bam

Note on the Index Note

I'm confused by the note on the index files, and think it may be wrong. HISAT2 doesn't accept files for that argument, but instead a single prefix that all index files have in common, so there should be only ever one argument for that. The example, idx="genome.fa", in the documentation is misleading. The index that results from building the toy reference (22_20-21M.fa) that comes with HISAT2 is 22_20-21M_snp.{1..8}.ht2, in which case one would use idx="22_20-21M_snp".

2
votes

I usually do it using expand with either empty or non-empty list:

rule a:
    input:
        expand('filename', proxy=[] if no_input else [None])
0
votes

I think mere's approach, i.e. to let the output file name contain the information about the endedness, is the most natural one in Snakemake. An alternative way that requires you to have duplicate rules but not conditional statements is to use the ruleorder directive, e.g. ruleorder: align_pe > align_se. Then the higher priority rule will be used if the optional input exists.