0
votes

I tried to create a snakemake file to run sortmeRNA pipeline:

SAMPLES = ['test']
READS=["R1", "R2"]

rule all:
    input: expand("Clean/4.Unmerge/{exp}.non_rRNA_{read}.fastq", exp = SAMPLES, read = READS)

rule unzip:
    input: 
        fq = "trimmed/{exp}.{read}.trimd.fastq.gz"
    output: 
        ofq = "Clean/1.Unzipped/{exp}.{read}.trimd.fastq"
    shell: "gzip -dkc < {input.fq} > {output.ofq}"

rule merge_paired:
    input: 
        read1 = "Clean/1.Unzipped/{exp}.R1.trimd.fastq",
        read2 = "Clean/1.Unzipped/{exp}.R2.trimd.fastq"
    output: 
        il = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    shell: "merge-paired-reads.sh {input.read1} {input.read2} {output.il}"

rule sortmeRNA:
    input: 
        ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    output:
        reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
        non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
    params:
        silvabac = "rRNA_databases/silva-bac-16s-id90.fasta,index/silva-bac-16s-db:rRNA_databases/silva-bac-23s-id98.fasta,index/silva-bac-23s-db",
        silvaarc = "rRNA_databases/silva-arc-16s-id95.fasta,index/silva-arc-16s-db:rRNA_databases/silva-arc-23s-id98.fasta,index/silva-arc-23s-db",
        silvaeuk = "rRNA_databases/silva-euk-18s-id95.fasta,index/silva-euk-18s-db:rRNA_databases/silva-euk-28s-id98.fasta,index/silva-euk-28s-db",
        rfam = "rRNA_databases/rfam-5s-database-id98.fasta,index/rfam-5s-db:rRNA_databases/rfam-5.8s-database-id98.fasta,index/rfam-5.8s-db",
        acc = "--num_alignments 1 --fastx --log -a 20 -m 64000 --paired_in -v"
    log:
        "Clean/sortmeRNAlogs/{exp}_sortmeRNA.log"
        shell:'''
        sortmerna --ref {params.silvabac}:{params.silvaarc}:{params.silvaeuk}:{params.rfam} --reads {input.ilfq} --aligned {output.reads_rRNA} --other {output.non_rRNA} {params.acc}
        '''
rule unmerge_paired:
    input:
        inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"
    output:
        R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
        R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
    shell:"unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"

This worked fine !. But for 1 sample it produced an output with size ~53 GB. I have 90 samples to run and cannot afford huge disk space. I tried to make output of rules unzip,merge_paired,sortmeRNA as temp(), but upon executing unmerge_paired raises "Missing input files exception" error. I also tried to add rule_remove to delete all those intermediate directories. But that is not executed as last rule, rather somewhere in the middle raising error again !. Is there any efficient way to do this ?

The error that occurs is:

MissingInputException in line 45 of sortmeRNA_pipeline_memv2.0.snakefile:
Missing input files for rule unmerge_paired:
Clean/3.sorted/test_reads_nonRNA.fastq

Also please note that, rule sortmeRNA requires a string for output and produces string.fastq file, which is then input into rule unmerge_paired ! Thanks.

1
Can you give us the full missing input files exception? Since the temp() approach should work.Maarten-vd-Sande
edited in question !. My guess is that, temp clears the folder mid-way before the execution of unmerge_paired ?Arun

1 Answers

2
votes

For Snakemake to connect the input of one rule to the output of another, they will need to be identical. The way you describe the output of sortmeRNA and the input unmerge_paired do not work together, whether you put temp() around it or not.

rule sortmeRNA:
    input: 
        ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    output:
        reads_rRNA = temp("Clean/3.sorted/{exp}_reads_rRNA.fastq"),
        non_rRNA = temp("Clean/3.sorted/{exp}_reads_nonRNA.fastq")
    params:
        reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
        non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"            
    shell:
        '''
        sortmerna --aligned {params.reads_rRNA} --other {params.non_rRNA} ...
        '''

rule unmerge_paired:
    input:
        inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"  # or rules.sortmeRNA.output.non_rRNA
    output:
        R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
        R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
    shell:
        "unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"

I removed all the stuff which isn't necessary to understand what is going on, you will have to place those back obviously. I changed the output of sortmeRNA to the actual output of the rule (and made them temp). I also added two params, which are the same as the output but then without the fastq extension.