How to use output directories to aggregate files (and receive more informative error messages)?

Question

I'm trying to count the number of reads in each file at every step of a QC pipeline I'm building. I have a shell script I've used in the past that takes a directory and outputs the number of reads per file. Since I want to use a directory as input, I tried following the format laid out by Rasmus in this post:

https://bitbucket.org/snakemake/snakemake/issues/961/rule-with-folder-as-input-and-output

Here is some example input created earlier in the pipeline:

$ ls -1 cut_reads/
97_R1_cut.fastq.gz
97_R2_cut.fastq.gz
98_R1_cut.fastq.gz
98_R2_cut.fastq.gz
99_R1_cut.fastq.gz
99_R2_cut.fastq.gz

And a simplified Snakefile to first aggregate all reads by creating symlinks in a new directory, and then use that directory as input for the read counting shell script:

import os

configfile: "config.yaml"

rule all:
    input:
        "read_counts/read_counts.txt"

rule agg_count:
    input:
        cut_reads = expand("cut_reads/{sample}_{rdir}_cut.fastq.gz", rdir=["R1", "R2"], sample=config["forward_reads"])
    output:
        cut_dir = directory("read_counts/cut_reads")
    run:
        os.makedir(output.cut_dir)
        for read in input.cut_reads:
            abspath = os.path.abspath(read)       
            shell("ln -s {abspath} {output.cut_dir}")

rule count_reads:
    input:
        cut_reads = "read_counts/cut_reads"
    output:
        "read_counts/read_counts.txt"
    shell:
        '''
        readcounts.sh {input.cut_reads} >> {output}
        '''

Everything's fine in the dry-run, but when I try to actually execute it, I get a fairly cryptic error message:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   agg_count
    1   all
    1   count_reads
    3

[Tue Jun 18 11:31:22 2019]
rule agg_count:
    input: cut_reads/99_R1_cut.fastq.gz, cut_reads/98_R1_cut.fastq.gz, cut_reads/97_R1_cut.fastq.gz, cut_reads/99_R2_cut.fastq.gz, cut_reads/98_R2_cut.fastq.gz, cut_reads/97_R2_cut.fastq.gz
    output: read_counts/cut_reads
    jobid: 2

Job counts:
    count   jobs
    1   agg_count
    1
[Tue Jun 18 11:31:22 2019]
Error in rule agg_count:
    jobid: 0
    output: read_counts/cut_reads

Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/douglas/snakemake/scrap_directory/.snakemake/log/2019-06-18T113122.202962.snakemake.log

read_counts/ was created, but there's no cut_reads/ directory inside. No other error messages are present in the complete log. Anyone know what's going wrong or how to receive a more descriptive error message?

I'm also (obviously) fairly new to snakemake, so there might be a better way to go about this whole process. Any help is much appreciated!

salanova.elliott · Accepted Answer · 2019-06-18T23:24:32

... And it was a typo. Typical. os.makedir(output.cut_dir) should be os.makedirs(output.cut_dir). I'm still really curious why Snakemake isn't displaying the AttributeError that Python throws when you try to run this:

AttributeError: module 'os' has no attribute 'makedir'

Is there somewhere this is stored or can be accessed to prevent future headaches?
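
For anyone hitting the same thing, here is a minimal sketch of the aggregation step in plain Python, which makes it easy to test outside of Snakemake. The helper names link_into_dir and run_verbosely are hypothetical (they are not part of Snakemake or of the original Snakefile). The wrapper just prints the full traceback to stderr before re-raising; nothing above confirms why the traceback was hidden or that this changes what Snakemake itself reports, so treat it as a debugging aid rather than a fix.

```python
import os
import sys
import traceback


def link_into_dir(paths, target_dir):
    """Symlink each file in `paths` into `target_dir`, creating it if needed.

    Hypothetical helper for illustration only.
    """
    # os.makedirs (plural) exists; os.makedir does not and raises AttributeError.
    os.makedirs(target_dir, exist_ok=True)
    links = []
    for path in paths:
        link = os.path.join(target_dir, os.path.basename(path))
        # lexists is True even for a dangling symlink, so reruns do not fail.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(path), link)
        links.append(link)
    return links


def run_verbosely(func, *args):
    """Call func(*args); print the full traceback to stderr, then re-raise.

    Hypothetical helper for illustration only.
    """
    try:
        return func(*args)
    except Exception:
        traceback.print_exc(file=sys.stderr)
        raise
```

Inside the run: block of agg_count, this could be called as run_verbosely(link_into_dir, input.cut_reads, output.cut_dir), assuming the two functions are defined at the top of the Snakefile.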
