0
votes

I am having trouble fixing a problem I have in snakemake. My samples are currently named something like "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz". I want to eventually rename them to a shorter name like "1_2_Brain_S77_L002_R1" and then with the extension "_trim.fastq.gz" for my rule. I am trimming using bbduk. For my input I want to call my list of dictionaries, allSamples. Then I want to access values from each dictionary. Specifically, the "shortName1" and "shortName2" values. My problem is in my dry it shows the entire list as input for a single run. I'm not sure how to make it register that each element is it's own run. I am showing 3 file names as an example when I actually have 114. So, I would expect my dry run to have 114 counts for the trim job.

config.json

{
   "allSamples" : ['1_2_Brain_S77_L002', '10_4_Kidney_S82_L002', '11_4_BB_S105_L002' ......],

   "1_2_Brain_S77_L002":{
        "sampleName1": "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz",
        "sampleName2": "1-2-Brain_187_006_S77_L002_R2_001.fastq.gz",
        "shortName1": "1_2_Brain_S77_L002_R1",
        "shortName2": "1_2_Brain_S77_L002_R2",
        "stemName": "1_2_Brain_S77_L002"
        }, ....
}

I'm taking my files located in rawReads/ and storing the new trimmed files in trimmedReads/ .

Snakefile

configfile: "refs/config.json"

# variables
sampleDict = config["allSamples"]
sampleNames1 = [config[i]["sampleName1"] for i in sampleDict]
sampleNames2 = [config[i]["sampleName2"] for i in sampleDict]
shortNames1 = [config[i]["shortName1"] for i in sampleDict]
shortNames2 = [config[i]["shortName2"] for i in sampleDict]

rule all:
    input: 
        expand("trimmedReads/{trim1}_trim.fastq.gz", trim1 = shortNames1),
        expand("trimmedReads/{trim2}_trim.fastq.gz", trim2 = shortNames2)

rule trim:
    input:
        R1 = expand("rawReads/{sample1}", sample1 = sampleNames1),
        R2 = expand("rawReads/{sample2}", sample2 = sampleNames2)
    output:
        trim1 = expand("trimmedReads/{trim1}_trim.fastq.gz", trim1 = shortNames1),
        trim2 = expand("trimmedReads/{trim2}_trim.fastq.gz", trim2 = shortNames2)
    shell:
        """
        bbduk.sh in1={input.R1} in2={input.R2} out1={output.trim1} out2={output.trim2} ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
        """

When I do a dry run I get this.

Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    1   trim
    2

[Mon May 24 22:42:36 2021]
rule trim:
    input: rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz, rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz, rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz, ...
    output: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz, trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz, trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz, ...
    jobid: 1


bbduk.sh in1=rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz ... out1=trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz ... ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
        

[Mon May 24 22:42:36 2021]
localrule all:
    input: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz, trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz, trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz, ...
    jobid: 0

Job counts:
    count   jobs
    1   all
    1   trim
    2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
1

1 Answers

1
votes

The wildcards for the rule test: is an empty dict. There is no wildcard value wildcards.sample specified within this rule. Each wildcard shall be specified in the output: section, and this section is empty for this rule. Actually your rule test: has absolutely no effect unless you specify it explicitly as a target: without any output specified, Snakemake just ignores this useless rule that produces nothing.

I guess that the files ["rawReads/1_2_Brain_S77_L002", "rawReads/17_6_Brain_S83_L002"] already exist, so Snakemake finds that the target exists on the disk, and does nothing, producing "no output".

I don't understand what you mean by "eventually rename them to a shorter name", but here is a recipe of how to copy the files. Take this as a pattern of "how to use wildcards to access my sample names":

rule all:
    input: ["path_to_target/foo_SampleName1_bar", "path_to_target/foo_SampleName2_bar"]
    # List the files you expect to get as a target

rule copy:
    input:
        "path_to_source/blablabla_{sample}_bazz"
    output:
        "path_to_target/foo_{sample}_bar"
    shell:
        "echo {input}; cp {input} {output}"

How it works:

  1. Snakemake discovers that it needs to produce some files (in out case these files are "path_to_target/foo_SampleName1_bar", "path_to_target/foo_SampleName2_bar").
  2. Snakemake finds out that the rule copy: claims the output that (if to substitute {sample} with the value "SampleName1") matches the filename "path_to_target/foo_SampleName1_bar"
  3. If the file "path_to_source/blablabla_SampleName1_bazz" exists, Snakemake is satisfied, and knows how to produce the file "path_to_target/foo_SampleName1_bar"
  4. The steps 2, 3 are repeated for the {sample} value "SampleName2".
  5. Now it knows that the rule copy: shall be run two times: one time for each file.
  6. All dependencies are resolved, Snakemake can start the pipeline.