Snakemake: output file name seems to require a static path portion

Question

I'm finding that the output file name in each rule seems to need a static portion, e.g. "data/{wildcard}_data.csv" rather than just "{wildcard}_data.csv".

For example, the script below returns the following error on a dry run:

Building DAG of jobs...
MissingInputException in line 12 of /home/rebecca/workflows/exploring_tools/affymetrix_preprocess/snakemake/Snakefile:
Missing input files for rule getDatFiles:
GSE4290

Script:

rule all:
 input: expand("{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])       
 benchmark: "benchmark.csv"       

rule getDatFiles:
 input: "{geoid}"       
 output: temp("{geoid}_datFiles.RData")       
 shell:
    "Rscript scripts/getDatFiles.R"

rule maskProbes:
 input: "{geoid}_datFiles.RData"       
 output: temp("{geoid}_datFiles.masked.RData")       
 params:
  probeFilterFxn = lambda x: config['probeFilterFxn'],
  minProbeNumber = lambda x: config['minProbeNumber'],
  probeSingle = lambda x: config['probeSingle']
 script: "scripts/maskProbes.R"

rule runExpresso:
 input: "{geoid}_datFiles.masked.RData"       
 output: temp("{geoid}_datout.RData")       
 params:
  bgcorrect_method = lambda x: config['bgcorrect_method'],
  normalize = lambda x: config['normalize'],
  pmcorrect_method = lambda x: config['pmcorrect_method'],
  summary_method = lambda x: config['summary_method']
 script: "scripts/runExpresso.R"

rule scaleData:
 input: "{geoid}_datout.RData"       
 output: temp("{geoid}_datout.scaled.RData")       
 params: sc = lambda x: config['sc']
 script: "scripts/scaleData.R"

rule getExpr:
 input: "{geoid}_datout.scaled.RData"       
 output: temp("{geoid}_datout.scaled.expr.csv")       
 script: "scripts/getExpr.R"

The following script, by contrast, runs without error; the only difference is the "output/" prefix on the input and output file names:

rule all:
 input: expand("output/{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])
 benchmark: "output/benchmark.csv"

rule getDatFiles:
 input: "output/{geoid}"
 output: temp("output/{geoid}_datFiles.RData")
 shell:
    "Rscript scripts/getDatFiles.R"

rule maskProbes:
 input: "output/{geoid}_datFiles.RData"
 output: temp("output/{geoid}_datFiles.masked.RData")
 params:
  probeFilterFxn = lambda x: config['probeFilterFxn'],
  minProbeNumber = lambda x: config['minProbeNumber'],
  probeSingle = lambda x: config['probeSingle']
 script: "scripts/maskProbes.R"

rule runExpresso:
 input: "output/{geoid}_datFiles.masked.RData"
 output: temp("output/{geoid}_datout.RData")
 params:
  bgcorrect_method = lambda x: config['bgcorrect_method'],
  normalize = lambda x: config['normalize'],
  pmcorrect_method = lambda x: config['pmcorrect_method'],
  summary_method = lambda x: config['summary_method']
 script: "scripts/runExpresso.R"

rule scaleData:
 input: "output/{geoid}_datout.RData"
 output: temp("output/{geoid}_datout.scaled.RData")
 params: sc = lambda x: config['sc']
 script: "scripts/scaleData.R"

rule getExpr:
 input: "output/{geoid}_datout.scaled.RData"
 output: temp("output/{geoid}_datout.scaled.expr.csv")
 script: "scripts/getExpr.R"

I'm having a hard time understanding why this might be happening. Ultimately, I'd like to write workflows that are as flexible as possible, and ideally that entails making the output directory variable.

Any insight would be much appreciated.

dariober · Accepted Answer · 2020-11-24T09:35:24

You have:

rule getDatFiles:
 input: "{geoid}"

which means Snakemake expects an existing file in the current directory whose name is exactly the value of {geoid}, e.g. ./GSE4290. I suspect what you want is:

rule getDatFiles:
    input: "data/{geoid}_data.csv"
...

input: "output/{geoid}" probably works only because a file named output/GSE4290 already exists, created elsewhere.
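To make the reasoning concrete, here is a rough plain-Python sketch of how a requested target is matched against a rule's output pattern and how the wildcard value is then substituted into the input pattern. It is not Snakemake's actual code: pattern_to_regex and resolve_input are hypothetical helpers written for this illustration, and the only Snakemake behaviour assumed is that an unconstrained wildcard matches the regex .+ by default.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as '{geoid}_datFiles.RData' into a compiled regex.

    Hypothetical helper: each {name} becomes a named group matching '.+'.
    This sketch assumes each wildcard name appears only once per pattern.
    """
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard
    return re.compile(regex + "$")


def resolve_input(target, output_pattern, input_pattern):
    """Return the input file a rule would need in order to produce `target`."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None  # this rule cannot produce the target
    return input_pattern.format(**match.groupdict())


# First Snakefile: the required input is a file literally named GSE4290.
print(resolve_input("GSE4290_datFiles.RData", "{geoid}_datFiles.RData", "{geoid}"))
# prints GSE4290

# Second Snakefile: the required input is output/GSE4290.
print(resolve_input("output/GSE4290_datFiles.RData",
                    "output/{geoid}_datFiles.RData", "output/{geoid}"))
# prints output/GSE4290
```

In both cases the resolved input has to exist on disk already, because no rule in either Snakefile has an output pattern that could produce it. That is consistent with the MissingInputException naming GSE4290.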

(I haven't looked at the rest of the scripts.)
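On making the output directory variable: since a Snakefile is Python, one option is to build the path patterns from the config value before using them in the rules. The sketch below is plain Python and only shows the string handling. The config dict, the "results" directory name and the in_out_dir helper are all made up for illustration, and it assumes config['geoid'] is a list.

```python
import os

# Hypothetical stand-in for the config Snakemake would load from a config file.
config = {"geoid": ["GSE4290"], "out_dir": "results"}

OUT_DIR = config["out_dir"]


def in_out_dir(name):
    """Prefix a file name or wildcard pattern with the configured directory."""
    return os.path.join(OUT_DIR, name)


# A pattern that keeps {geoid} as a wildcard, usable as a rule's input or output.
dat_files_pattern = in_out_dir("{geoid}_datFiles.RData")

# Concrete final targets, analogous to what expand() builds for rule all.
final_targets = [
    in_out_dir("{geoid}_datout.scaled.expr.csv").format(geoid=geoid)
    for geoid in config["geoid"]
]

print(dat_files_pattern)
print(final_targets)
```

Inside a Snakefile the same helper could be called directly in the input and output directives. Note that this only changes where outputs go; the first input of getDatFiles still has to point at a file that really exists.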
