I have a quick question regarding the use of dynamic wildcards. I have searched the documentation and forums, but have not found a straightforward answer to my query.

Here are the rules that are giving me trouble:

rule all:
    input: dynamic("carvemeOut/{species}.xml")
    shell: "snakemake --dag | dot -Tpng > pipemap.png"

rule speciesProt:
    input: "evaluation-output/clustering_gt1000_scg.tab"
    output: dynamic("carvemeOut/{species}.txt")
    shell:
        """
        cd {config[paths][concoct_run]}
        mkdir -p {config[speciesProt_params][dir]}
        cp {input} {config[paths][concoct_run]}/{config[speciesProt_params][dir]}
        cd {config[speciesProt_params][dir]}
        sed -i '1d' {config[speciesProt_params][infile]} #removes first row
        awk '{{print $2}}' {config[speciesProt_params][infile]} > allspecies.txt #extracts node information
        sed '/^>/ s/ .*//' {config[speciesProt_params][metaFASTA]} > {config[speciesProt_params][metaFASTAcleanID]} #removes annotation to protein ID
        Rscript {config[speciesProt_params][scriptdir]}multiFASTA2speciesFASTA.R
        sed -i 's/"//g' species*
        sed -i '/k99/s/^/>/' species*
        sed -i 's/{config[speciesProt_params][tab]}/{config[speciesProt_params][newline]}/' species*
        cd {config[paths][concoct_run]}
        mkdir -p {config[carveme_params][dir]}
        cp {config[paths][concoct_run]}/{config[speciesProt_params][dir]}/species* {config[carveme_params][dir]}
        cd {config[carveme_params][dir]}
        find . -name "species*" -size -{config[carveme_params][cutoff]} -delete #delete files with little information, these cause trouble
        """

rule carveme:
    input: dynamic("carvemeOut/{species}.txt")
    output: dynamic("carvemeOut/{species}.xml")
    shell:
        """
        set +u;source activate concoct_env;set -u
        cd {config[carveme_params][dir]}
        echo {input}
        echo {output}
        carve $(basename {input})
        """

I was previously using two different wildcards for the input and output of the carveme rule:

input: dynamic("carvemeOut/{species}.txt")
output: dynamic("carvemeOut/{gem}.xml")

What I want snakemake to do is to run the carveme rule multiple times, to create an output .xml file for each input .txt file. However, snakemake is instead running the rule one time, using a list of inputs to create one output, as can be seen below:

rule carveme:
input: carvemeOut/species2.txt, carvemeOut/species5.txt, carvemeOut/species1.txt, carvemeOut/species10.txt, carvemeOut/species4.txt, carvemeOut/species17.txt, carvemeOut/species13.txt, carvemeOut/species8.txt, carvemeOut/species14.txt
output: {*}.xml (dynamic)
jobid: 28

After modifying my rules to use the same wildcard, as suggested by @stovfl and shown in the first code box, I get the following error message:

$ snakemake all
Building DAG of jobs...
WildcardError in line 174 of /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/snakemake-concot/Snakefile:
Wildcards in input files cannot be determined from output files:
species

Any suggestions on how to address this problem?

Thanks in advance, FZ

Try following the Snakemake Tutorial, step-2-generalizing-the-read-mapping-rule. - stovfl
@stovfl thanks for your response. I have tried using the same wildcard for input and output, but when I call rule all, it cannot determine the wildcard. Error output shown below: $ snakemake all Building DAG of jobs... WildcardError in line 174 of /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/snakemake-concot/Snakefile: Wildcards in input files cannot be determined from output files: species. I will modify my original post to reflect the changes and add rule all. - Francisco Zorrilla
The tutorial doesn't use dynamic? Strip down your test case to only rule carveme. In this rule, remove everything from shell except the two echo {...} lines, then add the rest back line by line. You are using relative file paths and also doing cd ..., which is contradictory. Second, you are defining output but don't use it? - stovfl
True, the tutorial does not use dynamic. This was recommended to me by another user. I was trying to strip down my test case to only the carveme rule, but the problem is that it has wildcards in its input and output, which means that it cannot be my target rule. It is also true that I am not using the output of the carveme rule, but from what I understand, I need to specify it so that rule all knows where to find it! Could you expand on the relative file path comment? How does it contradict? Thanks for the help! - Francisco Zorrilla
"the relative file path comment": With cd ... you change the working directory. This could mean that the relative file path carvemeOut/ is no longer reachable. It's a guess, as I don't know what {config[carveme_params][dir]} expands to. - stovfl
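
To illustrate the relative-path concern raised in the last comment, here is a minimal Python sketch. It is not taken from the thread: the helper name visible_before_and_after_cd is hypothetical, and the sketch assumes, purely for illustration, that the shell block changes into the carvemeOut directory itself. It only shows that a path such as carvemeOut/species1.txt stops resolving once the working directory changes.

```python
import os
import tempfile
from pathlib import Path


def visible_before_and_after_cd(rel_path, subdir):
    """Report whether rel_path resolves before and after changing into subdir.

    Hypothetical helper for illustration only; it works in a throwaway
    temporary directory and restores the original working directory.
    """
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as root:
        try:
            os.chdir(root)
            # Create the file at the relative location Snakemake would expect.
            Path(rel_path).parent.mkdir(parents=True, exist_ok=True)
            Path(rel_path).touch()
            Path(subdir).mkdir(parents=True, exist_ok=True)
            before = Path(rel_path).exists()
            # Equivalent of the `cd` inside the shell block.
            os.chdir(subdir)
            after = Path(rel_path).exists()
        finally:
            # Leave the temporary directory before it is cleaned up.
            os.chdir(start)
    return before, after


print(visible_before_and_after_cd("carvemeOut/species1.txt", "carvemeOut"))
```

Whether this applies to the rules in the question depends on what {config[carveme_params][dir]} expands to, which the thread does not state.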

1 Answer

You want dynamic in rule all and in the rule that creates the dynamic output, but not in the input or output of your last rule (carveme).

Here is a working example. Take an example input file of species, species_example.txt:

SpeciesA
SpeciesB
SpeciesC
SpeciesD

The following Snakefile will dynamically produce 4 output files:

#Snakefile
rule all:
    input:
        dynamic("carvemeOut/{species}.xml"),

rule speciesProt:
    input: "species_example.txt"
    output: dynamic("carvemeOut/{species}.txt")
    shell:
        """
        awk '{{gsub(/\\r/,"",$1);print  > "carvemeOut/"$1".txt";}}' {input}
        """


rule carveme:
    input: "carvemeOut/{species}.txt"
    output: "carvemeOut/{species}.xml"
    shell: "cat {input} > {output}"

Dynamic currently has a lot of restrictions in Snakemake (only one dynamic wildcard is allowed, see Francisco's comment below, and dynamic and non-dynamic outputs cannot be mixed in the same rule), hence I avoid it if possible. For example, instead of making this example dynamic, I would have used a Python function to produce a list of possible species names before any rule is run, and used that list to expand the wildcards in rule all. Are you sure you need dynamic output?
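
A minimal sketch of that non-dynamic alternative, in plain Python so it can be tried outside Snakemake. The function names read_species and xml_targets are hypothetical, and the sketch assumes the species names can be read from a file such as species_example.txt before the workflow starts. That holds for the example above, but may not hold if the names are only known after an earlier rule has run.

```python
from pathlib import Path


def read_species(list_file):
    """Return the species names listed one per line in list_file."""
    names = []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()  # also drops a stray carriage return
        if name:  # skip blank lines
            names.append(name)
    return names


def xml_targets(species, out_dir="carvemeOut"):
    """Build the .xml paths that rule all would request."""
    return [f"{out_dir}/{name}.xml" for name in species]


# In a Snakefile this could be used roughly as follows (assumption, untested here):
#   SPECIES = read_species("species_example.txt")
#   rule all:
#       input: xml_targets(SPECIES)
```

With the target list fixed up front, rule carveme can keep its plain {species} wildcard and rule all no longer needs dynamic. The rule that creates the .txt files would then need regular, non-dynamic outputs as well.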

Also, you should avoid writing such long shell portions directly in the Snakefile and use either external scripts or break that shell command into multiple rules.