Define input files from csv

Question

I would like to define input file names from different variables extracted from a csv. I have built the following simplified example:

I have a file test.csv:

data/samples/A.fastq
data/samples/B.fastq

I give the path to test.csv in a json config file:

{
  "samples": {
    "summaryFile": "somepath/test.csv"
  }
}

Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:

#only for bcf_tools

import pandas

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)


def returnSamples(table):
  # Have tried different things here but nothing worked
  return table


rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample= samplesData)

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: returnSamples(wildcards.sample)
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

I have tried a million things including using expand (which is working but the rule is not called on each file).

Any help will be tremendously appreciated.

Maarten-vd-Sande · Accepted Answer · 2019-11-19T12:29:05

Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution, but a few small things went wrong:

Reading the csv with pandas does not do what you expect: by default, read_csv treats the first line of the file as a header (try printing samplesData to see what it contains). Therefore the expand in rule all does not work properly.
You do not need a lambda for the input; you can reuse the {sample} wildcard directly in the input path.
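
To see the first point without touching the real file, here is a minimal sketch that reads the same two lines from an in-memory string (the io.StringIO stand-in and the variable names are assumptions for illustration, not part of the original Snakefile):

```python
import io
import pandas

# Same content as test.csv, held in memory for the demonstration.
csv_text = "data/samples/A.fastq\ndata/samples/B.fastq\n"

# Default behaviour: the first line is consumed as the column header.
with_header = pandas.read_csv(io.StringIO(csv_text))
# header=None: every line is treated as data, columns are numbered 0, 1, ...
no_header = pandas.read_csv(io.StringIO(csv_text), header=None)

default_columns = list(with_header.columns)    # the first path ends up here
default_rows = with_header.iloc[:, 0].tolist()  # only the second path is data
all_rows = no_header.loc[:, 0].tolist()         # both paths are data

print(default_columns)
print(default_rows)
print(all_rows)
```

With the default call, A.fastq silently becomes a column name and only B.fastq remains as a row, which is why header=None is passed in the Snakefile below.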

This should work for your example:

import pandas
import re

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall(r"[^/]+\.", sample)[0][:-1] for sample in samplesData]  # overly complicated regex; raw string avoids an invalid-escape warning

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

However, I think it would be easiest to change the contents of test.csv. As it stands, we have to do some weird magic to get the sample name out of each file path; it would probably be best to just store the sample names there.
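
If changing test.csv is not an option, one possible alternative to the regex is the standard-library pathlib module. This is only a sketch: the hard-coded sample_paths list stands in for the values read from the csv and is an assumption for illustration.

```python
from pathlib import Path

# Stand-in for the list of paths read from test.csv.
sample_paths = ["data/samples/A.fastq", "data/samples/B.fastq"]

# Path.stem drops the directory part and the last suffix of each path.
samples = [Path(p).stem for p in sample_paths]

print(samples)
```

Note that stem removes only the final suffix, so a name such as A.fastq.gz would yield A.fastq rather than A; in that case the sample names would need further handling.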
