How to use a list in Snakemake tabular config.
I use Snakemake Tabular (mapping with BWA mem) configuration to describe my sequencing units (libraries sequenced on separate lines). At the next stage of analysis I have to merge sequencing units (mapped .bed files) and take merged .bam files (one for each sample). Now I'm using YAML config for describing of what units belong to what samples. But I wish to use Tabular config for this purpose,
I'm not clear how to write and recall a list (containing sample information) from cell of Tab separated file.
This is how my Tablular config for units looks like:
Unit SampleSM LineID PlatformPL LibraryLB RawFileR1 RawFileR2
sample_001.lane_L1 sample_001 lane_L1 ILLUMINA sample_001 /user/data/sample_001.lane_L1.R1.fastq.gz /user/data/sample_001.lane_L1.R2.fastq.gz
sample_001.lane_L2 sample_001 lane_L2 ILLUMINA sample_001 /user/data/sample_001.lane_L2.R1.fastq.gz /user/data/sample_001.lane_L2.R2.fastq.gz
sample_001.lane_L8 sample_001 lane_L8 ILLUMINA sample_001 /user/data/sample_001.lane_L8.R1.fastq.gz /user/data/sample_001.lane_L8.R2.fastq.gz
sample_002.lane_L1 sample_002 lane_L1 ILLUMINA sample_002 /user/data/sample_002.lane_L1.R1.fastq.gz /user/data/sample_002.lane_L1.R2.fastq.gz
sample_002.lane_L2 sample_002 lane_L2 ILLUMINA sample_002 /user/data/sample_002.lane_L2.R1.fastq.gz /user/data/sample_002.lane_L2.R2.fastq.gz
This is how my YAML config for Samples looks like:
samples:
"sample_001": ["sample_001.lane_L1", "sample_001.lane_L2", "sample_001.lane_L8"]
"sample_002": ["sample_002.lane_L1", "sample_002.lane_L2"]
My Snakemake code:
import pandas as pd
import os
workdir: "/user/data/snakemake/"
configfile: "Samples.yaml"
units_table = pd.read_table("Units.tsv").set_index("Unit", drop=False)
rule all:
input:
expand('map_folder/{unit}.bam', unit=units_table.Unit),
expand('merge_bam_folder/{sample}.bam', sample=config["samples"]),
rule map_paired_end:
input:
r1 = lambda wildcards: expand(units_table.RawFileR1[wildcards.unit]),
r2 = lambda wildcards: expand(units_table.RawFileR2[wildcards.unit])
output:
bam = 'map_folder/{unit}.bam'
params:
bai = 'map_folder/{unit}.bam.bai',
ref='/user/data/human_g1k_v37.fasta.gz',
SampleSM = lambda wildcards: units_table.SampleSM[wildcards.unit],
LineID = lambda wildcards: units_table.LineID[wildcards.unit],
PlatformPL = lambda wildcards: units_table.PlatformPL[wildcards.unit],
LibraryLB = lambda wildcards: units_table.LibraryLB[wildcards.unit]
threads:
16
shell:
r"""
seqtk mergepe {input.r1} {input.r2}\
| bwa mem -M -t {threads} -v 3 \
{params.ref} - \
-R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
| samtools view -u -Sb - \
| samtools sort - -m 4G -o {output.bam}
samtools index {output.bam}
"""
rule samtools_merge_bam:
input:
lambda wildcards: expand('map_folder/{file}.bam', file=config['samples'][wildcards.sample])
output:
bam = 'merge_bam_folder/{sample}.bam'
threads:
1
shell:
r"""
samtools merge {output.bam} {input}
samtools index {output.bam}
"""
samples
in one column andsample.lane
in another column, with the latter as a string using comma or semicolon as delimiter, and then usingsplit()
on them to convert to list? However, I do agree that readability wouldn't be easy if done this way. – Manavalan Gajapathy