There are a few ways of doing this. One way without the need for regex but under the naming standard that the sequence id and extension are separated by a single dot is using split only on first occurence. For example, if you had filename in which there are two identifiers or more separated with a .
(e.g. {sequence_id1}.{sequence_id2}.{extension}
) you would have to use regex but the logic is similar to below.
You can test this example by creating an input directory with the following files:
$ mkdir input
$ for i in 1 2 3;do touch input/sequence"$i";done
$ for i in 4 5 6;do touch input/sequence"$i".gz;done
$ for i in 7 8 9;do touch input/sequence"$i".fq.gz;done
$ for i in 10 11 12;do touch input/sequence"$i".bzip;done
The following Snakefile implementation would do what you require and would allow performing different actions depending on whether there is an extension and what type of extension it is:
###Snakefile
# Get all filenames in a specific input directory
wildcards = glob_wildcards('input/{fq_files}')
# Split the filenames into basename and extension
files = [filename.split('.', 1) for filename in wildcards.fq_files]
# Create a dictionary of mapping basename:extension
file_dict = {filename[0]: filename[1] if len(
filename) == 2 else '' for filename in files}
rule all:
input:
expand('output/{seqid}.bam', seqid=file_dict.keys()),
rule generate_bams:
input:
lambda wc: f'input/{{seqid}}.{file_dict[wc.seqid]}' if file_dict[wc.seqid] != '' else 'input/{seqid}',
output: 'output/{seqid}.bam',
run:
if (file_dict[wildcards.seqid] == 'gz'):
shell(
'echo "FILENAME = {input}\nFile has gz in filename" > {output}')
elif (file_dict[wildcards.seqid] == 'fq.gz'):
shell(
'echo "FILENAME = {input}\nFile has fq.gz in filename" > {output}')
elif (file_dict[wildcards.seqid] == 'bzip'):
shell(
'echo "FILENAME = {input}\nFile has bzip in filename" > {output}')
else:
shell(
'echo "FILENAME = {input}\nFile has no extension in filename" > {output}')
Since glob_wildcards
will not keep the extension to basename mapping, you can split the filename and create a basename:extension mapping. After creating this dictionary, you can always access it during input and params specification or run/shell directives. You can do similar with regex by changing the split part into a regex and obtaining the matching groups.
I am using Python 3.6 and if you have Python that is below 3.6 you can change the string literal portion:
lambda wc: f'input/{{seqid}}.{file_dict[wc.seqid]}' if file_dict[wc.seqid] != '' else 'input/{seqid}',
to:
lambda wc: 'input/{{seqid}}.{0}'.format(file_dict[wc.seqid]) if file_dict[wc.seqid] != '' else 'input/{seqid}',
So if there are filenames w/o extensions or with various extensions you can account for them using glob_wilcards
, split
and a dictionary for storing the basename:extension mapping.
In the future you should try to provide example input, output and if not a working example at least what you tried.
EDIT
I assumed that you do not have same sequence names with different extensions. If you do want to account for that as well, you can map a sequence name to multiple extensions. You can do this by changing:
### Create a dictionary of mapping basename:extension
# file_dict = {filename[0]: filename[1] if len(
# filename) == 2 else '' for filename in files}
### Creata dictionary of mapping basename: [extensions]
file_dict = {}
for filename in files:
if len(filename) == 2:
file_dict.setdefault(filename[0],[]).append(filename[1])
else:
file_dict.setdefault(filename[0],[]).append('')
print(file_dict)
Doing this you would also need to change the lambda function to a for loop or list comprehension in the input
directive and change the conditions in if/elif(condition)
.