0
votes

In a snakemake bioinformatics workflow aimed at mapping long reads, input fastq data could have a range of possible file extensions depending on user preference and file compression format (e.g. I expect to observed both sequence.fastq or sequence.fq.gz). Wildcards can be used to select for the input files - I am having a challenge with the naming of output files. In a single workflow I am expecting that we could see samples that are uncompressed, gzipped and bzip2 compressed.

Is there a recommended way for stripping the file extension such that the output of a mapping could be sequence.bam rather than the excessive sequence.fq.gz.bam?

Any hints would be very gratefully received - thanks S

1
Show your minimal reproducible example so potential answers have some context. Maybe edit the question and make it fit the expected format: what are you trying to do, what did you try, and what results did you get.user1531971

1 Answers

0
votes

There are a few ways of doing this. One way without the need for regex but under the naming standard that the sequence id and extension are separated by a single dot is using split only on first occurence. For example, if you had filename in which there are two identifiers or more separated with a . (e.g. {sequence_id1}.{sequence_id2}.{extension}) you would have to use regex but the logic is similar to below.

You can test this example by creating an input directory with the following files:

$ mkdir input
$ for i in 1 2 3;do touch input/sequence"$i";done
$ for i in 4 5 6;do touch input/sequence"$i".gz;done
$ for i in 7 8 9;do touch input/sequence"$i".fq.gz;done
$ for i in 10 11 12;do touch input/sequence"$i".bzip;done

The following Snakefile implementation would do what you require and would allow performing different actions depending on whether there is an extension and what type of extension it is:

###Snakefile
# Get all filenames in a specific input directory
wildcards = glob_wildcards('input/{fq_files}')
# Split the filenames into basename and extension
files = [filename.split('.', 1) for filename in wildcards.fq_files]
# Create a dictionary of mapping basename:extension
file_dict = {filename[0]: filename[1] if len(
    filename) == 2 else '' for filename in files}

rule all:
    input:
        expand('output/{seqid}.bam', seqid=file_dict.keys()),

rule generate_bams:
    input:
        lambda wc: f'input/{{seqid}}.{file_dict[wc.seqid]}' if file_dict[wc.seqid] != '' else 'input/{seqid}',
    output: 'output/{seqid}.bam',
    run:
        if (file_dict[wildcards.seqid] == 'gz'):
            shell(
                'echo "FILENAME = {input}\nFile has gz in filename" > {output}')
        elif (file_dict[wildcards.seqid] == 'fq.gz'):
            shell(
                'echo "FILENAME = {input}\nFile has fq.gz in filename" > {output}')
        elif (file_dict[wildcards.seqid] == 'bzip'):
            shell(
                'echo "FILENAME = {input}\nFile has bzip in filename" > {output}')
        else:
            shell(
                'echo "FILENAME = {input}\nFile has no extension in filename" > {output}')

Since glob_wildcards will not keep the extension to basename mapping, you can split the filename and create a basename:extension mapping. After creating this dictionary, you can always access it during input and params specification or run/shell directives. You can do similar with regex by changing the split part into a regex and obtaining the matching groups.

I am using Python 3.6 and if you have Python that is below 3.6 you can change the string literal portion:

lambda wc: f'input/{{seqid}}.{file_dict[wc.seqid]}' if file_dict[wc.seqid] != '' else 'input/{seqid}',

to:

lambda wc: 'input/{{seqid}}.{0}'.format(file_dict[wc.seqid]) if file_dict[wc.seqid] != '' else 'input/{seqid}',

So if there are filenames w/o extensions or with various extensions you can account for them using glob_wilcards, split and a dictionary for storing the basename:extension mapping.

In the future you should try to provide example input, output and if not a working example at least what you tried.


EDIT

I assumed that you do not have same sequence names with different extensions. If you do want to account for that as well, you can map a sequence name to multiple extensions. You can do this by changing:

### Create a dictionary of mapping basename:extension
# file_dict = {filename[0]: filename[1] if len(
#     filename) == 2 else '' for filename in files}
### Creata dictionary of mapping basename: [extensions]
file_dict = {}
for filename in files:
    if len(filename) == 2:
        file_dict.setdefault(filename[0],[]).append(filename[1])
    else:
        file_dict.setdefault(filename[0],[]).append('')
print(file_dict)

Doing this you would also need to change the lambda function to a for loop or list comprehension in the input directive and change the conditions in if/elif(condition).