Nextflow - How to avoid random sample IDs for input files in two or more channels with "Join" or similar operator?

Question

I have implemented some NGS data analysis workflows with Nextflow. I used "Paired End" channels (fromFilePairs method) for some of my workflow processes. After multiple workflow executions, I ran into a problem I wasn't expecting: my sample IDs would sometimes be mixed up, resulting in inaccurate outputs for the processes where it happened. I think this is related to the non-deterministic input channels issue (https://www.nextflow.io/blog/2019/troubleshooting-nextflow-resume.html).

Let's suppose I apply my workflow to these paired-end files: sample1_R{1,2}.fastq, sample2_R{1,2}.fastq

process Step1 {
    input:
        tuple pair_ID, file(A) from channelA
        tuple pair_ID, file(B) from channelB
        tuple pair_ID, file(C) from channelC
...
}

For this kind of process, with more than one "tuple pair_ID" as input, the pair_ID values (= my sample names) can get mixed up: the process may end up using input files A and B of sample1 together with input file C of sample2, instead of all files (A, B, C) of the same pair_ID (key = only sample1 or only sample2). I saw these randomly mixed input files (which affected the outputs) after several workflow executions, both after using -resume when an error occurred and after fully successful workflow runs.

In order to have the same key (pair_ID) between the input files emitted by each of the 3 channels, I used the join operator:

process Step1 {
    input:
        tuple pair_ID, file(A), file(B), file(C) from channelA.join(channelB).join(channelC)
...
}

This operator seems to make everything work as expected: I don't see any mix-up in my sample IDs or in my final outputs. In the doc (https://www.nextflow.io/docs/latest/operator.html?highlight=join#join), join seems to be meant for two channels only, so I am unsure whether I am using it correctly for three channels.

Is my method using join legitimate? Or does it still have some flaws? Is there a better way to correct my issue? If I cannot be sure that this method prevents any mix-up in my sample IDs, I might switch to another workflow management system such as Snakemake, but I would really like to solve this issue and continue using Nextflow.

Thank you in advance, and don't hesitate to ask if something isn't clear!

If you do not get an answer here, consider asking on the nextflow chat / nextflow gitter channel: gitter.im/nextflow-io/nextflow , which seems much more active. There is also nextflow mailing list on google groups: groups.google.com/g/nextflow . — Timur Shtatland
Thank you, I have already asked it on gitter and I have just asked it on google group too thanks to your advice. — Neul Hyo

Steve Steve · Accepted Answer · 2020-09-21T00:09:33

As you have discovered, you should avoid using the same variable name (pair_ID) more than once in your input block. Using the same variable name does not guarantee the inputs will be joined up using this key. I imagine that whatever value you get for pair_ID from one input channel will just get clobbered by the pair_ID you get from one of your other input channels. You have also discovered that when you declare two or more input channels, the overall input ordering may not be consistent across multiple executions (for example, when using -resume).

To join two or more channels with a common key, you can simply use the join operator:

join

The join operator creates a channel that joins together the items emitted by two channels for which there exists a matching key. The key is defined, by default, as the first element in each item emitted.

Note that the join operator creates (returns) a new channel. Therefore, this:

joined = channelA.join(channelB).join(channelC)

is functionally the same as:

temp = channelA.join(channelB)
joined = temp.join(channelC)
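
As a minimal sketch of the chained join, assuming three toy channels keyed by sample ID (the file names below are hypothetical placeholders, not files from your workflow, and Channel.of requires a reasonably recent Nextflow version):

```nextflow
// Toy channels keyed by sample ID; the values are hypothetical placeholders.
// channelB deliberately emits the samples in a different order.
channelA = Channel.of( ['sample1', 'A1.txt'], ['sample2', 'A2.txt'] )
channelB = Channel.of( ['sample2', 'B2.txt'], ['sample1', 'B1.txt'] )
channelC = Channel.of( ['sample1', 'C1.txt'], ['sample2', 'C2.txt'] )

// Each join matches items on their first element (the key)
// and appends the remaining elements of the matching item.
channelA
    .join(channelB)
    .join(channelC)
    .view()
```

Each emitted item should be a single tuple per sample, such as [sample1, A1.txt, B1.txt, C1.txt], whatever order the individual channels emit in. The order of the joined items themselves is not guaranteed. Also note that, by default, join discards any key that is missing from one of the two channels, so a sample absent from one channel will silently not reach the process.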
