2 votes

Recently I read a paper that proposed an algorithm for mining maximum contiguous patterns from DNA data. The proposed method, which sounds pretty interesting, uses the following MapReduce model: map -> map -> reduce -> reduce. That is, the first map phase is executed and its output becomes the input of the second map phase. The second map phase's output is the input of the first reduce phase, the output of the first reduce phase is the input of the second reduce phase, and finally the results are flushed to HDFS. Although it seems like an interesting method, the paper didn't mention how it was implemented. My question is: how do you implement this sort of MapReduce chaining?

3 – Thanks. I didn't actually know how to accept a question :) I tried to "vote up" but couldn't. – Ahmedov

3 Answers

1 vote

In Hadoop, as far as I know, you cannot do this as of now.

One approach is to use ChainMapper to do the map -> map -> reduce part in a first job. Then feed the output of that job into a second job whose mapper is an identity mapper (IdentityMapper in the old API, or the base Mapper class in the new API) and whose reducer is your second-phase reducer.
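For illustration, here is a minimal sketch of that approach with the new `org.apache.hadoop.mapreduce` API. `FirstMapper`, `SecondMapper`, `FirstReducer` and `SecondReducer` are hypothetical placeholders for your pattern-mining logic, and the `Text`/`Text` key-value types are assumptions you would adjust to your data. Job 1 chains the two map phases with `ChainMapper` and runs the first reducer; job 2 uses the base `Mapper` class as an identity mapper in front of the second reducer:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapMapReduceReduceDriver {

    // Hypothetical placeholder phases: the real pattern-mining logic would go here.
    public static class FirstMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            ctx.write(new Text("key"), value);                 // first map phase
        }
    }

    public static class SecondMapper extends Mapper<Text, Text, Text, Text> {
        protected void map(Text key, Text value, Context ctx) throws IOException, InterruptedException {
            ctx.write(key, value);                             // second map phase
        }
    }

    public static class FirstReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
            for (Text v : values) ctx.write(key, v);           // first reduce phase
        }
    }

    public static class SecondReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
            for (Text v : values) ctx.write(key, v);           // second reduce phase
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "_tmp");
        Path output = new Path(args[1]);

        // Job 1: map -> map -> reduce. ChainMapper runs the two map phases
        // back to back inside one map task, then FirstReducer runs as the reducer.
        Job job1 = Job.getInstance(conf, "map-map-reduce");
        job1.setJarByClass(MapMapReduceReduceDriver.class);
        ChainMapper.addMapper(job1, FirstMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job1, SecondMapper.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));
        job1.setReducerClass(FirstReducer.class);
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(Text.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class); // easy to re-read in job 2
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: identity map -> second reduce. The base Mapper class just passes
        // records through, so only SecondReducer does real work here.
        Job job2 = Job.getInstance(conf, "identity-map-reduce");
        job2.setJarByClass(MapMapReduceReduceDriver.class);
        job2.setMapperClass(Mapper.class);                     // identity mapper
        job2.setReducerClass(SecondReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```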

0 votes

Please read about Apache Tez. Any combination such as M -> M -> R -> R -> R is supported there, because Tez lets you express the whole pipeline as a single DAG instead of separate MapReduce jobs.

0 votes

I think there are two methods to deal with your case:

  1. Integrate the code of the two map functions into one map task that runs the two phases back to back, and handle the two reduce functions the same way in one reduce task (a sketch follows this list).

  2. Divide the map-map-reduce-reduce process into two jobs: the first Hadoop job runs the two maps, with the second map phase rewritten as that job's reduce task; the second Hadoop job runs the two reduces, with the first reduce phase rewritten as that job's map task. Maybe you could use Oozie to manage the Hadoop workflow if you need to submit jobs that depend on each other.
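As a rough illustration of option 1, here is what a fused mapper might look like; `phaseOne` and `phaseTwo` are hypothetical stand-ins for the paper's two map-phase transformations, and the same idea would apply to fusing the two reduce phases:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map task that runs two logical map phases back to back.
public class FusedMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Phase 1: transform the raw input record.
        for (String intermediate : phaseOne(value.toString())) {
            // Phase 2: feed phase 1's output straight into the second transformation.
            for (String result : phaseTwo(intermediate)) {
                context.write(new Text(result), new Text(intermediate));
            }
        }
    }

    // Hypothetical placeholders for the two per-record computations.
    private List<String> phaseOne(String record) {
        return Collections.singletonList(record);
    }

    private List<String> phaseTwo(String record) {
        return Collections.singletonList(record);
    }
}
```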