0 votes

I have a large dataset in text files (1,000,000 lines). Each line has 128 columns.

Now I am trying to build a k-d tree from this data, and I want to use MapReduce for the calculations.

Brute-force approach for my problem:
1) Write a MapReduce job to find the variance of each column and select the column with the highest variance.
2) Taking (column name, variance value) as input, write another MapReduce job to split the input data into two parts: one part has all rows whose value in that column is less than the input value, the other has all rows whose value is greater.
3) For each part, repeat steps 1 and 2, and continue the process until you are left with 500 values in each part.

The (column name, variance value) pair forms a single node of my tree, so with the brute-force approach, for a tree of height 10 I need to run 1024 MapReduce jobs.
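For reference, this is the kind of job I have in mind for step 1 (a rough sketch; I am assuming tab-separated columns, and the class names are just placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (column index, value) for every one of the 128 columns of every row
public class ColumnVarianceMapper
        extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split("\t");   // assuming tab-separated columns
        for (int i = 0; i < cols.length; i++) {
            ctx.write(new IntWritable(i), new DoubleWritable(Double.parseDouble(cols[i])));
        }
    }
}

// Reducer: variance from the sum and sum of squares, one output record per column;
// the job's output is then scanned for the column with the highest variance
class ColumnVarianceReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable column, Iterable<DoubleWritable> values, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0, sumSq = 0;
        long n = 0;
        for (DoubleWritable v : values) {
            double x = v.get();
            sum += x;
            sumSq += x * x;
            n++;
        }
        double mean = sum / n;
        ctx.write(column, new DoubleWritable(sumSq / n - mean * mean));
    }
}
```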

My questions:
1) Is there any way I can improve the efficiency by running fewer MapReduce jobs?
2) I am reading the same data every time. Is there any way I can avoid that?
3) Are there other frameworks like Pig, Hive, etc. that are efficient for this kind of task?
4) Are there any frameworks with which I can save the data into a data store and easily retrieve it?

Please help...


2 Answers

2 votes

Why don't you try Apache Spark (https://spark.apache.org/) here? This seems like a perfect use case for Spark.
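In particular, Spark lets you cache the parsed rows in memory once and then reuse them for every level of the tree, which directly addresses your question 2. A minimal sketch with the Java API (the HDFS path and the tab separator are assumptions):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ColumnVarianceSpark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kd-tree-variance");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Parse once and cache, so the 1,000,000 x 128 rows are not re-read at every level
        JavaRDD<double[]> rows = sc.textFile("hdfs:///path/to/data")   // placeholder path
                .map(line -> {
                    String[] parts = line.split("\t");                 // assuming tab-separated columns
                    double[] v = new double[parts.length];
                    for (int i = 0; i < parts.length; i++) {
                        v[i] = Double.parseDouble(parts[i]);
                    }
                    return v;
                })
                .cache();

        // Variance of one column straight from the cached data, no extra pass over the files
        final int column = 0;
        JavaDoubleRDD col = rows.mapToDouble(r -> r[column]);
        double splitValue = col.variance();   // the split threshold from step 2 of the question
        System.out.println("variance of column " + column + " = " + splitValue);

        // Splitting into child partitions is just a filter on the cached RDD
        JavaRDD<double[]> left  = rows.filter(r -> r[column] <  splitValue);
        JavaRDD<double[]> right = rows.filter(r -> r[column] >= splitValue);

        sc.stop();
    }
}
```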

1 vote

With one MR job per node of the tree you run O(2^n) jobs (where n is the height of the tree), which is bad given YARN's per-job overhead. But with simple programming tricks you can bring it down to O(n). Here are some ideas:

  1. Add an extra partition column in front of your key; this column is the nodeID (each node in your tree has a unique ID). This creates independent data flows and ensures that keys from different branches of the tree do not mix, and that all variances are calculated in the context of their nodeID, in waves, one wave per layer of the tree. This removes the need for an MR job per node with very little change to the code, and brings the number of jobs down from O(2^n) to O(n);
  2. The data is not sorted around the split value, so during a split the elements of the parent list have to travel to their destination child lists, which means network traffic between the cluster nodes. Therefore caching the whole data set across multiple machines might not give a significant improvement;
  3. After calculating a couple of levels of the tree, certain nodeIDs may have few enough rows to fit in the memory of a mapper or reducer. You can then continue processing that sub-tree entirely in memory and avoid a costly MR job. This reduces the number of MR jobs, or the amount of data they handle, as the processing gets closer to the bottom of the tree;
  4. Another optimisation is to write a single MR job whose mapper splits the rows around the selected value of each node and writes them out via MultipleOutputs, while emitting keys with the child nodeIDs of the next tree level to the reducer, which calculates the variances of the columns within the child lists (see the sketch after this list). Of course the very first run has no split value, but all subsequent runs have multiple split values, one for each child nodeID.
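A rough sketch of what the mapper of that combined job could look like (the names, the nodeID encoding and the tab separator are only illustrative; loading the splits, the MultipleOutputs part and the variance reducer are omitted):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Routes every row to its child nodeID using the parent node's split, so that one MR job
// processes a whole layer of the tree instead of one job per node.
public class SplitAndRegroupMapper extends Mapper<LongWritable, Text, Text, Text> {

    // parent nodeID -> {split column index, split value}; in a real job this would be
    // loaded in setup(), e.g. from the distributed cache
    private final Map<String, double[]> splits = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text value, Context ctx)
            throws IOException, InterruptedException {
        // input rows carry the parent nodeID as an extra first column (idea 1 above)
        String[] parts = value.toString().split("\t", 2);
        String parentId = parts[0];
        String[] cols = parts[1].split("\t");

        double[] split = splits.get(parentId);
        int splitColumn = (int) split[0];
        double splitValue = split[1];

        // left or right child, depending on the parent's split value
        String childId = Double.parseDouble(cols[splitColumn]) < splitValue
                ? parentId + "L"
                : parentId + "R";

        // the reducer receives all rows of one child node and computes the per-column
        // variances for that child; MultipleOutputs (not shown) would also write the
        // row itself into the child's partition for the next wave
        ctx.write(new Text(childId), new Text(parts[1]));
    }
}
```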