
Goal: each node has a copy of the matrix, reads it, calculates some value via mapper(matrix, key), and emits <key, value>.

I'm trying to use a mapper written in Python via Hadoop Streaming. There are no reducers. Essentially, I'm trying to do a task similar to https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#How_do_I_process_files_one_per_map

Approach: I generated an input file (tasks) with one line per key, in the following format (columns: file path, key):

/path/matrix.csv 0
/path/matrix.csv 1
...
/path/matrix.csv 99
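
For completeness, such a file can be produced with a few lines of Python (the path and the key range 0..99 follow the example above):

    # Generate the tasks file: one "path key" line per key.
    with open('tasks', 'w') as f:
        for key in range(100):
            f.write('/path/matrix.csv %d\n' % key)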

Then I run a Hadoop Streaming job with this mapper over the tasks file. The mapper parses each line to get its arguments (filename, key), reads the matrix from that file, calculates the value associated with the key, and emits <key, value>.
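
A minimal sketch of such a mapper, where load_matrix and the body of mapper() are placeholders for the real algorithm:

    #!/usr/bin/env python
    # Streaming mapper sketch: reads "path key" lines from stdin,
    # loads the matrix, and emits "key<TAB>value" on stdout.
    import csv
    import sys

    def load_matrix(path):
        # Load the whole matrix into memory; it is assumed to fit.
        with open(path) as f:
            return [[float(x) for x in row] for row in csv.reader(f)]

    def mapper(matrix, key):
        # Placeholder computation (row sum); the real algorithm goes here.
        return sum(matrix[key])

    for line in sys.stdin:
        path, key = line.split()
        print('%s\t%s' % (key, mapper(load_matrix(path), int(key))))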

Problem: the current approach works and produces correct results, but it does so in a single mapper: since the input file is a mere 100 lines of text, it does not get split across several mappers. How do I force such a split despite the small input size?

Why do you need it? Let's leave the decision on the number of mappers to the framework. - Ravindra babu
Please reread the problem statement. My input has to be small, because it only contains a file path and a key. The file itself is big, but fits in the memory of each machine. So I just need to run many machines in parallel on the same data with different keys. - alexsalo

1 Answer


I realized that instead of running several mappers and no reducers, I could do the exact opposite. My architecture is now as follows (sketched below):

  • a thin mapper simply reads the input parameters and emits <key, value>
  • fat reducers read the file and run the algorithm on the keys they receive, then emit the results
  • set -D mapreduce.job.reduces=10 to change the parallelization level
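
Roughly, the two scripts and the job invocation could look like this. Here compute() is a stand-in for the real algorithm, and the streaming jar path depends on the installation:

    #!/usr/bin/env python
    # thin_mapper.py: forward "path key" pairs as key<TAB>path so that
    # the shuffle spreads the keys across the reducers.
    import sys

    for line in sys.stdin:
        path, key = line.split()
        print('%s\t%s' % (key, path))

    #!/usr/bin/env python
    # fat_reducer.py: load the matrix named in the value and run the
    # algorithm for each received key; the matrix is cached so it is
    # read only once per reducer.
    import csv
    import sys

    matrices = {}

    def load_matrix(path):
        if path not in matrices:
            with open(path) as f:
                matrices[path] = [[float(x) for x in row]
                                  for row in csv.reader(f)]
        return matrices[path]

    def compute(matrix, key):
        return sum(matrix[key])  # placeholder for the real algorithm

    for line in sys.stdin:
        key, path = line.rstrip('\n').split('\t')
        print('%s\t%s' % (key, compute(load_matrix(path), int(key))))

Submitted along these lines (note that the generic -D option must come before the streaming-specific options):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapreduce.job.reduces=10 \
        -input tasks -output results \
        -mapper thin_mapper.py -reducer fat_reducer.py \
        -file thin_mapper.py -file fat_reducer.py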

The original approach was mistaken, but the correct one was not obvious either.