Goal: each node, having a copy of the matrix, reads it, calculates some value via mapper(matrix, key), and emits <key, value>.
I'm trying to use a mapper written in Python via Hadoop Streaming. There are no reducers. Essentially, I'm trying to do a task similar to https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#How_do_I_process_files_one_per_map
Approach: I generated an input file (tasks), one task per line, in the following format (a short generator sketch follows the listing):
/path/matrix.csv 0
/path/matrix.csv 1
...
/path/matrix.csv 99
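For concreteness, here is a minimal sketch of how such a tasks file can be generated. The filename `tasks`, the matrix path, and the 0-99 key range come from the example above; everything else is illustrative:

```python
# Sketch: write one task line per key, all pointing at the same matrix file.
N_KEYS = 100  # keys 0..99, matching the listing above

with open("tasks", "w") as f:
    for key in range(N_KEYS):
        f.write("/path/matrix.csv %d\n" % key)
```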
Then I run the (Hadoop Streaming) mapper on this tasks file. The mapper parses each line to get its arguments, a filename and a key; it then reads the matrix from that file, calculates the value associated with the key, and emits <key, value>. A minimal sketch of such a mapper follows.
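This is not the actual mapper, just a sketch of the structure described above; `compute_value()` is a hypothetical stand-in for the real mapper(matrix, key) logic:

```python
#!/usr/bin/env python
# mapper.py -- sketch of a streaming mapper: reads "filename key" lines
# from stdin, loads the matrix, and emits tab-separated <key, value>.
import sys
import csv

def load_matrix(path):
    # Read the whole matrix from the CSV file available on every node.
    with open(path) as f:
        return [[float(x) for x in row] for row in csv.reader(f)]

def compute_value(matrix, key):
    # Hypothetical placeholder for the real per-key computation,
    # e.g. a row sum.
    return sum(matrix[key])

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2:
        continue  # skip malformed lines
    filename, key = parts[0], int(parts[1])
    # Note: since every line names the same file, the matrix could be
    # loaded once and cached instead of re-read per line.
    matrix = load_matrix(filename)
    value = compute_value(matrix, key)
    # Streaming expects tab-separated key/value pairs on stdout.
    print("%s\t%s" % (key, value))
```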
Problem: the current approach works and produces correct results, but it does so in a single mapper: the input file is a mere 100 lines of text, so it never gets split across several mappers. How do I force such a split despite the small input size?
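For reference, the streaming invocation is essentially the following (the jar path and output directory are placeholders; only the mapper and the tasks input are from the setup above):

```
hadoop jar /path/to/hadoop-streaming.jar \
    -input tasks \
    -output out \
    -mapper mapper.py \
    -file mapper.py
```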