
I'm very new to Hadoop MapReduce and Spark. For my project, I want to perform data preprocessing with Hadoop MapReduce or Spark. I know the basics of Hadoop MapReduce, but I don't know how to implement preprocessing algorithms/methods with this framework. For Hadoop MapReduce, I have to define Map() and Reduce(), which use <key, value> pairs as the transmission type from mappers to reducers. But with database tables, how can I handle table entries in <key, value> format? Using the primary key as the key seems pointless. The case is similar for Spark, since I still need to specify a key.
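From what I have understood so far, a map-only Hadoop Streaming job seems like one way to sidestep the key question for per-record work. The column layout and default values in this sketch are made up by me, so please correct me if this is the wrong direction:

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper for a three-column CSV table
# (column layout and defaults are assumed for illustration).
# Run as a map-only job (mapreduce.job.reduces=0), so no <key, value>
# aggregation is involved; each record is cleaned independently.
import sys

DEFAULTS = ["0", "unknown", "0.0"]  # assumed default value per column

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    cleaned = [f if f != "" else DEFAULTS[i] for i, f in enumerate(fields)]
    # Emit the cleaned record; with zero reducers the output goes
    # straight to HDFS, so the key is irrelevant here.
    print(",".join(cleaned))
```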

For example, for each data entry in the database table, some fields may be missing, and I want to fill those fields with default values using some imputation strategy. How can I process the data entries in a <key, value> way? Setting the primary key as the key is pointless here: every <key, value> pair would then have a unique key, so aggregation doesn't help in this case.
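If it matters, here is the kind of thing I imagined in Spark (the column names, sample rows, and defaults are invented for illustration). It already works without any key, which is why I'm confused about where <key, value> fits in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imputation-sketch").getOrCreate()

# Invented sample rows standing in for a real table read from
# JDBC/Parquet/CSV; None marks the missing fields.
rows = [
    (1, "alice", None, "US"),
    (2, "bob",   31,   None),
    (3, None,    27,   "DE"),
]
df = spark.createDataFrame(rows, ["id", "name", "age", "country"])

# Per-record imputation: no key, no shuffle, no aggregation involved.
imputed = df.fillna({"name": "unknown", "age": 0, "country": "unknown"})
imputed.show()
```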


1 Answer


MapReduce is a fairly low-level programming model. You can start with high-level abstractions like Hive and Pig instead.

If you are dealing with structured data, go with Hive, which provides a SQL-like interface and internally converts the SQL into MapReduce jobs.
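For instance, filling missing fields with defaults becomes a simple query. The table and column names below are just placeholders, and the sketch uses Spark SQL to run it, but Hive would accept essentially the same statement and compile it into MapReduce jobs for you:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read Hive tables; a plain Hive CLI or
# beeline session would run the same SQL.
spark = (SparkSession.builder
         .appName("hive-style-imputation")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical "users" table with NULLs in age/country;
# COALESCE substitutes a default for each missing field.
spark.sql("""
    SELECT id,
           name,
           COALESCE(age, 0)             AS age,
           COALESCE(country, 'unknown') AS country
    FROM   users
""").show()
```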

Hope this helps.