I'm very new to Hadoop MapReduce and Spark. For my project, I want to perform data preprocessing with one of these frameworks. I know the basics of Hadoop MapReduce, but I don't know how to implement preprocessing algorithms/methods on top of it. In Hadoop MapReduce, I have to define map() and reduce() functions, and <key, value> pairs are the transmission type from mappers to reducers. But with database tables, how can I handle table entries in <key, value> format? Using the primary key as the key doesn't seem to make sense. The case is similar for Spark, since I still need to specify a key.
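To show what I mean by "the basics", this is the general shape I know from the standard WordCount example (class and variable names are just mine for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountShape {
    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: emits <word, 1> for each token
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: sums the 1s grouped by word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Here the key (the word) is meaningful because the reducer aggregates over it, which is exactly what I don't see how to map onto table rows.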
For example, some entries in a database table may have missing fields, and I want to fill those fields with default values using some kind of imputation strategy. How can I process the entries in a <key, value> fashion? Using the primary key as the key doesn't seem to make sense here either: every <key, value> pair would then have a distinct key, so grouping/aggregation in the reduce phase wouldn't help at all.
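To make this concrete, here is a rough sketch of the kind of map-only job I imagine for the imputation step (assuming each input line is one comma-separated row, a missing field shows up as an empty string, and the default value "N/A" is just a placeholder I made up). I'm not sure whether writing a mapper with no real key like this is the intended way to use the framework:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only imputation: read one CSV row per call, fill empty fields with a
// default, and write the repaired row back out. NullWritable is used as the
// output key only because I don't need any grouping/aggregation here.
public class ImputeDefaultsMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final String DEFAULT_VALUE = "N/A";  // placeholder default
    private final Text outRow = new Text();

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().split(",", -1);  // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].isEmpty()) {
                fields[i] = DEFAULT_VALUE;  // simple constant imputation
            }
        }
        outRow.set(String.join(",", fields));
        context.write(NullWritable.get(), outRow);
    }
}
```

Is this row-by-row, "key doesn't matter" style the right way to think about preprocessing in MapReduce/Spark, or is there a better way to model table data as <key, value> pairs?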