I have a Hive query with a custom mapper and reducer written in Python. The mapper and reducer modules depend on some third-party modules/packages that are not installed on my cluster (installing them on the cluster is not an option). I only discovered the problem when I ran the Hive query and it failed with an error saying that the xyz module was not found.
How do I package the whole thing so that all the dependencies (including transitive dependencies) are available to my streaming job? And once it is packaged, how do I import those modules in my mapper and reducer?
The question is rather naive, but I could not find an answer even after an hour of googling. Also, it is not specific to Hive: it applies to Hadoop streaming jobs in general whenever the mapper/reducer is written in Python.
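To make the question concrete, here is a rough sketch of what I imagine the mapper side might look like. It assumes the dependencies are pure Python and have been zipped into a single archive that is shipped alongside the script. The archive name `deps.zip`, the module `xyz_demo`, and the helper functions are all made up for illustration, and the demo builds a tiny stand-in archive itself so it can run anywhere. Python can import from a zip file placed on `sys.path`, but compiled extension modules cannot be loaded this way, so this would not cover every package.

```python
import os
import sys
import tempfile
import zipfile


def add_archive_to_path(archive_path):
    """Put a zip archive of pure-Python modules at the front of sys.path."""
    archive_path = os.path.abspath(archive_path)
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    return archive_path


def build_demo_archive(directory):
    """Create a tiny stand-in for deps.zip (hypothetical name) with one module."""
    archive_path = os.path.join(directory, "deps.zip")
    with zipfile.ZipFile(archive_path, "w") as zf:
        # xyz_demo is a made-up module standing in for a real third-party package.
        zf.writestr("xyz_demo.py", "def shout(s):\n    return s.upper()\n")
    return archive_path


def map_line(line):
    """Example mapper step that uses a module imported from the archive."""
    import xyz_demo  # resolved from the zip via Python's zipimport machinery
    return xyz_demo.shout(line.rstrip("\n"))
```

In a real job the archive would presumably be built ahead of time and shipped with the script (for example with Hive's `ADD FILE` or the streaming job's file-shipping option), and the mapper would call something like `add_archive_to_path("deps.zip")` before its imports. Whether that is the right approach is exactly what I am asking.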