6
votes

So far I've used either Pig or Java MapReduce exclusively for running jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop streaming, and that was pretty cool as well. All of these make sense to me, but I'm a little hazy on when I would want to use one implementation vs. another. I've used Java MapReduce basically whenever I need speed, but when would I ever want to use something like Python streaming instead of writing the same thing in fewer, more easily understandable lines of Pig/Hive? In short, what are the pros and cons of each?

3
If you downvote and vote to close, why not add a comment and mention why, so that I don't do whatever it is you think I did wrong in the future? - Eli
lucene.472066.n3.nabble.com/… is a relevant thread to this discussion. - Eli

3 Answers

3
votes

I will address Java vs. Python first, and then MR vs. Hive / Pig, since I see them as two different issues.
Hadoop is built around Java: many of its capabilities are available through the Java API, and Hadoop is mostly extended using Java classes.

Hadoop does have the capability to work with MR jobs written in other languages - it is called streaming. This model only lets us define the mapper and reducer, with some restrictions not present in Java. At the same time, input/output formats and other plugins still have to be written as Java classes.
So I would frame the decision as follows: a) Use Java, unless you have a serious existing codebase in another language that you need to reuse in your MR job. b) Consider using Python when you need to create some simple ad hoc jobs.
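To make the streaming model concrete, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of one record per line with a tab between key and value, and that the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line argument are my own illustration, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run as "wordcount.py map" or "wordcount.py reduce".
# Nothing happens when no step is given, so the functions can be imported freely.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The script would be handed to the streaming jar through its `-mapper` and `-reducer` options. Everything around it (input format, output format, partitioner) stays on the Java side, which is the restriction mentioned above.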

Regarding Pig / Hive - these are also Java-centric systems, at a higher level. Hive can be used without any programming at all, but it can be extended using Java. Pig requires Java from the beginning. I think these systems are almost always preferable to MR jobs in cases where they can be applied. Usually these are cases where the processing is SQL-like.

Performance considerations: streaming vs. native Java.
Streaming feeds input to the mapper via its input stream. This is interprocess communication, which is inherently less efficient than the in-process data passing between the record reader and the mapper in the Java case.
I can draw the following conclusions from the above: a) For light processing (like looking for a substring, counting ...) this overhead can be significant, and a Java solution will be more efficient.
b) For heavy processing that can potentially be implemented more efficiently in some non-Java language, a streaming-based solution can have some edge.

Pig / Hive performance considerations.
Pig and Hive both implement the primitives of SQL processing. In other words, they implement the elements of an execution plan from the RDBMS world. These implementations are good and well tuned. At the same time, Hive (the one I know better) is an interpreter. It does not do code generation - it interprets the execution plan within pre-built MR job(s). This means that if you have some complex conditions and write code specifically for them, that code has every chance of doing much better than Hive - the performance advantage of a compiler over an interpreter.
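A toy illustration of that interpreter vs. hand-written distinction, in Python. This is not Hive's actual code; the tuple-based plan format and all names here are made up for the example. The generic evaluator walks an expression tree for every row, while the hand-written function encodes the same condition directly.

```python
def evaluate(node, row):
    """Interpret a tiny expression tree against one row (generic, plan-driven)."""
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    if op == "gt":
        return evaluate(node[1], row) > evaluate(node[2], row)
    if op == "eq":
        return evaluate(node[1], row) == evaluate(node[2], row)
    if op == "and":
        return evaluate(node[1], row) and evaluate(node[2], row)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical plan for: WHERE age > 30 AND country = 'US'
PLAN = ("and",
        ("gt", ("col", "age"), ("lit", 30)),
        ("eq", ("col", "country"), ("lit", "US")))


def handwritten(row):
    """The same condition coded directly, as a custom MR job could do."""
    return row["age"] > 30 and row["country"] == "US"
```

Both give the same answer for any row, but the interpreted version pays for the tree walk and operator dispatch on every record, which is the kind of overhead a purpose-built job avoids.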

2
votes

Regarding Java vs. Pig - I'd use Pig in most cases (along with Java UDFs) for flexibility, and to let someone else (Pig) figure out the best way to split the job into maps, reduces, combiners, etc.

I use Java when I absolutely want to control each and every aspect of the job.

Regarding the use of Python (or other languages), that's something I'd use if the developers are more comfortable with those languages. Note that you can also mix Pig and streaming.

1
votes

There is Scala, where you can write much simpler code for your jobs. For example, check out: https://github.com/NICTA/scoobi

You may have some incentive to use C++ for tasks that are more memory- or CPU-intensive. You can read what Hypertable wrote about their C++ decision: http://code.google.com/p/hypertable/wiki/WhyWeChoseCppOverJava

Java is also problematic on the serialization side, as it creates an object for every object it reads from an input stream. Be careful not to use Java serialization just because you have a Java implementation.