With MapReduce we have to write scripts and run jobs just to get at the data. I want to retrieve data as easily as we do with a SQL query. We can use Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper, and Hue for this purpose.

  • But which is best to use here?
  • And do all of these frameworks use MapReduce in the background?

How is this related to facebook? - Thomas Jungblut
Yeah, and now what? Yahoo is using it as well as thousands of other companies. - Thomas Jungblut

1 Answer

As far as data analysis goes, MapReduce is your only native option for querying data in HDFS or any of Hadoop's other supported file systems. That said, tools such as Hive and Pig provide an abstraction on top of Hadoop, letting you write Pig Latin or HiveQL instead of Java. Pig and Hive both compile down to MapReduce jobs.
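To make "compile down to MapReduce" a bit more concrete, here is a rough sketch in plain Python of how a query like SELECT dept, SUM(salary) ... GROUP BY dept could be split into map, shuffle, and reduce steps. The table, column names, and function names are made up for illustration, and this is not how Hive's or Pig's planner is actually implemented; it only shows the general shape of the work.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS.
ROWS = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 80},
    {"dept": "eng", "salary": 120},
]


def map_phase(rows):
    """Map: emit a (group key, value) pair for every input row."""
    for row in rows:
        yield row["dept"], row["salary"]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key (here, a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


def total_salary_by_dept(rows):
    """Chain the three phases, roughly as a GROUP BY query would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(total_salary_by_dept(ROWS))  # {'eng': 220, 'ops': 80}
```

With Hive or Pig you would write only the query; the tool generates and submits the equivalent jobs for you.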

Another alternative is Hadoop Streaming, which lets you write MapReduce jobs in any language that can read from stdin and write to stdout, such as Python, Ruby, or bash.
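For a feel of what a Streaming job looks like, here is a minimal word-count sketch in Python. The file name wordcount.py and the "map"/"reduce" argument are my own conventions, not anything Hadoop requires; Streaming only expects tab-separated key/value lines on stdout and hands the reducer its input sorted by key.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: one file acting as mapper or reducer."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Only run when a role is given, e.g. "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

You can try it locally without a cluster by piping a text file through python3 wordcount.py map, then sort, then python3 wordcount.py reduce. On a cluster the same two commands would be passed to the Hadoop Streaming jar through its -mapper and -reducer options; the jar's location depends on your installation.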

As for which option is better, that's your decision to make. MapReduce in Java will generally be the fastest at run time, because it's native and gives you the controls to fine-tune your jobs. But Hive and Pig are significantly faster to develop with and easier to maintain. Streaming is great for people who don't like or don't know Java but still want more control than Hive and Pig offer, though these days Hive and Pig are pretty mature and very flexible.