
I am new to Hadoop and Pig.

I have set up a Hadoop cluster with 3 nodes. I have written a Pig script which reads data and runs aggregate functions on it.

When I execute it on a 4.8 GB file with 36 million records, Pig produces output in 51 minutes.

When I execute it on a 9.6 GB file with 72 million records, the Pig script crashes and Hadoop gives the following errors:

  • Unable to recreate exception from backed error: AttemptID:attempt_1389348682901_0050_m_000005_3 Info:Container killed by the ApplicationMaster.
  • Job failed, Hadoop does not return any error message

I am using Hadoop 2.2.0 and Pig 0.12.0.

My node configuration is:

Master: 2 CPU, 2 GB RAM
Slave1: 2 CPU, 2 GB RAM
Slave2: 1 CPU, 2 GB RAM
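
For reference, "Container killed by the ApplicationMaster" often means a container exceeded its YARN memory allowance, which is plausible on 2 GB nodes once the input doubles. If that is the cause here, these are the Hadoop 2.2 settings involved; a hedged sketch with illustrative values sized for 2 GB nodes (assuming mapred-site.xml; not the poster's actual configuration):

    <!-- mapred-site.xml: illustrative values, sized for 2 GB nodes -->
    <property>
      <name>mapreduce.map.memory.mb</name>   <!-- YARN container size for map tasks -->
      <value>768</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>   <!-- JVM heap, kept below the container size -->
      <value>-Xmx614m</value>
    </property>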

Could you please advise me on this?

Is it possible there is some logical error in your Pig script? See stackoverflow.com/questions/12874975/… – Jakub Kotowski
The same script runs successfully for the 4.8 GB file with 36 million records. What I observed is that the script fails during the LOAD operation; the nodes are not able to process/LOAD the 9.6 GB file. Can we make the LOAD operation parallel? – Bhagwant
Ah, right, it worked once. If the file is in a splittable format then you don't need to worry about it being big. Maybe LOAD could fail due to some syntax errors in the input file, although normally it should just skip a broken record or put null in its place. It's difficult to guess without seeing the logs, your data, and your script. – Jakub Kotowski
Can I use Hive instead of Pig? Will Hive help me do parallel processing? I am doing GROUP BY, SUM, and AVG kinds of operations on my data (see the sketch after these comments); I can show my script. – Bhagwant
You could surely try Hive, but without knowing why this problem appears for you in Pig there's no way of saying whether it will appear with Hive too. Pig should be fine for computing these kinds of aggregates. – Jakub Kotowski
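
For reference, a minimal Pig Latin sketch of the kind of script described; the input path, delimiter, field names, and grouping key are hypothetical, not the poster's actual script:

    -- load the file from HDFS (hypothetical tab-delimited schema)
    records = LOAD '/data/input.txt' USING PigStorage('\t')
              AS (id:chararray, amount:double);
    -- group by key and compute the aggregates mentioned above
    grouped = GROUP records BY id;
    aggs = FOREACH grouped GENERATE group,
           SUM(records.amount) AS total,
           AVG(records.amount) AS average;
    STORE aggs INTO '/data/output';

With a plain text file like this, LOAD already runs in parallel: HDFS splits the file into blocks and Pig assigns one map task per split, so file size alone should not prevent loading.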

1 Answer


After trying things with Pig, I moved to Hive.

What I observed when I was using Pig:

I was uploading the file into HDFS and then loading it in Pig, so Pig was loading that file again: I was processing the file twice.

For my scenario, Hive fits. I upload the file into HDFS and load it into Hive, which takes a few milliseconds, because Hive works seamlessly with HDFS files: there is no need to load the data again into Hive tables. That saves a lot of time.
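
For reference, a hedged HiveQL sketch of this workflow; the table name, column names, and paths are hypothetical. An external table simply maps onto the directory already in HDFS, so no data is copied:

    -- map a table onto the files already in HDFS (no data copy)
    CREATE EXTERNAL TABLE records (id STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/input';

    -- the same GROUP BY / SUM / AVG aggregation
    SELECT id, SUM(amount) AS total, AVG(amount) AS average
    FROM records
    GROUP BY id;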

Both components are good; for my use case, Hive fits better.

Thanks all for your time and advice.