I am processing a file with 7+ million lines (~59 MB) on an Ubuntu 11.04 machine with this configuration:
CPU: Intel(R) Core(TM)2 Duo E8135 @ 2.66 GHz (2280 MHz)
Memory: 2 GB
Disk: 100 GB
Even after running for 45 minutes, I saw no progress. This is the console output so far:
    Deleted hdfs://localhost:9000/user/hadoop_admin/output
    packageJobJar: [/home/hadoop_admin/Documents/NLP/Dictionary/dict/drugs.csv, /usr/local/hadoop/mapper.py, /usr/local/hadoop/reducer.py, /tmp/hadoop-hadoop_admin/hadoop-unjar8773176795802479000/] [] /tmp/streamjob582836411271840475.jar tmpDir=null
    11/07/22 10:39:20 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/07/22 10:39:21 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop_admin/mapred/local]
    11/07/22 10:39:21 INFO streaming.StreamJob: Running job: job_201107181559_0099
    11/07/22 10:39:21 INFO streaming.StreamJob: To kill this job, run:
    11/07/22 10:39:21 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201107181559_0099
    11/07/22 10:39:21 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201107181559_0099
    11/07/22 10:39:22 INFO streaming.StreamJob: map 0% reduce 0%
What is the maximum file size that can be processed using Hadoop in pseudo-distributed mode?
Update:
I am running a simple wordcount application with Hadoop Streaming. My mapper.py and reducer.py take around 50 seconds to process a file with 220K lines (~19 MB).
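For context, the mapper and reducer follow the usual Hadoop Streaming wordcount pattern, roughly like the sketch below. This is a simplified illustration of the streaming contract (read lines from stdin, emit tab-separated key/value pairs on stdout), not my exact scripts; the drugs.csv dictionary that is shipped with the job is not shown here.

mapper.py:

    #!/usr/bin/env python
    # Read raw input lines from stdin and emit "word<TAB>1" for each token.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

reducer.py (relies on the framework sorting the mapper output by key before the reduce phase):

    #!/usr/bin/env python
    # Sum the counts for each word; input arrives grouped by key because
    # Hadoop sorts the mapper output before it reaches the reducer.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = count

    # Emit the last key once the input is exhausted.
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Such a pair can also be sanity-checked outside Hadoop with a pipeline like `cat input.txt | python mapper.py | sort | python reducer.py`.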