14
votes

I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel". These generally involve running a single script on a single computer for several days; a classic example would be processing several thousand PDF files to extract some key text and place it into a CSV file for later insertion into a database.

We are now doing enough of these types of tasks that I started investigating a simple job queue system using RabbitMQ and a few spare servers (with an eye to using Amazon SQS/S3/EC2 for projects that need larger scaling).

In searching for examples of others doing this, I keep coming across the classic Hadoop New York Times example:

The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)

Which sounds perfect, so I researched Hadoop and Map/Reduce.

But what I can't work out is how they did it, or why they did it.

Surely converting TIFFs into PDFs is not a Map/Reduce problem? Wouldn't a simple job queue have been better?

The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop Tutorial, seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.

What I don't understand is how these "embarrassingly parallel" tasks are put into the Map/Reduce pattern.

TL;DR

This is very much a conceptual question: basically, how would I fit a task of "processing several thousand PDF files to extract some key text and place into a CSV file" into a Map/Reduce pattern?

If you know of any examples, that would be perfect; I'm not asking you to write it for me.

(Note: we already have code to process the PDFs, so I'm not asking for that - it's just an example, and it could be any task. I'm asking about putting processes like that into the Hadoop Map/Reduce pattern when there are no clear "Map" or "Reduce" elements to the task.)

Cheers!

2
... Each node would process n PDF files and output whatever. I don't know that it's worth using Hadoop, since it's simple enough to just throw messages around - likely only if you already have clusters up and running. Hadoop can be used for fairly arbitrary tasks, but that doesn't mean it should be. - Dave Newton
I will add to the answers that said Hadoop does parallel processing and failover: so do many existing job queues (like RabbitMQ) when you use them in a pull/worker setup. Any job that fails will be redelivered again and again until a worker acknowledges to the broker that it has been done. Parallel processing and the JobTracker cannot be the only reasons to use Hadoop. Installing, configuring and maintaining Hadoop clusters is really time-consuming, not to mention that you need professionals to do it. See first whether what you want fits a distributed setup using job queues, then consider Hadoop. - Maziyar

2 Answers

5
votes

Your thinking is right.

The examples that you mentioned above used only part of the solution that Hadoop offers. They definitely used the parallel computing ability of Hadoop plus the distributed file system. You won't always need a reduce step: you may not have any data interdependency between the parallel processes that are run, in which case you can eliminate the reduce step.

I think your problem will also fit into the Hadoop solution domain.

You have huge data (a huge number of PDF files) and a long-running job.

You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes you have in your cluster. If you do not see the need to aggregate the data sets produced by the individual map tasks, you do not need a reduce step; otherwise you need to design a reduce step as well. See the map-only sketch below.

The thing here is that if you do not need a reduce step, you are just leveraging the parallel computing ability of Hadoop, and you are equipped to run your jobs on not-so-expensive hardware.
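
As a rough sketch (not the NYT's actual code), a map-only job for your PDF-to-CSV task could look like the following. The input is assumed to be a plain text file listing PDF paths, one per line; extractCsvLine() is a hypothetical stand-in for your existing PDF-processing code, and setNumReduceTasks(0) is what removes the reduce step.

    // Minimal sketch of a map-only Hadoop job for PDF-to-CSV extraction.
    // The class/method names for the PDF step are hypothetical placeholders.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PdfExtractJob {

        public static class PdfMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each input record is one line of the list file: a PDF path.
                String pdfPath = value.toString().trim();
                // extractCsvLine() stands in for your existing PDF-processing code.
                String csvLine = extractCsvLine(pdfPath);
                context.write(NullWritable.get(), new Text(csvLine));
            }

            private String extractCsvLine(String pdfPath) {
                // Hypothetical placeholder: open the PDF (e.g. from HDFS or S3),
                // pull out the key text, and format it as a CSV row.
                return pdfPath + ",extracted,fields";
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pdf-to-csv");
            job.setJarByClass(PdfExtractJob.class);

            job.setMapperClass(PdfMapper.class);
            job.setNumReduceTasks(0);               // map-only: no reduce step

            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 10);  // 10 PDFs per map task

            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            NLineInputFormat.addInputPath(job, new Path(args[0]));   // list of PDF paths
            TextOutputFormat.setOutputPath(job, new Path(args[1]));  // CSV output dir

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

NLineInputFormat is used so that each map task gets a fixed number of paths rather than a byte-sized split of the list file; with zero reducers, each mapper writes its CSV lines directly to the output directory on HDFS, and you can concatenate the part files afterwards.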

1
vote

I need to add one more thing: error handling and retry. In a distributed environment, node failure is pretty common. I regularly run EMR clusters consisting of several hundred nodes at a time for 3-8 days, and it is very likely that 3 or 4 nodes will fail during that period. The Hadoop JobTracker will nicely re-submit failed tasks (up to a certain number of times) on a different node.
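
That retry limit is configurable. A minimal sketch, assuming the Hadoop 2.x property names (mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, which default to 4), placed in the driver's main() before the job is submitted:

    // Sketch: raising the per-task retry limit before submitting the job.
    // Property names assume Hadoop 2.x; both default to 4 attempts.
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 8);
    conf.setInt("mapreduce.reduce.maxattempts", 8); // only matters if you have a reduce step
    Job job = Job.getInstance(conf, "pdf-to-csv");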