7
votes

I'm new to Hadoop and I'm going to develop an application which processes multiple images using Hadoop and shows users the results live, while the computation is in progress. The basic approach is to distribute an executable and a bunch of images and gather the results.

Can I get results interactively while the computing process is in progress?

Are there any alternatives other than Hadoop Streaming for such a use case?

How can I feed the executable with images? I can't find any examples other than feeding it via stdin.

3
Hadoop Streaming (aka MR) is batch-oriented in nature. You need to look for frameworks which can process data in real time (like Storm/Samza/Spark) and can also support processing binary data. – Praveen Sripati

3 Answers

3
votes

For processing images on Hadoop, the best way to organize the computation would be:

  1. Store the images in a sequence file. Key - image name or its ID, Value - image binary data. This way you will have a single file with all the images you need to process. If images are added to your system dynamically, consider aggregating them into daily sequence files. I don't think you should use any compression for this sequence file, as general compression algorithms do not work well with images (see the first sketch after this list).
  2. Process the images. Here you have a number of options to choose from. The first is to use Hadoop MapReduce and write a program in Java, as with Java you can read the sequence file and directly obtain the "Value" from it on each map step, where the "Value" is the binary file data; given this, you can run any processing logic (see the second sketch after this list). The second option is Hadoop Streaming. It has the limitation that all the data goes to the stdin of your application and the result is read from stdout, but you can overcome this by writing your own InputFormat in Java that serializes the image binary data from the sequence file as a Base64 string and passes it to your generic application. The third option would be to use Spark to process this data, but again you are limited in the choice of programming languages: Scala, Java or Python.
  3. Hadoop was developed to simplify batch processing over large amounts of data. Spark is essentially similar - it is a batch tool. This means you cannot get any result before all the data is processed. Spark Streaming is a slightly different case - there you work with micro-batches of 1-10 seconds and process each of them separately, so in general you can make it work for your case.
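
A minimal sketch of step 1 - packing local image files into a sequence file (Java, Hadoop 2.x API; the output path and the key/value layout are just illustrative choices):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path(args[0]); // e.g. an HDFS path for the resulting sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // no compression, as discussed above
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE))) {
            for (int i = 1; i < args.length; i++) {
                byte[] bytes = Files.readAllBytes(Paths.get(args[i]));
                // key = image file name, value = raw image bytes
                writer.append(new Text(args[i]), new BytesWritable(bytes));
            }
        }
    }
}
```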
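
And a sketch of the first option from step 2 - a Java MapReduce mapper that reads that sequence file and receives the image bytes as the map value (the processImage method is a hypothetical placeholder for your own logic; the job driver would set SequenceFileInputFormat as the input format):

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text imageName, BytesWritable imageData, Context context)
            throws IOException, InterruptedException {
        byte[] bytes = imageData.copyBytes(); // raw binary content of the image
        String result = processImage(bytes);  // plug in any processing logic here
        context.write(imageName, new Text(result));
    }

    // placeholder for the actual image processing (e.g. an OpenCV call)
    private String processImage(byte[] bytes) {
        return "processed " + bytes.length + " bytes";
    }
}
```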

I don't know your complete case, but one possible solution is to use Kafka + Spark Streaming. Your application would put the images in binary format into a Kafka queue, while Spark consumes and processes them in micro-batches on the cluster, updating the users through some third component (at least by putting the image-processing status into Kafka for another application to process).
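
A rough sketch of the consuming side, assuming the spark-streaming-kafka-0-10 integration; the broker address, the topic name "images", and the group id are assumptions, and the per-record processing is left as a stub:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ImageStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("image-stream");
        // 5-second micro-batches, in line with the 1-10 second range mentioned above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");            // assumed broker address
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "image-processors");               // assumed consumer group

        // key = image name/ID, value = raw image bytes
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(
                        Collections.singletonList("images"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreach(record -> {
            byte[] imageBytes = record.value();
            // run your image processing here and publish the status back to Kafka
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```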

But in general, the information you have provided is not complete enough to recommend a good architecture for your specific case.

0
votes

As 0x0FFF says in another answer, the question does not provide enough details to recommend a proper architecture. Though this question is old, I'm adding the research I did on this topic so that it can help anyone in their own research.

Spark is a great way of doing processing on distributed systems, but it doesn't have a strong community working on OpenCV. Storm is another free and open-source distributed real-time computation system from Apache. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

StormCV is an extension of Apache Storm specifically designed to support the development of distributed computer-vision pipelines. StormCV enables the use of Storm for video processing by adding computer-vision (CV) specific operations and a data model. The platform uses OpenCV for most of its CV operations, and it is relatively easy to use this library for other functions.

There are a few examples of using Storm with OpenCV on their official GitHub page. You might want to look at this face detection example and try adapting it to do human detection - https://github.com/sensorstorm/StormCV/blob/master/stormcv-examples/src/nl/tno/stormcv/example/E2_FacedetectionTopology.java.

0
votes

You can actually create your custom logic using the Apache Storm framework. You can easily integrate any functionality of a specific computer-vision library and distribute it across the bolts of this framework. Besides, Storm has a great extension called DRPC server, which allows you to consume your logic as simple RPC calls. You can find a simple example of how to process video files through Storm using OpenCV face detection in my article Consuming OpenCV through Hadoop Storm DRPC Server from .NET.
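
As a rough illustration of the DRPC idea (not the code from the article), here is a minimal local-mode sketch, assuming Storm 1.x package names; the bolt body is a stub where the OpenCV call would go, and the class and function names are made up for the example:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.LocalDRPC;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FaceDetectionDrpcTopology {

    // Bolt that would wrap the computer-vision call; here it just emits a stub result.
    public static class DetectFacesBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            Object requestId = input.getValue(0); // DRPC request id must be passed through
            String imageRef = input.getString(1); // e.g. an HDFS path or a Base64-encoded image
            String result = "faces detected in " + imageRef; // placeholder for real CV logic
            collector.emit(new Values(requestId, result));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("detect-faces");
        builder.addBolt(new DetectFacesBolt(), 2);

        // Local-mode demo of calling the topology as an RPC function.
        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        System.out.println(drpc.execute("detect-faces", "/images/sample.jpg"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```

In production you would submit the topology to a real cluster and call the function through a DRPCClient pointed at your DRPC servers, from any language that can speak the Thrift-based DRPC protocol (which is what the .NET example in the article relies on).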