7
votes

I'm new to Hadoop and I'm going to develop an application which processes multiple images using Hadoop and shows users the results live, while the computation is in progress. The basic approach is to distribute an executable and a bunch of images and gather the results.

Can I get results interactively while the computing process is in progress?

Are there any alternatives other than Hadoop Streaming for such a use case?

How can I feed the executable with images? I can't find any examples other than feeding it via stdin.

3
Hadoop Streaming (aka MR) is batch-oriented in nature. You need to look for frameworks which can process data in real time (like Storm/Samza/Spark) and can also support processing binary data. – Praveen Sripati

3 Answers

3
votes

For processing images on Hadoop, the best way to organize the computation would be:

  1. Store the images in a sequence file. Key - image name or its ID, Value - image binary data. This way you will have a single file with all the images you need to process. If images are added to your system dynamically, consider aggregating them into daily sequence files. I don't think you should use any compression for this sequence file, as general compression algorithms do not work well with images (see the first sketch after this list).
  2. Process the images. Here you have a number of options to choose from. The first is to use Hadoop MapReduce and write a program in Java, as with Java you can read the sequence file and directly obtain the "Value" from it on each map step, where the "Value" is the binary file data; given this, you can run any processing logic (see the second sketch after this list). The second option is Hadoop Streaming. It has the limitation that all the data goes to the stdin of your application and the result is read from stdout, but you can overcome this by writing your own InputFormat in Java that serializes the image binary data from the sequence file as a Base64 string and passes it to your generic application. The third option would be to use Spark to process this data, but again you are limited in the choice of programming languages: Scala, Java or Python.
  3. Hadoop was developed to simplify batch processing over large amounts of data. Spark is essentially similar - it is a batch tool. This means you cannot get any result before all the data is processed. Spark Streaming is a slightly different case - there you work with micro-batches of 1-10 seconds and process each of them separately, so in general you can make it work for your case.
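
A minimal sketch of step 1 - packing local image files into a sequence file (Java, Hadoop 2.x API; the output path and the key/value layout are just illustrative choices):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path(args[0]); // e.g. an HDFS path for the resulting sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // no compression, as discussed above
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE))) {
            for (int i = 1; i < args.length; i++) {
                byte[] bytes = Files.readAllBytes(Paths.get(args[i]));
                // key = image file name, value = raw image bytes
                writer.append(new Text(args[i]), new BytesWritable(bytes));
            }
        }
    }
}
```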
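
And a sketch of the first option from step 2 - a Java MapReduce mapper that reads that sequence file and receives the image bytes as the map value (the processImage method is a hypothetical placeholder for your own logic; the job driver would set SequenceFileInputFormat as the input format):

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text imageName, BytesWritable imageData, Context context)
            throws IOException, InterruptedException {
        byte[] bytes = imageData.copyBytes(); // raw binary content of the image
        String result = processImage(bytes);  // plug in any processing logic here
        context.write(imageName, new Text(result));
    }

    // placeholder for the actual image processing (e.g. an OpenCV call)
    private String processImage(byte[] bytes) {
        return "processed " + bytes.length + " bytes";
    }
}
```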

I don't know your complete case, but one possible solution is to use Kafka + Spark Streaming. Your application would put the images in binary format into a Kafka queue, while Spark consumes and processes them in micro-batches on the cluster, updating the users through some third component (at least by putting the image-processing status into Kafka for another application to process).
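
A rough sketch of the consuming side, assuming the spark-streaming-kafka-0-10 integration; the broker address, the topic name "images", and the group id are assumptions, and the per-record processing is left as a stub:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ImageStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("image-stream");
        // 5-second micro-batches, in line with the 1-10 second range mentioned above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");            // assumed broker address
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "image-processors");               // assumed consumer group

        // key = image name/ID, value = raw image bytes
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(
                        Collections.singletonList("images"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreach(record -> {
            byte[] imageBytes = record.value();
            // run your image processing here and publish the status back to Kafka
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```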

But in general, the information you have provided is not complete enough to recommend a good architecture for your specific case.

0
votes

As 0x0FFF says in another answer, the question does not provide enough details to recommend a proper architecture. Though this question is old, I'm adding the research I did on this topic so that it can help anyone in their own research.

Spark is a great way of doing processing on distributed systems, but it doesn't have a strong community working on OpenCV. Storm is another free and open-source distributed real-time computation system from Apache. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

StormCV is an extension of Apache Storm specifically designed to support the development of distributed computer-vision pipelines. StormCV enables the use of Storm for video processing by adding computer-vision (CV) specific operations and a data model. The platform uses OpenCV for most of its CV operations, and it is relatively easy to use this library for other functions.

There are a few examples of using Storm with OpenCV on their official GitHub page. You might want to look at this face detection example and try adapting it to do human detection - https://github.com/sensorstorm/StormCV/blob/master/stormcv-examples/src/nl/tno/stormcv/example/E2_FacedetectionTopology.java.

0
votes

You can actually create your custom logic using the Apache Storm framework. You can easily integrate any functionality of a specific computer-vision library and distribute it across the bolts of this framework. Besides, Storm has a great extension called DRPC server, which allows you to consume your logic as simple RPC calls. You can find a simple example of how to process video files through Storm using OpenCV face detection in my article Consuming OpenCV through Hadoop Storm DRPC Server from .NET.
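
As a rough illustration of the DRPC idea (not the code from the article), here is a minimal local-mode sketch, assuming Storm 1.x package names; the bolt body is a stub where the OpenCV call would go, and the class and function names are made up for the example:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.LocalDRPC;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FaceDetectionDrpcTopology {

    // Bolt that would wrap the computer-vision call; here it just emits a stub result.
    public static class DetectFacesBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            Object requestId = input.getValue(0); // DRPC request id must be passed through
            String imageRef = input.getString(1); // e.g. an HDFS path or a Base64-encoded image
            String result = "faces detected in " + imageRef; // placeholder for real CV logic
            collector.emit(new Values(requestId, result));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("detect-faces");
        builder.addBolt(new DetectFacesBolt(), 2);

        // Local-mode demo of calling the topology as an RPC function.
        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        System.out.println(drpc.execute("detect-faces", "/images/sample.jpg"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```

In production you would submit the topology to a real cluster and call the function through a DRPCClient pointed at your DRPC servers, from any language that can speak the Thrift-based DRPC protocol (which is what the .NET example in the article relies on).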