
I have set up Spark in standalone mode (a single node on my laptop) and tried to integrate OpenCV to read a set of images from a directory and detect faces in each image. I am trying to understand how the native dependencies are shipped to the executor JVM. I would have thought that in the program below, the System.loadLibrary call would execute only as part of the driver JVM, and the executor JVM would fail when the anonymous function tries to find the native library. But contrary to my understanding, the program works fine. Can someone explain how this works and what part of the code is shipped from the driver to the executors?

 import java.io.File;
 import java.util.List;

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.input.PortableDataStream;
 import org.opencv.core.Core;
 import org.opencv.core.Mat;
 import org.opencv.core.MatOfRect;
 import org.opencv.core.Point;
 import org.opencv.core.Rect;
 import org.opencv.core.Scalar;
 import org.opencv.imgcodecs.Imgcodecs;
 import org.opencv.imgproc.Imgproc;
 import org.opencv.objdetect.CascadeClassifier;
 import scala.Tuple2;

 public static void main( String[] args )
 {
    SparkConf conf = new SparkConf().setMaster("spark://localhost:7077").setAppName("Image detect App");
    JavaSparkContext sc = new JavaSparkContext(conf);
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

    CascadeClassifier faceDetector = new CascadeClassifier("/home/xx/Project/opencv-3.1.0/data/haarcascades_cuda/haarcascade_frontalface_alt.xml");

    File tempDir = new File("/home/xx/images/new");
    String tempDirName = tempDir.getAbsolutePath();

    JavaPairRDD<String, PortableDataStream> readRDD = sc.binaryFiles(tempDirName,3);
    List<Tuple2<String, PortableDataStream>> result = readRDD.collect();
    for (Tuple2<String, PortableDataStream> res : result) 
    {
        Mat image = Imgcodecs.imread(res._1().replace("file:", ""));

        MatOfRect faceDetections = new MatOfRect();
        faceDetector.detectMultiScale(image, faceDetections);

        for (Rect rect : faceDetections.toArray()) {
            Imgproc.rectangle(image, new Point(rect.x, rect.y), new Point(rect.x + rect.width, rect.y + rect.height),
                    new Scalar(0, 255, 0));
        }
        String filename = res._1().replace("file:","") + "_out";
        Imgcodecs.imwrite(filename, image);
    } 

 }

I have created a jar with the above program and ran the following spark-submit command; it works fine as expected.

./bin/spark-submit --verbose --master spark://localhost:7077 --num-executors 2 --class com.xxx.MainSparkImage --jars /home/xx/Project/opencv-3.1.0/release/bin/opencv-310.jar --driver-library-path /home/xx/Project/opencv-3.1.0/release/lib /home/xx/ImageProcess.jar

Thanks, Srivatsan

Have you tried setting spark.driver.extraClassPath and spark.executor.extraClassPath during spark-submit? - Ram Ghadiyaram
Will try spark.executor.extraClassPath; going by its name, it looks like that should be the one to ship the dependencies to the executors. Thanks a lot. - Srivatsan Nallazhagappan
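For native (non-jar) libraries, the relevant setting is spark.executor.extraLibraryPath rather than extraClassPath; a hedged sketch of how the original spark-submit command could be extended (the paths are taken from the question and are assumptions about your install):

```shell
# Sketch: point the executors at the OpenCV native libraries too,
# not just the driver (spark.executor.extraLibraryPath prepends to the
# executor JVM's library search path; adjust paths to your install).
./bin/spark-submit --verbose \
  --master spark://localhost:7077 \
  --class com.xxx.MainSparkImage \
  --jars /home/xx/Project/opencv-3.1.0/release/bin/opencv-310.jar \
  --driver-library-path /home/xx/Project/opencv-3.1.0/release/lib \
  --conf spark.executor.extraLibraryPath=/home/xx/Project/opencv-3.1.0/release/lib \
  /home/xx/ImageProcess.jar
```

This only makes the library findable on the executors; each executor JVM would still need its own System.loadLibrary call inside the closure before using OpenCV there.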

1 Answer

List<Tuple2<String, PortableDataStream>> result = readRDD.collect(); 

This line collects the RDD back to the driver as a local Java collection. Everything after it (the for loop, the face detection, the image writes) executes locally inside the driver JVM, which is why you don't see any errors about missing native libraries on the executors. Only the functions you pass into RDD transformations and actions (e.g. a map or foreach closure) are serialized and shipped to the executors; this program passes no such closure that touches OpenCV, so the executors never need the native library at all.
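The "what gets shipped" part can be seen with plain Java serialization, which is the same basic mechanism Spark's Java API uses to send closures to executors: only the serialized lambda travels, not the rest of main. A minimal, self-contained sketch (no Spark or OpenCV involved; the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureShippingDemo {

    // Spark's Java API requires closures to be Serializable for exactly this reason.
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // What the driver does: turn the closure into bytes to send over the wire.
    static byte[] shipClosure(SerFunction<Integer, Integer> f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(f);
        }
        return bos.toByteArray();
    }

    // What an executor does: rebuild the closure from bytes and run it.
    @SuppressWarnings("unchecked")
    static int runOnExecutor(byte[] closure, int input) throws Exception {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(closure))) {
            SerFunction<Integer, Integer> f = (SerFunction<Integer, Integer>) ois.readObject();
            return f.apply(input);
        }
    }

    public static void main(String[] args) throws Exception {
        // Statements here in main (like System.loadLibrary) run only on the
        // "driver" side; they are not part of the serialized closure below.
        SerFunction<Integer, Integer> doubler = x -> x * 2;
        byte[] shipped = shipClosure(doubler);
        System.out.println(runOnExecutor(shipped, 21)); // prints 42
    }
}
```

So if the face detection were moved inside a map over readRDD, the closure would run on executors, and the System.loadLibrary call would have to happen inside that closure (once per executor JVM), with the native library present on every worker machine.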