0 votes

I have created an HDInsight cluster (v4, Spark 2.4) in Azure and want to run a Spark.NET app on this cluster through an Azure Data Factory v2 activity. In the Spark Activity it is possible to specify the path to the jar, the --class parameter, and arguments to pass to the Spark app. The arguments are automatically prefixed with "-args" when the activity runs. However, being able to set "--files" is necessary, because it tells spark-submit which files need to be deployed to the worker nodes; in this case that means DLLs containing the UDF definitions, which the Spark app needs in order to run. Since UDFs are a key component of Spark apps, I would have thought this should be possible.

Spark Activity setup

If I SSH into the cluster and run the spark-submit command directly, specifying the --files parameter, the Spark app works because the files are distributed to the worker nodes.

spark-submit --deploy-mode cluster --master yarn --files wasbs://[email protected]/SparkJobs/mySparkApp.dll --class org.apache.spark.deploy.dotnet.DotnetRunner wasbs://[email protected]/SparkJobs/microsoft-spark-2.4.x-0.12.1.jar wasbs://[email protected]/SparkJobs/publish.zip mySparkApp

These are the guides I have followed:

  1. https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries
  2. https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/hdinsight-deploy-methods
  3. https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment
I don't know if there is an actual answer; maybe you need to raise it with Microsoft support, as it sounds like ADF is at fault. What I would say is that I have found a few things ADF couldn't do, and I ended up writing an Azure Function and calling that from ADF. – Ed Elliott
Thanks Ed. I am now using the Livy REST API directly against the HDInsight cluster to execute the Spark job. Ironically, Livy is what the ADF Spark Activity uses under the hood, but when I call Livy directly I can specify the --files parameter. – snapperhead1234
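
For reference, a Livy batch request along those lines might look roughly like the JSON below, POSTed to https://<clustername>.azurehdinsight.net/livy/batches with Content-Type: application/json. The jar, class, and file names are taken from the spark-submit call above; the cluster name and the <container>/<account> parts of the storage paths are placeholders, not the actual values.

{
    "file": "wasbs://<container>@<account>.blob.core.windows.net/SparkJobs/microsoft-spark-2.4.x-0.12.1.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "files": [
        "wasbs://<container>@<account>.blob.core.windows.net/SparkJobs/mySparkApp.dll"
    ],
    "args": [
        "wasbs://<container>@<account>.blob.core.windows.net/SparkJobs/publish.zip",
        "mySparkApp"
    ]
}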

2 Answers

0 votes

You can pass arguments/parameters to a PySpark script in Azure Data Factory as shown below:

Code:

{
    "name": "SparkActivity",
    "properties": {
        "activities": [
            {
                "name": "Spark1",
                "type": "HDInsightSpark",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "rootPath": "adftutorial/spark/script",
                    "entryFilePath": "WordCount_Spark.py",
                    "arguments": [
                        "--input-file",
                        "wasb://[email protected]/data",
                        "--output-file",
                        "wasb://[email protected]/results"
                    ],
                    "sparkJobLinkedService": {
                        "referenceName": "AzureBlobStorage1",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "HDInsight",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

How to pass arguments in ADF:

Here is another example of passing parameters in Azure Data Factory:

{
    "name": "SparkSubmit",
    "properties": {
        "description": "Submit a spark job",
        "activities": [
            {
                "type": "HDInsightMapReduce",
                "typeProperties": {
                    "className": "com.adf.spark.SparkJob",
                    "jarFilePath": "libs/spark-adf-job-bin.jar",
                    "jarLinkedService": "StorageLinkedService",
                    "arguments": [
                        "--jarFile",
                        "libs/sparkdemoapp_2.10-1.0.jar",
                        "--jars",
                        "/usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
                        "--mainClass",
                        "com.adf.spark.demo.Demo",
                        "--master",
                        "yarn-cluster",
                        "--driverMemory",
                        "2g",
                        "--driverExtraClasspath",
                        "/usr/lib/hdinsight-logging/*",
                        "--executorCores",
                        "1",
                        "--executorMemory",
                        "4g",
                        "--sparkHome",
                        "/usr/hdp/current/spark-client",
                        "--connectionString",
                        "DefaultEndpointsProtocol=https;AccountName=<YOUR_ACCOUNT>;AccountKey=<YOUR_KEY>",
                        "input=wasb://input@<YOUR_ACCOUNT>.blob.core.windows.net/data",
                        "output=wasb://output@<YOUR_ACCOUNT>.blob.core.windows.net/results"
                    ]
                },
                "inputs": [
                    {
                        "name": "input"
                    }
                ],
                "outputs": [
                    {
                        "name": "output"
                    }
                ],
                "policy": {
                    "executionPriorityOrder": "OldestFirst",
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Spark Launcher",
                "description": "Submits a Spark Job",
                "linkedServiceName": "HDInsightLinkedService"
            }
        ],
        "start": "2015-11-16T00:00:01Z",
        "end": "2015-11-16T23:59:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
0 votes

Have you tried uploading those files to the files folder under the root path of your sparkJobLinkedService storage? According to https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-spark, "All files under files folder are uploaded and placed on executor working directory", so I uploaded publish.zip into the files folder and my Spark.NET job seems to be working after that.

For example, for the ADF Spark activity, the microsoft-spark-2-4_2.11-1.0.0.jar was stored under /binary/spark/ in my storage account, as shown below:

spark activity

And publish.zip was uploaded to the /binary/spark/files/ storage folder:

location of publish.zip
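
For illustration, a Spark activity definition matching that layout might look roughly like the sketch below. This is only a sketch: the rootPath and jar name come from the screenshots above, while the className, arguments, and linked service name are assumed placeholders (the arguments would be whatever DotnetRunner expects for the Spark.NET job, e.g. the published app archive and the app name).

{
    "typeProperties": {
        "rootPath": "binary/spark",
        "entryFilePath": "microsoft-spark-2-4_2.11-1.0.0.jar",
        "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
        "arguments": [
            "publish.zip",
            "mySparkApp"
        ],
        "sparkJobLinkedService": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        }
    }
}

With this layout, publish.zip sitting under binary/spark/files/ is uploaded to each executor's working directory by the activity, which serves much the same purpose as passing it via --files in the spark-submit call from the question.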