I am following the Python guide for the Beam Spark runner. My Beam pipeline can submit a job to a local job server launched with ./gradlew :runners:spark:job-server:runShadow, which uses a local Spark, and, with the additional parameter -PsparkMasterUrl=spark://localhost:7077, to a pre-deployed Spark.

However, I have a Spark cluster on YARN. I set the launch command to ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (I also tried yarn-client), but I only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn'

The source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows:

parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT). '
                         'Use "local" (single-threaded) or "local[*]" '
                         '(multi-threaded) to start a local cluster for '
                         'the execution.')
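To see what this argument does by itself, here is a minimal sketch using only the standard library. The parser below is a hypothetical stand-in that reproduces just this one argument, not the runner's real parser. It suggests that argparse accepts any string, so the rejection of 'yarn' (an org.apache.spark.SparkException) presumably comes from Spark rather than from this option definition.

```python
import argparse

# Hypothetical stand-in parser: only the single argument quoted above is
# reproduced; the real runner module defines more than this.
parser = argparse.ArgumentParser()
parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT).')

# With no flag given, the default local master is used.
default_args = parser.parse_args([])
print(default_args.spark_master_url)

# argparse does no validation here, so 'yarn' is stored as a plain string.
yarn_args = parser.parse_args(['--spark_master_url', 'yarn'])
print(yarn_args.spark_master_url)
```

So the help text not mentioning 'yarn' does not by itself mean the value is filtered out at this level.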

It doesn't mention 'yarn', and a provided SparkContext and StreamingListeners are not supported on the Spark portable runner. So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (mostly YARN) and can only be tested locally? Or can I set job_endpoint to the URL of a remote job server on my Spark cluster?
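For reference, the job_endpoint idea could be sketched roughly as follows. This only builds the option list. The helper name portable_runner_args and the host jobserver.example.com are hypothetical, and whether a remote job server actually works against a YARN cluster is not something this sketch establishes.

```python
def portable_runner_args(job_endpoint, environment_type="LOOPBACK"):
    """Build command-line style options for Beam's PortableRunner.

    Hypothetical helper for illustration; it only assembles strings.
    """
    return [
        "--runner=PortableRunner",
        "--job_endpoint=" + job_endpoint,
        "--environment_type=" + environment_type,
    ]


# Local job server, as started by the gradle command (port 8099 in the log).
local_args = portable_runner_args("localhost:8099")

# Hypothetical remote job server; LOOPBACK runs workers in the submitting
# process, so a remote setup would likely need another environment type.
remote_args = portable_runner_args("jobserver.example.com:8099", "DOCKER")

print(local_args)
print(remote_args)
```

A list like this can be passed to apache_beam.options.pipeline_options.PipelineOptions when constructing the pipeline.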

Also, every ./gradlew command blocks at 98%, but the job server starts with output like this:

19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE

1 Answer

So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (mostly YARN)

We've recently added portable Spark jars, which can be submitted via spark-submit. This feature isn't scheduled to be included in a Beam release until 2.19.0, however.

I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.

Also, every ./gradlew command blocks at 98%

That's expected behavior. The job server will stay running until canceled.
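Since the gradle progress bar never reaches 100% while the server is running, one way to confirm the job server is actually reachable is to try a TCP connection to its port. This is a minimal sketch assuming the localhost:8099 address from the log above. The function name job_server_is_up is made up for illustration, and a successful connection only shows that something is listening, not that it is a healthy Beam job service.

```python
import socket


def job_server_is_up(host="localhost", port=8099, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is accepting connections at the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumes the default address printed by JobServerDriver in the log.
    print(job_server_is_up())
```

Run it from a second terminal while the ./gradlew command is still sitting at 98%.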