1
votes

I'm trying to run Spark-Wiki-Parser on a GCP Dataproc cluster. The code takes two arguments, "dumpfile" and "destloc". When I submit the following job, I get this error: [scallop] Error: Excess arguments provided: 'gs://enwiki-latest-pages-articles.xml.bz2 gs://output_dir/'.

gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- 'gs://enwiki-latest-pages-articles.xml.bz2' 'gs://output_dir/'

How do I get the code to recognize the input arguments?

2
Could you share the code you use to parse the input arguments? - Dennis Huo

2 Answers

1
votes

I spent probably 8 hours figuring this out, but figured I'd dump the solution here since it had not been shared yet.

The gcloud CLI separates the Dataproc parameters from the class arguments with --, as another user noted. However, Scallop also requires a -- prefix on each named argument, so the job arguments need to be passed as --dumpfile and --destloc. Your command should look something like this:

gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- --dumpfile 'gs://enwiki-latest-pages-articles.xml.bz2' --destloc 'gs://output_dir/'
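
If it helps to see why the bare paths were rejected, here is a rough stand-in written in plain bash. It is not Scallop and not the Spark-Wiki-Parser code (I have not checked how that project declares its options); parse_args is a hypothetical function that, by assumption, accepts only the named options --dumpfile and --destloc and treats anything else as an excess argument.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a parser that declares only the named options
# --dumpfile and --destloc. This is NOT Scallop, just an illustration.
parse_args() {
  local dumpfile="" destloc=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dumpfile|--destloc)
        # A named option must be followed by its value.
        if [ "$#" -lt 2 ]; then
          echo "Error: missing value for $1"
          return 1
        fi
        if [ "$1" = "--dumpfile" ]; then dumpfile="$2"; else destloc="$2"; fi
        shift 2
        ;;
      *)
        # Anything that matches no declared option is left over.
        echo "Error: Excess arguments provided: '$*'"
        return 1
        ;;
    esac
  done
  echo "dumpfile=$dumpfile destloc=$destloc"
}

# Bare positional values are rejected, as in the question.
parse_args 'gs://enwiki-latest-pages-articles.xml.bz2' 'gs://output_dir/' || true

# Named options are accepted.
parse_args --dumpfile 'gs://enwiki-latest-pages-articles.xml.bz2' --destloc 'gs://output_dir/'
```

The first call prints an "Excess arguments" error because neither value is attached to an option name; the second call prints both values. The same reasoning would apply to anything passed after the gcloud -- separator, since gcloud hands those tokens to the main class unchanged.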

0
votes

It seems like the Scala class needs dumpfile and destloc as args. Could you run the following command instead and see if it works?

gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- dumpfile gs://enwiki-latest-pages-articles.xml.bz2 destloc gs://output_dir/