
I'm creating an EMR cluster through the AWS EMR console, and this time I'm trying to use Hive and S3.

So far I'm just trying to do something very simple: create Hive tables from existing Parquet files.

from pyspark.sql import SparkSession


warehouse_location = "s3a://bucket/databricks_warehouse"
data_location = "s3a://bucket/report_emr_interim"

def register_table(table_name, spark):
    """Register an existing Parquet dataset as a Hive table."""
    print(data_location + "/" + table_name)
    data_location_final = data_location
    # this one table lives under a different prefix than the others
    if table_name == 'ds_ad_mapping':
        data_location_final = 's3a://bucket/ds_report/parquet'
    # expose the Parquet data as a temp view, then clone its schema into a
    # Hive table backed by the warehouse location
    spark.read.parquet("{}/{}".format(data_location_final, table_name))\
      .createOrReplaceTempView("{}_tmp".format(table_name))
    spark.sql("CREATE TABLE IF NOT EXISTS {0} LIKE {0}_tmp LOCATION '{1}/{0}'".format(table_name, warehouse_location))
    # DESC returns a DataFrame; show() actually prints the schema
    spark.sql("DESC {}".format(table_name)).show(truncate=False)


if __name__ == "__main__":
    spark = SparkSession.builder\
      .appName("spark")\
      .config("spark.executor.extraJavaOptions","-Dcom.amazonaws.services.s3.enableV4=true") \
      .config("spark.driver.extraJavaOptions","-Dcom.amazonaws.services.s3.enableV4=true") \
      .enableHiveSupport()\
      .getOrCreate()



    tables = [
        "adwords_ad",
        "adwords_adgroup",
        "adwords_accounts",
        "adwords_duration",
        "adwords_duration_hour",
        "ds_conversion",
        "ds_visit",
        "ds_visit_adgroup_engine_id",
        "ds_conversion_adgroup_engine_id",
        "ds3_adwords_adgroup_hourly",
        "ds_ad_mapping",
        "ds_conversion_adgroup_engine_id",
        "sc_raw_report"
    ]

    for table in tables:
        register_table(table, spark)

Although it works very well locally with spark-submit:

./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 --conf spark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY --conf spark.hadoop.fs.s3a.secret.key=AWS_SECRET_KEY --conf spark.executor.memoryOverhead=2g --driver-memory 5g --executor-cores 1 --executor-memory 6g --num-executors 1 ~/db_migration/make_metastore.py

the same file with the same configuration (3 m5.xlarge instances, just for the sake of testing) refuses to work on EMR.

The stderr from the container is not very informative (I put it in a code snippet for readability):

19/09/05 14:22:17 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/pyspark.zip
19/09/05 14:22:17 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/py4j-0.10.7-src.zip
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-common-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-databind-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-annotations-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-annotations-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.guava_guava-11.0.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-cli_commons-cli-1.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.commons_commons-math3-3.1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/xmlenc_xmlenc-0.52.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-httpclient_commons-httpclient-3.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-codec_commons-codec-1.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-io_commons-io-2.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-net_commons-net-3.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.servlet_servlet-api-2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.mortbay.jetty_jetty-6.1.26.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.mortbay.jetty_jetty-util-6.1.26.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-core-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-json-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-server-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-logging_commons-logging-1.1.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/log4j_log4j-1.2.17.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/net.java.dev.jets3t_jets3t-0.9.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-lang_commons-lang-2.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-configuration_commons-configuration-1.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.10.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.protobuf_protobuf-java-2.5.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.code.gson_gson-2.2.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-auth-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.jcraft_jsch-0.1.42.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-client-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-recipes-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.zookeeper_zookeeper-3.4.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jettison_jettison-1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-jaxrs-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-xc-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.xml.bind_jaxb-api-2.2.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.xml.stream_stax-api-1.0-2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.activation_activation-1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/asm_asm-3.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.httpcomponents_httpclient-4.2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.httpcomponents_httpcore-4.2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.jamesmurty.utils_java-xmlbuilder-0.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-digester_commons-digester-1.8.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-beanutils_commons-beanutils-core-1.8.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-beanutils_commons-beanutils-1.7.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.4.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-framework-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.server_apacheds-i18n-2.0.0-M15.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.api_api-util-1.0.0-M20.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-log4j12-1.7.10.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/io.netty_netty-3.6.2.Final.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.servlet.jsp_jsp-api-2.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/jline_jline-0.9.94.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/junit_junit-4.11.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.hamcrest_hamcrest-core-1.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/joda-time_joda-time-2.10.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 INFO Client: Uploading resource file:/mnt/tmp/spark-ee25ac0f-c8d4-41f8-ba70-21d5ba36e840/__spark_conf__9108363750759351789.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/__spark_conf__.zip
19/09/05 14:22:18 INFO SecurityManager: Changing view acls to: hadoop
19/09/05 14:22:18 INFO SecurityManager: Changing modify acls to: hadoop
19/09/05 14:22:18 INFO SecurityManager: Changing view acls groups to: 
19/09/05 14:22:18 INFO SecurityManager: Changing modify acls groups to: 
19/09/05 14:22:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/09/05 14:22:20 INFO Client: Submitting application application_1567693194513_0001 to ResourceManager
19/09/05 14:22:21 INFO YarnClientImpl: Submitted application application_1567693194513_0001
19/09/05 14:22:22 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:22 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1567693340938
	 final status: UNDEFINED
	 tracking URL: http://ip-172-31-46-157.eu-west-1.compute.internal:20888/proxy/application_1567693194513_0001/
	 user: hadoop
19/09/05 14:22:23 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:24 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:25 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:26 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:27 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:28 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:29 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:30 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:31 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:32 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:33 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:34 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:35 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:36 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:37 INFO Client: Application report for application_1567693194513_0001 (state: FAILED)
19/09/05 14:22:37 INFO Client: 
	 client token: N/A
	 diagnostics: Application application_1567693194513_0001 failed 2 times due to AM Container for appattempt_1567693194513_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1567693194513_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
	at org.apache.hadoop.util.Shell.run(Shell.java:869)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001 Then click on links to logs of each attempt.
. Failing the application.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1567693340938
	 final status: FAILED
	 tracking URL: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001
	 user: hadoop
19/09/05 14:22:37 ERROR Client: Application diagnostics message: Application application_1567693194513_0001 failed 2 times due to AM Container for appattempt_1567693194513_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1567693194513_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
	at org.apache.hadoop.util.Shell.run(Shell.java:869)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1567693194513_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1148)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1525)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:857)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:932)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:941)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/09/05 14:22:37 INFO ShutdownHookManager: Shutdown hook called
19/09/05 14:22:37 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-ee25ac0f-c8d4-41f8-ba70-21d5ba36e840
19/09/05 14:22:37 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a9a4dd32-7d07-433a-9a62-0203cf1d7af1
Command exiting with ret '1'

I'm stuck and don't know how to proceed with debugging this. What would be the best practices here?


1 Answer


As I expected, this is a tough subject, since various approaches are possible.

In my case the error, surprisingly silent locally, was:

SyntaxError: Non-ASCII character '\xe2' in file python_file.py on line 106, but no encoding declared

But this doesn't show up in the step's stderr and was quite hard to spot.
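
For reference, this is the classic Python 2 failure described in PEP 263: the source file contains a non-ASCII byte ('\xe2' is typically the first byte of a UTF-8 curly quote or dash pasted in from a document) but declares no encoding. A minimal sketch of the fix, assuming the cluster runs the script under Python 2:

# -*- coding: utf-8 -*-
# PEP 263 encoding declaration; it must appear on line 1 or 2 of the file.
# With it, Python 2 accepts non-ASCII bytes in the source. The alternative
# fix is simply to find and delete the offending character.

def main():
    note = "durée"  # non-ASCII text is now legal in the source
    print(note)

if __name__ == "__main__":
    main()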

What I have learnt, and would like to share here for anyone having difficulties debugging an EMR cluster, is to go through the following steps:

  • Check the step's stderr (most of the time this is only useful if you didn't provision your instances correctly compared to the spark-submit configuration)

  • Google the error, even though most of the time it won't be useful at all. In my case it was exit code 13, which relates to having SparkSession.builder.master("local[*]") in your code while also defining a master in spark-submit (see the sketch after this list), so it was not related at all here

  • Check the container logs (under Summary -> Configuration details), first stdout and then stderr if stdout wasn't enough
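
For illustration, a minimal sketch of the kind of code that typically triggers exit code 13 on YARN (not the actual cause in my case): a master hard-coded in the application while spark-submit also defines one.

from pyspark.sql import SparkSession

# Hard-coding a local master here conflicts with the --master yarn that
# spark-submit (or the EMR step) passes; in cluster deploy mode the
# ApplicationMaster then exits with code 13.
spark = SparkSession.builder \
    .appName("bad-example") \
    .master("local[*]") \
    .getOrCreate()

# The fix is to drop .master(...) and let spark-submit set the master.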

It is surprisingly simple once you're at ease with EMR, and this answer is aimed at people who feel lost about it.