I'm creating an EMR cluster through the AWS EMR interface, but this time I'm trying to use Hive and S3.
So far I'm just trying to do something very simple: creating Hive tables from existing Parquet files.
from pyspark.sql import SparkSession

warehouse_location = "s3a://bucket/databricks_warehouse"
data_location = "s3a://bucket/report_emr_interim"


def register_table(table_name, spark):
    print(data_location + "/" + table_name)
    # Every table lives under data_location except ds_ad_mapping.
    data_location_final = data_location
    if table_name == 'ds_ad_mapping':
        data_location_final = 's3a://bucket/ds_report/parquet'
    # Expose the Parquet files as a temp view so the table can copy its schema.
    spark.read.parquet("{}/{}".format(data_location_final, table_name))\
        .createOrReplaceTempView("{}_tmp".format(table_name))
    spark.sql("CREATE TABLE IF NOT EXISTS {0} LIKE {0}_tmp LOCATION '{1}/{0}'".format(table_name, warehouse_location))
    spark.sql("DESC {}".format(table_name))


if __name__ == "__main__":
    spark = SparkSession.builder\
        .appName("spark")\
        .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
        .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
        .enableHiveSupport()\
        .getOrCreate()
    tables = [
        "adwords_ad",
        "adwords_adgroup",
        "adwords_accounts",
        "adwords_duration",
        "adwords_duration_hour",
        "ds_conversion",
        "ds_visit",
        "ds_visit_adgroup_engine_id",
        "ds_conversion_adgroup_engine_id",
        "ds3_adwords_adgroup_hourly",
        "ds_ad_mapping",
        "ds_conversion_adgroup_engine_id",
        "sc_raw_report"
    ]
    # Register each table in the Hive metastore.
    for table in tables:
        register_table(table, spark)
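
To make explicit what each call to register_table reads and sends to Hive, here is a small Spark-free sketch that only rebuilds the strings. The helper names resolve_source_path and build_create_table_sql are mine and purely illustrative; the locations are the same placeholders as in the script above.

```python
WAREHOUSE_LOCATION = "s3a://bucket/databricks_warehouse"
DATA_LOCATION = "s3a://bucket/report_emr_interim"


def resolve_source_path(table_name):
    """Return the Parquet path register_table would read for this table."""
    base = DATA_LOCATION
    if table_name == "ds_ad_mapping":
        # Special case from the script: this table lives elsewhere.
        base = "s3a://bucket/ds_report/parquet"
    return "{}/{}".format(base, table_name)


def build_create_table_sql(table_name, warehouse_location=WAREHOUSE_LOCATION):
    """Return the CREATE TABLE ... LIKE statement passed to spark.sql."""
    return "CREATE TABLE IF NOT EXISTS {0} LIKE {0}_tmp LOCATION '{1}/{0}'".format(
        table_name, warehouse_location
    )


if __name__ == "__main__":
    for name in ("ds_visit", "ds_ad_mapping"):
        print(resolve_source_path(name))
        print(build_create_table_sql(name))
```

Printing these for a couple of tables is a quick way to check the S3 paths and the generated DDL without starting a Spark session.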
It works fine locally with spark-submit:
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=AWS_SECRET_KEY \
  --conf spark.executor.memoryOverhead=2g \
  --driver-memory 5g --executor-cores 1 --executor-memory 6g --num-executors 1 \
  ~/db_migration/make_metastore.py
However, the same file with the same configuration fails on EMR (3 m5.xlarge instances, just for the sake of testing).
The stderr from the container is not very informative (I put it in a code snippet for readability):
19/09/05 14:22:17 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/pyspark.zip
19/09/05 14:22:17 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/py4j-0.10.7-src.zip
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-common-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-databind-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-annotations-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-annotations-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.guava_guava-11.0.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-cli_commons-cli-1.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.commons_commons-math3-3.1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/xmlenc_xmlenc-0.52.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-httpclient_commons-httpclient-3.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-codec_commons-codec-1.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-io_commons-io-2.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-net_commons-net-3.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.servlet_servlet-api-2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.mortbay.jetty_jetty-6.1.26.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.mortbay.jetty_jetty-util-6.1.26.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-core-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-json-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.jersey_jersey-server-1.9.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-logging_commons-logging-1.1.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/log4j_log4j-1.2.17.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/net.java.dev.jets3t_jets3t-0.9.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-lang_commons-lang-2.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-configuration_commons-configuration-1.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.10.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.protobuf_protobuf-java-2.5.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.code.gson_gson-2.2.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hadoop_hadoop-auth-2.7.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.jcraft_jsch-0.1.42.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-client-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-recipes-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.zookeeper_zookeeper-3.4.6.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jettison_jettison-1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-jaxrs-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-xc-1.9.13.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.xml.bind_jaxb-api-2.2.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.xml.stream_stax-api-1.0-2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.activation_activation-1.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/asm_asm-3.2.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.httpcomponents_httpclient-4.2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.httpcomponents_httpcore-4.2.5.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.jamesmurty.utils_java-xmlbuilder-0.4.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-digester_commons-digester-1.8.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-beanutils_commons-beanutils-core-1.8.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/commons-beanutils_commons-beanutils-1.7.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.4.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.curator_curator-framework-2.7.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.server_apacheds-i18n-2.0.0-M15.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.directory.api_api-util-1.0.0-M20.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-log4j12-1.7.10.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/io.netty_netty-3.6.2.Final.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/javax.servlet.jsp_jsp-api-2.1.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/jline_jline-0.9.94.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/junit_junit-4.11.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.hamcrest_hamcrest-core-1.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/joda-time_joda-time-2.10.3.jar added multiple times to distributed cache.
19/09/05 14:22:18 INFO Client: Uploading resource file:/mnt/tmp/spark-ee25ac0f-c8d4-41f8-ba70-21d5ba36e840/__spark_conf__9108363750759351789.zip -> hdfs://ip-172-31-46-157.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1567693194513_0001/__spark_conf__.zip
19/09/05 14:22:18 INFO SecurityManager: Changing view acls to: hadoop
19/09/05 14:22:18 INFO SecurityManager: Changing modify acls to: hadoop
19/09/05 14:22:18 INFO SecurityManager: Changing view acls groups to:
19/09/05 14:22:18 INFO SecurityManager: Changing modify acls groups to:
19/09/05 14:22:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/09/05 14:22:20 INFO Client: Submitting application application_1567693194513_0001 to ResourceManager
19/09/05 14:22:21 INFO YarnClientImpl: Submitted application application_1567693194513_0001
19/09/05 14:22:22 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:22 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1567693340938
final status: UNDEFINED
tracking URL: http://ip-172-31-46-157.eu-west-1.compute.internal:20888/proxy/application_1567693194513_0001/
user: hadoop
19/09/05 14:22:23 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:24 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:25 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:26 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:27 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:28 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:29 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:30 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:31 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:32 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:33 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:34 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:35 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:36 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)
19/09/05 14:22:37 INFO Client: Application report for application_1567693194513_0001 (state: FAILED)
19/09/05 14:22:37 INFO Client:
client token: N/A
diagnostics: Application application_1567693194513_0001 failed 2 times due to AM Container for appattempt_1567693194513_0001_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1567693194513_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001 Then click on links to logs of each attempt.
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1567693340938
final status: FAILED
tracking URL: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001
user: hadoop
19/09/05 14:22:37 ERROR Client: Application diagnostics message: Application application_1567693194513_0001 failed 2 times due to AM Container for appattempt_1567693194513_0001_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1567693194513_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-46-157.eu-west-1.compute.internal:8088/cluster/app/application_1567693194513_0001 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1567693194513_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1148)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1525)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:857)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:932)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:941)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/09/05 14:22:37 INFO ShutdownHookManager: Shutdown hook called
19/09/05 14:22:37 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-ee25ac0f-c8d4-41f8-ba70-21d5ba36e840
19/09/05 14:22:37 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a9a4dd32-7d07-433a-9a62-0203cf1d7af1
Command exiting with ret '1'
I'm stuck: I don't know what to do next or how to debug this. What are the best practices for debugging this kind of failure?
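
The only lead I can see is the hint in the diagnostics to look at the logs of each attempt, which I assume means pulling the aggregated container logs with `yarn logs -applicationId <id>`. Below is a minimal sketch that extracts the application id, the last reported state and the exit code from the spark-submit stderr and builds that command. The function names are hypothetical, and SAMPLE is just three lines copied from the output above.

```python
import re

APP_ID_RE = re.compile(r"application_\d+_\d+")
STATE_RE = re.compile(r"\(state: ([A-Z_]+)\)")
EXIT_CODE_RE = re.compile(r"exitCode[:=]\s*(\d+)")

# Three lines taken from the stderr shown above.
SAMPLE = "\n".join([
    "19/09/05 14:22:36 INFO Client: Application report for application_1567693194513_0001 (state: ACCEPTED)",
    "19/09/05 14:22:37 INFO Client: Application report for application_1567693194513_0001 (state: FAILED)",
    "diagnostics: Application application_1567693194513_0001 failed 2 times due to AM Container "
    "for appattempt_1567693194513_0001_000002 exited with exitCode: 13",
])


def summarize_submit_log(text):
    """Pull the application id, last reported state and exit code out of the log text."""
    app_ids = APP_ID_RE.findall(text)
    states = STATE_RE.findall(text)
    exit_codes = EXIT_CODE_RE.findall(text)
    return {
        "app_id": app_ids[0] if app_ids else None,
        "last_state": states[-1] if states else None,
        "exit_code": int(exit_codes[0]) if exit_codes else None,
    }


def yarn_logs_command(app_id):
    """Build the argument list for fetching the aggregated YARN logs of an application."""
    return ["yarn", "logs", "-applicationId", app_id]


if __name__ == "__main__":
    summary = summarize_submit_log(SAMPLE)
    print(summary)
    print(" ".join(yarn_logs_command(summary["app_id"])))
```

On the master node this would give `yarn logs -applicationId application_1567693194513_0001`. I'm assuming log aggregation is enabled on the cluster; otherwise the tracking URL in the diagnostics is the only place I know to look.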