I am trying to run an EMR (1 master and 2 core nodes) step with a very simple python script that i uploaded to s3 to be used in EMR spark application step. This script reads a data.txt file in S3 and saves it back, and it can be seen below,
import pyspark
import boto3
sc = SparkContext()
text_file = sc.textFile('s3://First_bucket/data.txt')
text_file.repartition(1).saveAsTextFile('s3://First_bucket/logdata')
sc.stop()
However, this straight script does not cause a bug when import boto3 is not used. To fix this problem i have tried to add a bootstrap action with boto.sh file while I am creating my EMR cluster. The boto.sh file that i used is as the follow,
#!/bin/bash
sudo easy_install-3.6 pip
sudo pip install --target /usr/lib/spark/python/ boto3
Unfortunately, this just enabled boto3 library on master node not core nodes. Again the EMR step of doing this is failed, and the error log file is:
2020-02-08T20:56:49.698Z INFO Ensure step 4 jar file command-runner.jar
2020-02-08T20:56:49.699Z INFO StepRunner: Created Runner for step 4
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster s3://First_bucket/data.py'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=eu-central-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
PYTHON_INSTALL_LAYOUT=amzn
HOSTNAME=ip-***-***-***-***
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2V51S7I25TLLW
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
EMR_STEP_ID=s-2V51S7I25TLLW
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW
INFO ProcessRunner started child process 22893
2020-02-08T20:56:49.705Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2V51S7I25TLLW Child Pid: 22893
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 26 seconds
2020-02-08T20:57:15.787Z INFO Step created jobs:
2020-02-08T20:57:15.787Z WARN Step failed with exitCode 1 and took 26 seconds
My question is how to use EMR spark application step with python script that contains libraries such as boto3. Thank in advance.