
A DynamoDB table backup using AWS Data Pipeline failed with the following error:

02 May 2017 07:19:04,544 [WARN] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: EMR job flow named 'df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55' with jobFlowId 'j-2SJ0OQOM0BTI' is in status 'RUNNING' because of the step 'df-0940986HJGYQM1ZJ8BN_@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' failures 'null'
02 May 2017 07:19:04,544 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: EMR job '@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' with jobFlowId 'j-2SJ0OQOM0BTI' is in  status 'RUNNING' and reason 'Running step'. Step 'df-0940986HJGYQM1ZJ8BN_@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' is in status 'FAILED' with reason 'null'
02 May 2017 07:19:04,544 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: Collecting steps stderr logs for cluster with AMI 3.9.0
02 May 2017 07:19:04,558 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
 at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:460)
 at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
 at org.apache.hadoop.dynamodb.tools.DynamoDbExport.run(DynamoDbExport.java:79)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at org.apache.hadoop.dynamodb.tools.DynamoDbExport.main(DynamoDbExport.java:30)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

The table contains a large amount of data (6 million). The pipeline ran for 4 days and then failed. I can't figure out what the error means.

Can you provide your pipeline definition? – jc mannem
Also, try asking this on the AWS Forums, where someone who can access your pipeline can provide more info. – jc mannem
Did you find a solution to this problem? – Nir
@Nir Please take a look at my answer below. – azec-pdx

1 Answer


From analyzing your logs, and particularly this line...

org.apache.hadoop.dynamodb.tools.DynamoDbExport

it seems you are running an AWS Data Pipeline created from the pre-defined template named "Export DynamoDB table to S3".

This pipeline takes several input parameters that you can edit in the pipeline Architect, but the most important ones are:

  1. myDDBTableName - the name of the DynamoDB table being exported.
  2. myOutputS3Loc - the full S3 path to which you want the MapReduce job to export your data. This has to be of the format s3://<S3_BUCKET_NAME>/<S3_BUCKET_PREFIX>/. The MR job will then export your data to an S3 location whose prefix includes a date-time stamp (e.g. s3://<S3_BUCKET_NAME>/<S3_BUCKET_PREFIX>/2019-08-13-15-32-02).
  3. myDDBReadThroughputRatio - specifies the proportion of your DDB table's RCUs that the MR job will consume to complete the operation. It is advised to set your provisioned throughput based on your recent metrics plus extra RCUs for the MR job. In other words, do not leave your DDB table with "On Demand" provisioning - it will not work. Also, I advise you to be generous with the extra RCUs for the MR job, as that helps the EMR cluster finish faster, and extra RCUs for a few hours are cheaper than extra EMR compute resources (see the sketch after this list).
  4. myDDBRegion - the region of your DDB table (remember: DynamoDB tables are region-scoped, regardless of the concept of Global Tables).
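
To help pick a value for myDDBReadThroughputRatio, it is worth checking what the table actually has provisioned. Below is a minimal boto3 sketch of that check; the table name, region, and the 0.5 ratio are placeholders I made up for illustration, not values from your pipeline:

```python
import boto3

# Placeholder values for illustration; substitute your own.
TABLE_NAME = "MyTable"
REGION = "us-east-1"

dynamodb = boto3.client("dynamodb", region_name=REGION)

table = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]
rcus = table.get("ProvisionedThroughput", {}).get("ReadCapacityUnits", 0)

# An on-demand table reports 0 provisioned RCUs, which the export's
# read-ratio calculation cannot work with.
if rcus == 0:
    print("Table appears to be on-demand; switch to provisioned capacity "
          "before running the export.")
else:
    ratio = 0.5  # e.g. let the MR job consume half of the provisioned RCUs
    print(f"Provisioned RCUs: {rcus}")
    print(f"With myDDBReadThroughputRatio={ratio}, the export will consume "
          f"roughly {int(rcus * ratio)} RCUs.")
```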

Now that we are familiar with the parameters this data pipeline needs, let's look at this log line:

02 May 2017 07:19:04,558 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)

Although it didn't bubble up at the ERROR log level, this error message comes from the Hadoop framework being unable to validate the output location for the job. When your Data Pipeline submitted the task through the Task Runner, Hadoop evaluated the output location and determined it was not something it could write to. This can mean multiple things:

  1. Your Data Pipeline parameter myOutputS3Loc was changed to an invalid value between runs.
  2. Your Data Pipeline parameter myOutputS3Loc is pointing to an S3 bucket that was removed in the meantime.

I would suggest inspecting the myOutputS3Loc parameter and the values passed for it to make sure your MR job is getting the right inputs. You can also verify which parameters were submitted to the EMR step by inspecting the controller logs in the EMR console while your job is running.
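
As a quick sanity check for both possibilities above, you can pull the parameter values actually attached to the pipeline and confirm the target bucket still exists. A rough boto3 sketch follows (the pipeline ID is taken from your logs; the region and the exact checks are illustrative assumptions):

```python
import boto3
from botocore.exceptions import ClientError

PIPELINE_ID = "df-0940986HJGYQM1ZJ8BN"  # from the log lines above
REGION = "us-east-1"                    # placeholder; use your pipeline's region

datapipeline = boto3.client("datapipeline", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Pull the parameter values that were actually attached to the pipeline.
definition = datapipeline.get_pipeline_definition(pipelineId=PIPELINE_ID)
values = {v["id"]: v["stringValue"] for v in definition.get("parameterValues", [])}
output_loc = values.get("myOutputS3Loc", "")
print("myOutputS3Loc =", output_loc)

# Basic validation: the value must be an s3:// URI and the bucket must still exist.
if not output_loc.startswith("s3://"):
    print("Output location is not an s3:// URI - the MR job will fail on it.")
else:
    bucket = output_loc[len("s3://"):].split("/", 1)[0]
    try:
        s3.head_bucket(Bucket=bucket)
        print(f"Bucket '{bucket}' exists and is reachable.")
    except ClientError as err:
        print(f"Bucket '{bucket}' check failed: {err}")
```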