4
votes

I got the following error when loading data from Impala to Vertica with Sqoop.

Error: java.io.IOException: Can't export data, please check failed map task logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: java.sql.BatchUpdateException: [Vertica]VJDBC One or more rows were rejected by the server.
    at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:233)
    at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:46)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:84)
    ... 10 more
Caused by: java.sql.BatchUpdateException: [Vertica]VJDBC One or more rows were rejected by the server.
    at com.vertica.jdbc.SStatement.processBatchResults(Unknown Source)
    at com.vertica.jdbc.SPreparedStatement.executeBatch(Unknown Source)
    at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:231)

And I was running the following command:

sudo -u impala sqoop export -Dsqoop.export.records.per.statement=xxx --driver com.vertica.jdbc.Driver --connect jdbc:vertica://host:5433/db --username name --password pw --table table --export-dir /some/dir -m 1 --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --batch

This error is not raised every time; I had several successful tests loading over 2 million rows of data. So I suspect there is some bad data containing special characters among the rejected rows. This is very annoying because when the error is raised, the MapReduce job rolls back and retries, which leaves a lot of duplicate data in the target table.

Does anyone know whether there is a Sqoop export parameter that can be set to handle special characters, or any way to skip the bad data, i.e. to disable the rollback? Thanks!

Why not just try something like Talend? – Kermit
What is the error for failed map tasks? – rpwr021

1 Answer

8
votes

This may not be just special characters. If you try to stuff 'abc' into a numeric field, for example, that row would get rejected. Even though you get this error, I believe it is not raised until after the load, and all data that could be committed should have been committed (but I would verify that). If you isolate the "missing" rows, you might be able to figure out what is wrong with the data or the field definition.

Common things to look for (a rough pre-check sketch follows the list):

  1. Stuffing character-type data into numeric fields (this may come from implicit conversions, or only show up when the values are non-NULL).
  2. NULL values going into NOT NULL fields.
  3. Counting characters and VARCHAR octets as equivalent. VARCHAR(x) represents octets, but a UTF-8 character can have multiple octets.
  4. Similar to #3, strings too long to fit in designated fields.
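
If you want to narrow the rejects down before Sqoop touches the data, a rough pre-check of the export files against the target schema can help. Below is a minimal sketch; the file name (part-m-00000), the three-column layout, and the per-column rules are all placeholders you would swap for your real table definition:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.nio.charset.StandardCharsets;

    // Rough pre-check of a tab-delimited export file against assumed column rules.
    // The file name, column count, and per-column checks are placeholders --
    // substitute the real schema of your Vertica target table.
    public class RowSanityCheck {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader("part-m-00000"))) {
                String line;
                long lineNo = 0;
                while ((line = in.readLine()) != null) {
                    lineNo++;
                    String[] f = line.split("\t", -1);
                    if (f.length != 3) {                      // assumed 3-column table
                        System.out.println(lineNo + ": wrong field count " + f.length);
                        continue;
                    }
                    try {
                        Long.parseLong(f[0]);                 // assumed numeric first column
                    } catch (NumberFormatException e) {
                        System.out.println(lineNo + ": non-numeric id '" + f[0] + "'");
                    }
                    if (f[1].isEmpty()) {                     // assumed NOT NULL second column
                        System.out.println(lineNo + ": empty value in NOT NULL column");
                    }
                    // VARCHAR(x) in Vertica counts octets, not characters (see #3 above)
                    int octets = f[2].getBytes(StandardCharsets.UTF_8).length;
                    if (octets > 50) {                        // assumed VARCHAR(50) third column
                        System.out.println(lineNo + ": " + octets + " octets exceed VARCHAR(50)");
                    }
                }
            }
        }
    }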

In the driver, the batch inserts are being replaced with a COPY FROM STDIN statement. You might be able to find the statement in query_requests, although I'm not sure it will help.
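
If you do want to hunt for that statement, one option is to query Vertica's monitoring schema over JDBC. This is only a sketch: the connection details are the ones from your command, and the v_monitor.query_requests table with its request/start_timestamp columns is assumed from the standard monitoring schema, so verify against your Vertica version:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // List recent COPY statements recorded in Vertica's monitoring schema.
    // Table and column names are assumed from the standard v_monitor schema.
    public class FindCopyStatements {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:vertica://host:5433/db";
            try (Connection conn = DriverManager.getConnection(url, "name", "pw");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT start_timestamp, request " +
                     "FROM v_monitor.query_requests " +
                     "WHERE request ILIKE 'COPY%' " +
                     "ORDER BY start_timestamp DESC LIMIT 20")) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp(1) + "  " + rs.getString(2));
                }
            }
        }
    }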

Sqoop doesn't really give you much opportunity to investigate this further (as far as I am aware; I checked the generic JDBC loader). One could look at the return array from executeBatch() and tie it back to your execution batch. Maybe modify the generic JDBC loader?
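
For reference, this is roughly what inspecting a batch result looks like in plain JDBC: BatchUpdateException.getUpdateCounts() reports a status per batched row, and Statement.EXECUTE_FAILED marks the rejected ones. It is an illustrative sketch against a placeholder table, not Sqoop's actual export code, and depending on the driver the array may only cover the rows processed before the failure:

    import java.sql.BatchUpdateException;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    // Illustrative JDBC sketch: tie executeBatch() failures back to individual rows.
    // The table and column names are placeholders, not Sqoop's generated SQL.
    public class BatchInspector {
        static void insertBatch(Connection conn, String[][] rows) throws Exception {
            String sql = "INSERT INTO some_table (id, name) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                try {
                    ps.executeBatch();
                } catch (BatchUpdateException e) {
                    // Update counts line up with the order rows were added to the batch;
                    // EXECUTE_FAILED marks entries the server rejected.
                    int[] counts = e.getUpdateCounts();
                    for (int i = 0; i < counts.length; i++) {
                        if (counts[i] == Statement.EXECUTE_FAILED) {
                            System.err.println("Rejected row " + i + ": " + String.join("\t", rows[i]));
                        }
                    }
                }
            }
        }
    }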

Hope this helps.