
I'm using Apache Pig to do some data processing work. I wrote a Pig Latin script like this:

raw = LOAD 'data.csv' USING MyLoader();
repaired = FOREACH raw GENERATE MyRepairFunc(*);
filtered = FOREACH repaired GENERATE $0 AS name:chararray, $3 AS age:int;
DUMP filtered;

Pig raised an error:

java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
    at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:115)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:124)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:281)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:274)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

It's a data casting problem. Because the raw data may contain damaged records, I can't declare the schema at load time without risking data loss.

What should I do to fix this? Thanks a lot.


1 Answer


You should fix your raw data (data cleansing) before analyzing it.
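One way to do that without leaving Pig is to filter out records that don't match the expected pattern and cast fields explicitly, instead of only declaring types in the AS clause. A minimal sketch, reusing the aliases from your script and assuming $3 should hold a numeric age (the explicit casts rely on your loader supplying a suitable LoadCaster):

raw      = LOAD 'data.csv' USING MyLoader();
repaired = FOREACH raw GENERATE MyRepairFunc(*);

-- keep only records whose fourth field looks like an integer
clean    = FILTER repaired BY (chararray)$3 MATCHES '[0-9]+';

-- cast explicitly rather than only declaring the type with AS
typed    = FOREACH clean GENERATE (chararray)$0 AS name, (int)$3 AS age;

DUMP typed;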

There is a Pig UDF that tries to cleanse raw data against an expected pattern, but it was never merged into the main branch: PIG-3735 (UDF to data cleanse the dirty data with expected pattern).

You can also try cleansing the raw data with your favorite tools; see the tools recommended at https://infocus.emc.com/david_dietrich/the-dirty-little-secret-of-big-data-projects/
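If you're worried about losing data (as you mention in the question), you can also route the damaged records to a separate relation instead of dropping them. A hedged sketch using Pig's SPLIT ... OTHERWISE, again assuming the aliases and field positions from your script; the 'dirty_records' output path is just an example:

-- split repaired records into clean and dirty relations so nothing is thrown away
SPLIT repaired INTO good IF (chararray)$3 MATCHES '[0-9]+',
                    bad OTHERWISE;

typed = FOREACH good GENERATE (chararray)$0 AS name, (int)$3 AS age;
STORE bad INTO 'dirty_records';  -- keep the damaged rows for inspection or repair
DUMP typed;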