
Tweets from Twitter are stored in HDFS on Hadoop and need to be processed for sentiment analysis. The tweets in HDFS are in Avro format, so they are being processed with a JSON loader, but in the Pig script the tweets are not getting read from HDFS. After changing the JAR files, the Pig script shows a failed message.

The Pig script fails when using the following JAR files:

REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';

REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';

REGISTER '/home/cloudera/Desktop/json-simple-3.1.0.jar';

With this other set of JAR files the script does not fail, but the data is still not read:

REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';

REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';

REGISTER '/home/cloudera/Desktop/json-simple-1.1.jar';

Here are all the Pig commands I have used:

tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;

tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;

dictionary = load '/user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS (word:chararray, rating:int);

word_rating = join tokens by word left outer, dictionary by word using 'replicated';

describe word_rating;

rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;

word_group = group rating by (id,tweets);

avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;

positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;

negative_tweets = filter avg_rate by tweet_rating < 0;

DUMP negative_tweets;

Error when dumping the tweets with the first set of JAR files:

Input(s): Failed to read data from "/user/cloudera/OutputData/tweets"

Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1556902124324_0001


2019-05-03 09:59:09,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log

Output when dumping the tweets with the second set of JAR files (the job reports success but reads no records):

Input(s): Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"

Output(s): Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1556902124324_0002


2019-05-03 10:01:05,417 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

The expected output was the tweets sorted into positive and negative, but instead I am getting the errors above. Please do help. Thank you.


1 Answer


ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException

This usually indicates a syntax error in the Pig script.

The AS keyword in a LOAD statement usually requires a schema. `myMap` in your LOAD statement is not a valid schema: it has a name but no type.

See https://stackoverflow.com/a/12829494/8886552 for an example of JsonLoader.
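As an untested sketch of the fix, following the usual elephant-bird JsonLoader convention of loading each JSON document as a single typed map, the LOAD statement could declare the schema like this:

```pig
-- Give the loaded value a name *and* a type; '-nestedLoad' keeps
-- nested JSON objects as nested maps.
tweets = LOAD '/user/cloudera/OutputData/tweets'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (myMap:map[]);

-- The map lookups in the rest of the script then work unchanged:
B = FOREACH tweets GENERATE myMap#'id' AS id, myMap#'tweets' AS tweets;
```

If the load still reads zero records after this, a quick sanity check is to load the same path with the built-in `TextLoader()` as `(line:chararray)` and DUMP a few lines, to confirm the files really contain one JSON object per line rather than binary Avro.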