I have recently been working with Apache Pig and want to extract a few columns from the Yelp dataset. The code I used is below. I ran it both on the Hortonworks platform and on my own machine (Ubuntu). In both cases the output contains values from different columns than the ones I asked for (they look like business_id and full_address rather than name and state). Please point out where I am making a mistake.
Query:
grunt> business = load 'yelp_academic_dataset_business.json'
using JsonLoader('name:chararray, state:chararray');
grunt> business_name = foreach business generate name, state;
grunt> toPrint = limit business_name 5;
grunt> dump toPrint;
Output:
(5AJdS8LYpCgzfOwGaEqZkA,14362 N Frank Lloyd Wright Blvd Ste B104 Scottsdale, AZ 85260)
(6UXw7_U13Th0PZlMXZbjMg,McCarran Airport Across From Gate D1 Southeast Las Vegas, NV)
(80VmGCy6UcYYCKC_BONZTQ,524 N 92nd St Scottsdale, AZ 85256)
(95p9Xg358BezJyk1wqzzyg,5114 Farwell St Mc Farland, WI 53558)
(EkhrRWzevfFJc8Pm2dVPEA,140 University Avenue W Waterloo, ON N2L 3W6)
Sample Input from the file:
{
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
"full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
"hours": {
"Tuesday": {"close": "17:00", "open": "08:00"},
"Friday": {"close": "17:00", "open": "08:00"},
"Monday": {"close": "17:00", "open": "08:00"},
"Wednesday": {"close": "17:00", "open": "08:00"},
"Thursday": {"close": "17:00", "open": "08:00"}
},
"open": true,
"categories": ["Doctors", "Health & Medical"],
"city": "Phoenix",
"review_count": 7,
"name": "Eric Goldberg, MD",
"neighborhoods": [],
"longitude": -111.98375799999999,
"state": "AZ",
"stars": 3.5,
"latitude": 33.499313000000001,
"attributes": {"By Appointment Only": true},
"type": "business"
}
Edit 2:
I have also copied the elephant-bird-2.2.3.jar file into the /hadoop/bin folder. From within that folder I run "pig -x local" to launch Pig in local mode. Once it starts up, I register elephant-bird-2.2.3.jar and then run the query.
After registering the elephant-bird jar:
grunt> register elephant-bird-2.2.3.jar;
grunt> business = load 'yelp_academic_dataset_business.json'
using JsonLoader('name:chararray, state:chararray');
grunt> business_name = foreach business generate name, state;
grunt> toPrint = limit business_name 5;
grunt> dump toPrint;
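One variation I have not verified yet is sketched below. Registering the jar does not change which class the unqualified name JsonLoader refers to, so the query above may still be using Pig's built-in loader. Pig's built-in JsonLoader, as far as I understand it, also expects the schema to describe the fields in the order they appear in each record. The sketch refers to the Elephant Bird loader by its fully qualified class name, loads each record as a map, and looks the fields up by key. The relation and alias names (json, name, state) are my own choices. It assumes one JSON object per line in the input file and that any jars the loader depends on (json-simple, I believe) are available.

```pig
-- Sketch only: assumes elephant-bird-2.2.3.jar is in the current directory
-- and that the loader's own dependencies (e.g. json-simple) are on the classpath.
register elephant-bird-2.2.3.jar;

-- Use the fully qualified class name so Pig does not fall back to its built-in JsonLoader.
-- Each input line (one JSON object per line) becomes a single map field.
business = load 'yelp_academic_dataset_business.json'
    using com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);

-- Look fields up by key instead of relying on their position in the record.
business_name = foreach business generate
    (chararray) json#'name'  as name,
    (chararray) json#'state' as state;

toPrint = limit business_name 5;
dump toPrint;
```

If this behaves as I expect, each tuple should contain the business name and the two-letter state instead of the ID and address shown in the output above.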