I have been working with Apache Pig recently. I wanted to extract a few columns from the Yelp dataset. The code I used is below. I ran it both on the Hortonworks platform and on my own machine (Ubuntu). In both cases the output contains values from different columns than the ones I asked for. Please point out where I am making a mistake.

Query:

grunt> business = load 'yelp_academic_dataset_business.json' 
          using JsonLoader('name:chararray, state:chararray');

grunt> business_name = foreach business generate name, state;                                                    
grunt> toPrint = limit business_name 5;                                                                          
grunt> dump toPrint; 

Output:

(5AJdS8LYpCgzfOwGaEqZkA,14362 N Frank Lloyd Wright Blvd Ste B104 Scottsdale, AZ 85260)
(6UXw7_U13Th0PZlMXZbjMg,McCarran Airport Across From Gate D1 Southeast Las Vegas, NV)
(80VmGCy6UcYYCKC_BONZTQ,524 N 92nd St Scottsdale, AZ 85256)
(95p9Xg358BezJyk1wqzzyg,5114 Farwell St Mc Farland, WI 53558)
(EkhrRWzevfFJc8Pm2dVPEA,140 University Avenue W Waterloo, ON N2L 3W6)

Sample Input from the file:

{
   "business_id": "vcNAWiLM4dR7D2nwwJ7nCA", 
   "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", 
   "hours": {
            "Tuesday": {"close": "17:00", "open": "08:00"}, 
            "Friday": {"close": "17:00", "open": "08:00"}, 
            "Monday": {"close": "17:00", "open": "08:00"}, 
            "Wednesday": {"close": "17:00", "open": "08:00"},
            "Thursday": {"close": "17:00", "open": "08:00"}
            },
    "open": true,
    "categories": ["Doctors", "Health & Medical"], 
    "city": "Phoenix", "review_count": 7,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [], 
    "longitude": -111.98375799999999,
    "state": "AZ",
    "stars": 3.5,
    "latitude": 33.499313000000001,
    "attributes": {"By Appointment Only": true},
    "type": "business"
}

Edit 2:

I have also copied the elephant-bird-2.2.3.jar file into the /hadoop/bin folder. From within that folder I run "pig -x local" to launch Pig in local mode. Once it starts up, I register elephant-bird-2.2.3.jar and then proceed with the query.

After including the elephant-bird jar:

grunt> register elephant-bird-2.2.3.jar;
grunt> business = load 'yelp_academic_dataset_business.json' 
              using JsonLoader('name:chararray, state:chararray');

grunt> business_name = foreach business generate name, state;                                                    
grunt> toPrint = limit business_name 5;                                                                          
grunt> dump toPrint; 

1 Answer

For nested JSON, you can load the data with the following script:

register elephant-bird-pig-4.4.jar;
register elephant-bird-core-4.4.jar;
register elephant-bird-hadoop-compat-4.4.jar;
register google-collections-1.0-rc1.jar;
register json_simple-1.1.jar;

business = load 'yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
dump business;

Alternatively, you can load it as:

business = load 'pigJsontest.txt' using com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);

Either of these scripts loads the JSON data as a Pig map (key/value pairs); see http://pig.apache.org/docs/r0.11.1/basic.html#map-schema

You then need to access specific elements by looking them up in this map with the # operator. Example:

id_state = foreach business generate (CHARARRAY)$0#'business_id' as business_id, (CHARARRAY)$0#'state' as state;
dump id_state;
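
Applied to the columns asked about in the question (name and state), the same map lookup might look like the sketch below. It assumes the five jars registered in the first script are available in the working directory; the alias json is just an illustrative name for the map column.

```pig
-- Register the Elephant Bird jars and their dependencies
-- (same jars as in the first script; assumed to be in the working directory).
register elephant-bird-pig-4.4.jar;
register elephant-bird-core-4.4.jar;
register elephant-bird-hadoop-compat-4.4.jar;
register google-collections-1.0-rc1.jar;
register json_simple-1.1.jar;

-- Load each JSON record as a single map column named json.
business = load 'yelp_academic_dataset_business.json'
    using com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);

-- Look up the two fields by key, so their position in the record does not matter.
business_name = foreach business generate
    (chararray) json#'name'  as name,
    (chararray) json#'state' as state;

toPrint = limit business_name 5;
dump toPrint;
```

Because the fields are selected by key rather than by position, this should return the name and state values instead of business_id and full_address.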