I'm trying to learn Apache Pig, Hadoop, and friends. For now I'm working with New York City parking ticket data.

I load the data with:

data = load 'nyc/smallNYC.csv' USING PigStorage(',') AS
(
  SummonsNumber:int,
  PlateID:chararray,
  RegistrationState:chararray,
  PlateType:chararray,
  ...
  StreetName:chararray
  ... -- and a lot more
);

Now I'd like to add two new columns to this dataset (or attach two new keys to each record). The first would be CleanedStreetName (for the sake of this question, assume I want to generate it using LOWER(StreetName)); the second would be IssueYear.

Then I'd like to filter, group, and so forth using these columns. I couldn't find any guide that explains how to do this.

So here are the questions:

  • Is this a sensible thing to do? Maybe I should calculate these values on the fly?
  • If this is a sensible thing to do, please post a snippet that adds the CleanedStreetName column.
1
Do you want to generate two new columns (CleanedStreetName and IssueYear) on the fly, based on some filter condition on the existing inputs? If so, it's possible. Can you paste your sample input data and your expected final output? - Sivasakthi Jayaraman

1 Answer

In Pig you use FOREACH to generate projections of the data.

You didn't specify how you want to get IssueYear, so I just assigned it the constant 0:

NEWDATA = FOREACH data GENERATE *, LOWER(StreetName) AS CleanedStreetName, 0 AS IssueYear;
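
As for whether this is sensible: Pig only builds an execution plan until it reaches a DUMP or STORE, so adding derived columns with FOREACH and then filtering or grouping on them is a common pattern rather than an extra materialization step. Below is a rough sketch of how the derived columns might be used afterwards. It assumes the schema contains a field named IssueDate of type chararray in MM/dd/yyyy format (the question does not show this field, so both the name and the format are guesses), that your Pig version has the ToDate and GetYear built-ins (0.11 or later), and the year 2013 is an arbitrary example value.

```pig
-- 'data' is the relation loaded in the question.
-- IssueDate and its 'MM/dd/yyyy' format are assumptions, not taken from the question.
enriched = FOREACH data GENERATE
    *,
    LOWER(StreetName) AS CleanedStreetName,
    GetYear(ToDate(IssueDate, 'MM/dd/yyyy')) AS IssueYear;

-- Filter on a derived column (2013 is just an example threshold).
recent = FILTER enriched BY IssueYear >= 2013;

-- Group on the other derived column and count tickets per street.
by_street = GROUP recent BY CleanedStreetName;
counts = FOREACH by_street GENERATE
    group AS CleanedStreetName,
    COUNT(recent) AS Tickets;

DUMP counts;
```

If the date column has a different name or format, only the ToDate call needs to change; the FILTER and GROUP steps refer to the derived column names alone.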