I'm trying to learn Apache Pig, Hadoop and friends. For now I'm working with New York City ticket data.
I load the data with:
data = load 'nyc/smallNYC.csv' USING PigStorage(',') AS
(
SummonsNumber:int,
PlateID:chararray,
RegistrationState:chararray,
PlateType:chararray,
...
StreetName:chararray
... -- and a lot more
);
Now I'd like to add two new columns to this dataset (or attach two new fields to each record). The first would be CleanedStreetName (for the sake of this question, assume that I want to generate this column using LOWER(StreetName)); the second would be IssueYear.
Then I'd like to filter, group and so forth using these columns. I couldn't find any guide that explains how to do this using Pig.
So here are the questions:
- Is this a sensible thing to do? Maybe I should calculate these values on the fly?
- If this is a sensible thing to do, please post a snippet that adds the CleanedStreetName column.
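
To make the second question concrete, here is a rough, untested sketch of the shape I have in mind. It assumes the relation `data` from the load statement above; the relation names `with_clean`, `by_street` and `counts` and the alias `tickets` are made up for illustration. It uses FOREACH ... GENERATE with `*` to keep every existing field and the built-in LOWER to derive the new one:

```pig
-- Assumes `data` is the relation loaded above, with StreetName:chararray.
-- Keep every existing field and append the derived column.
with_clean = FOREACH data GENERATE *, LOWER(StreetName) AS CleanedStreetName;

-- The new column can then be used like any other field, e.g. for grouping.
by_street = GROUP with_clean BY CleanedStreetName;

-- Hypothetical follow-up: count tickets per cleaned street name.
counts = FOREACH by_street GENERATE group AS CleanedStreetName, COUNT(with_clean) AS tickets;
```

As far as I understand, Pig evaluates lazily, so a derived relation like `with_clean` is only computed when something is stored or dumped. That may make the distinction between "adding a column" and "calculating on the fly" smaller than I assumed.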