I would like to use a function to update a column in a relation. I've figured out how to add a new column with my updated data and drop the old column, but the new column loses its field name, which I'd like to preserve.
For example, say students.txt is:
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
In Pig:
x = load 'students.txt' as (name:chararray, age:int, gpa:float);
dump x;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
describe x;
x: {name: chararray,age: int,gpa: float}
y = foreach x generate name, (age==18?999:age), gpa;
dump y;
(John,999,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,999,3.8)
describe y;
y: {name: chararray,int,gpa: float}
How can I preserve the name age for the second field, so that y has the same schema as x?
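One thing that might work here is naming the computed field explicitly with an AS clause inside FOREACH ... GENERATE. This is only a minimal sketch against the relation x loaded above, not a confirmed solution:

```pig
-- Sketch: alias the computed expression so the second field keeps the name "age".
y = foreach x generate name, (age == 18 ? 999 : age) as age, gpa;
-- Check whether the schema now lists age for the second field.
describe y;
```

If the alias is applied as expected, describe y should report the second field with the name age rather than an unnamed int.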
Also, is there an easy way to preserve every column in the dataset except for the old version of this one? (i.e. a star expression or project-range expression that ignores one field).
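A possible sketch for this part uses a project-range expression, assuming a Pig version that supports the .. syntax (0.9 or later, as far as I know). It does not exclude a field by name; it only avoids listing every field after the replaced one:

```pig
-- Sketch: keep name, replace age in place, then project gpa and every field after it.
y = foreach x generate name, (age == 18 ? 999 : age) as age, gpa ..;
describe y;
```

With only three fields the range is trivial, but on a wider relation the trailing gpa .. would carry the remaining columns along without naming each one.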
Or is there a better way to go about this altogether?