I would like to use a function to update a column in a relation. I've figured out how to add a new column with my updated data and drop the old column, but the new column loses its field name, which I'd like to preserve.

For example, say students.txt is:

John    18      4.0
Mary    19      3.8
Bill    20      3.9
Joe     18      3.8

In Pig:

x = load 'students.txt' as (name:chararray, age:int, gpa:float);

dump x;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)

describe x;
x: {name: chararray,age: int,gpa: float}


y = foreach x generate name, (age==18?999:age), gpa;

dump y;
(John,999,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,999,3.8)

describe y;
y: {name: chararray,int,gpa: float}

How can I keep the field name age on the second field, so that y has the same schema as x?

Also, is there an easy way to preserve every column in the dataset except for the old version of this one? (i.e. a star expression or project-range expression that ignores one field).

Or is there a better way altogether to go about this?

1 Answer

I found a quick way to do it. The key is to add as [field name] after the expression, which names the generated field in the output schema.

y = foreach x generate name, (age==18?999:age) as age, gpa;

dump y;
(John,999,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,999,3.8)

describe y;
y: {name: chararray,age: int,gpa: float}
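
On the second part of the question: as far as I know, Pig has no star expression that skips a single field, but a project-range expression (..) can carry the surrounding columns through without listing each one. Below is a minimal sketch, assuming a Pig version that supports project-range (I believe 0.9 or later) and the same students.txt schema as in the question. With only three columns it saves nothing, but the same pattern should extend to wider relations.

```pig
-- Same load as in the question.
x = load 'students.txt' as (name:chararray, age:int, gpa:float);

-- ".. name" projects every field up to and including name,
-- the middle expression replaces age and keeps its field name via "as age",
-- and "gpa .." projects every field from gpa through the last one.
y = foreach x generate .. name, (age == 18 ? 999 : age) as age, gpa ..;

describe y;
```

If the replaced column is the first or last field, one of the two ranges can simply be left out.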