Is it possible to update a HiveContext DataFrame column in PySpark using a complex function that is not doable in a UDF?
I have a DataFrame containing many columns, two of which are called timestamp and data. I need to retrieve the timestamp from within the JSON string in data and update the timestamp column if the timestamp in data fulfills certain criteria. I know that DataFrames are immutable, but is it possible to somehow build a new DataFrame that retains all the columns of the old one but updates the timestamp column?
Code illustrating what I would like to do:
def updateTime(row):
    import json
    THRESHOLD_TIME = 60 * 30  # 30 minutes, in seconds
    client_timestamp = json.loads(row['data'])
    client_timestamp = float(client_timestamp['timestamp'])
    server_timestamp = float(row['timestamp'])
    if server_timestamp - client_timestamp <= THRESHOLD_TIME:
        new_row = .....  # copy contents of row
        new_row['timestamp'] = client_timestamp
        return new_row
    else:
        return row

df = df.map(updateTime)
I thought of mapping each row to a tuple and then converting the result back to a DataFrame with .toDF(), but I can't find a way to copy the row contents into a tuple and then get the column names back.
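To be concrete, this is roughly the shape I imagine a solution taking (an untested sketch; sqlContext stands for the active HiveContext, and it assumes the timestamp column is numeric, otherwise the new value would need to be cast back to a string):

cols = df.columns  # capture the column order on the driver

def updateTimeTuple(row):
    import json
    THRESHOLD_TIME = 60 * 30
    d = row.asDict()  # copy the Row's fields into a plain dict
    client_timestamp = float(json.loads(d['data'])['timestamp'])
    if float(d['timestamp']) - client_timestamp <= THRESHOLD_TIME:
        d['timestamp'] = client_timestamp  # overwrite only this one field
    # rebuild a plain tuple in the original column order so the old schema still lines up
    return tuple(d[c] for c in cols)

new_df = sqlContext.createDataFrame(df.rdd.map(updateTimeTuple), df.schema)

Whether something like this is the right way to preserve the schema, or whether there is a better approach entirely, is exactly what I am unsure about.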