In PySpark I have a DataFrame composed of two columns.
Assume each inner array holds, in order: timestamp, email, phone number, first name, last name, address, city, country, randomId.
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| str1                    | array_of_str                                                                                                                                                                                                                                                                                                                                                                          |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| random column data1     | [['2020-01-26 17:30:57.000 +0000', '', '728-802-5766', '', '', '7th street crossroads', 'seattle', '', 'randomId104'], ['2019-07-20 20:54:57.000 +0000', '[email protected]', '728-802-5766', 'Katuscha', '', '', '', 'us', 'randomId225'], ['2015-12-04 04:54:57.000 +0000', '[email protected]', '728-802-5766', '', 'Othen', '7th street crossroads', 'seattle', '', 'randomid306']] |
| random column data2     | [['2021-01-30 17:30:04.000 +0000', '[email protected]', '313-984-9692', '', '', '111th Ave NE', 'New york', 'us', 'randomId563'], ['2018-05-15 20:44:57.000 +0000', '[email protected]', '', 'Marianne', 'Allmann', '', '', 'us', 'randomId884']]                                                                                                                                       |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I am expecting an output DataFrame like the one below:
+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| str1                    | array_of_str                                                                                                                                    |
+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| random column data1     | ['2020-01-26 17:30:57.000 +0000', '[email protected]', '728-802-5766', 'Katuscha', 'Othen', '7th street crossroads', 'seattle', 'us', 'randomid306'] |
| random column data2     | ['2021-01-30 17:30:04.000 +0000', '[email protected]', '313-984-9692', 'Marianne', 'Allmann', '111th Ave NE', 'New york', 'us', 'randomId884']       |
+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
Optional: the existing data in the array of arrays might not already be sorted in decreasing timestamp order. How do I sort the array of arrays in decreasing timestamp order?
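For the sorting part, since the timestamp is the first element of each inner array and is stored as a string in a format that sorts correctly, I was thinking a plain Python UDF might work. This is just a sketch of what I have in mind (sort_desc is a name I made up, and df stands for the DataFrame above):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Sort the inner arrays by their first element (the timestamp string),
    # newest first. Guard against a null/empty array.
    sort_desc = F.udf(
        lambda arr: sorted(arr, key=lambda rec: rec[0], reverse=True) if arr else arr,
        ArrayType(ArrayType(StringType())),
    )

    df = df.withColumn("array_of_str", sort_desc(F.col("array_of_str")))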
Here I am planning to write a UDF that pulls the latest non-null (timestamp, email, phone number, first name, last name, address, city, country) values from the array of arrays. For randomId, I will always pull the randomId associated with the oldest record in the system.
Example: for random column data1, the email [email protected] is populated from the second element of the array, since the first element has an empty email.
The same applies to the other columns.
For randomId, randomid306 in the first row comes from the oldest entry, so that is what appears in my output DataFrame.
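Putting the plan together, this is roughly the UDF I have in mind (an untested sketch; it assumes every inner array is array<string> with the nine fields in the order listed above, treats empty strings as null, and merge_records/merge_udf are names I made up):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    def merge_records(records):
        if not records:
            return records
        # Newest record first; ISO-style timestamp strings sort correctly.
        records = sorted(records, key=lambda rec: rec[0], reverse=True)
        # Fields 0..7 (timestamp .. country): take the latest non-empty value.
        merged = [next((rec[i] for rec in records if rec[i]), "") for i in range(8)]
        # Field 8 (randomId): always take it from the oldest record.
        merged.append(records[-1][8])
        return merged

    merge_udf = F.udf(merge_records, ArrayType(StringType()))
    result = df.withColumn("array_of_str", merge_udf(F.col("array_of_str")))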
Specifically, in the UDF:
1) How do I sort the array-of-arrays elements in descending timestamp order? (This is the optional step from above.)
2) How do I iterate over the array-of-arrays column in the DataFrame?
3) How do I access individual items of the inner arrays in a UDF?
For example, in plain Python I can iterate over a list of lists like this:

    for item in items:
        print(item[0], item[1])

How can I achieve something similar for an array-of-arrays column in PySpark?
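My understanding (please correct me if wrong) is that inside a Python UDF an array<array<string>> column arrives as a plain Python list of lists, so the same indexing should work. For example (emails_udf is just an illustrative name):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Each 'items' value is a list of lists; item[1] is the email field.
    emails_udf = F.udf(
        lambda items: ",".join(item[1] for item in items if item[1]),
        StringType(),
    )
    df.select(emails_udf("array_of_str")).show(truncate=False)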
Can I do the above steps in PySpark without converting the data to a pandas DataFrame?
Spark version: 2.4.3, Python version: 3.6.8