
Consider that I have this dataframe in pyspark:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00236|11-03-2014 07:33|[4.5, 2] | 90041   |
|   00234|11-06-2014 05:55|[6.2, 8] | 90037   |
|   00234|11-06-2014 05:55|[5.6, 4] | 90037   |
|   00235|11-09-2014 05:33|[7.5, 6] | 90047   |
+--------+----------------+---------+---------+

How can I write a script that keeps the rows where the first value in the range array is greater than 6? The output should be like this:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00234|11-06-2014 05:55|[6.2, 8] | 90037   |
|   00235|11-09-2014 05:33|[7.5, 6] | 90047   |
+--------+----------------+---------+---------+

I wrote this script:

import pyspark.sql.functions as f
df.filter(f.col("range")[0] > 6)

but I got this error:

AnalysisException: u"Can't extract value from range#12989: need struct type but got vector;"
What is your Spark version? - Equinox
Could you paste your schema (df.schema)? Check whether your range column is an array or a vector. I guess range is a vector type, which can't be accessed with [0], so you would have to convert the range column to an array type. - E.ZY.

1 Answer

df.filter(df.range[0]>6.0).show()

OR

from pyspark.sql.functions import col

df.withColumn("first_element", df.range[0])\
    .filter(col("first_element") > 6.0).drop("first_element").show()

Output:

+--------+----------------+----------+-------+
|DeviceID|       TimeStamp|     range|zipcode|
+--------+----------------+----------+-------+
|   00235|11-09-2014 05:33|[7.5, 6.0]|  90047|
|   00234|11-06-2014 05:55|[6.2, 8.0]|  90037|
+--------+----------------+----------+-------+