get first N elements from dataframe ArrayType column in pyspark

Question

I have a Spark DataFrame with the following rows:

1   |   [a, b, c]
2   |   [d, e, f]
3   |   [g, h, i]

Now I want to keep only the first 2 elements from the array column.

1   |   [a, b]
2   |   [d, e]
3   |   [g, h]

How can that be achieved?

Note: I am not extracting a single array element here, but a part of the array that may contain multiple elements.

Possible duplicate of How to extract an element from a array in pyspark — pault
I already saw that answer, but that is not what I want. I don't want a single item from array, rather I am looking for first N elements. — Vipul Sharma
@pault interestingly enough, linked solution does not seem to work with Spark 2.3.1 (throws exception). Any ideas? — desertnaut
@pault mystery solved! A new user had decided to alter OP's code in linked answer, rendering it wrong (restored it)... — desertnaut
stackoverflow.com/questions/47585279/… not convinced we need to create a tempview for this — thebluephantom

pault · Accepted Answer · 2018-10-25T14:09:53

Here's how to do it with the API functions.

Suppose your DataFrame were the following:

df.show()
#+---+---------+
#| id|  letters|
#+---+---------+
#|  1|[a, b, c]|
#|  2|[d, e, f]|
#|  3|[g, h, i]|
#+---+---------+

df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- letters: array (nullable = true)
# |    |-- element: string (containsNull = true)

You can use square brackets to access elements in the letters column by index, and wrap that in a call to pyspark.sql.functions.array() to create a new ArrayType column.

import pyspark.sql.functions as f

df.withColumn("first_two", f.array([f.col("letters")[0], f.col("letters")[1]])).show()
#+---+---------+---------+
#| id|  letters|first_two|
#+---+---------+---------+
#|  1|[a, b, c]|   [a, b]|
#|  2|[d, e, f]|   [d, e]|
#|  3|[g, h, i]|   [g, h]|
#+---+---------+---------+

Or, if you have too many indices to list, you can use a list comprehension:

df.withColumn("first_two", f.array([f.col("letters")[i] for i in range(2)])).show()
#+---+---------+---------+
#| id|  letters|first_two|
#+---+---------+---------+
#|  1|[a, b, c]|   [a, b]|
#|  2|[d, e, f]|   [d, e]|
#|  3|[g, h, i]|   [g, h]|
#+---+---------+---------+
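
One thing worth keeping in mind with the indexing approach: it always builds an array of exactly N entries. As far as I know, with Spark's default (non-ANSI) settings an out-of-range index evaluates to null, so an array shorter than N would be padded with nulls instead of being shortened. The sketch below is plain Python, not Spark code; it only models that per-row behavior. The name first_n_by_index is hypothetical, None stands in for a SQL null, and the input array is assumed to be non-null.

```python
def first_n_by_index(arr, n):
    """Plain-Python model of f.array([col[i] for i in range(n)]) for one row.

    Assumption: an out-of-range index yields null (None here), which is
    what Spark does by default when ANSI mode is off.
    """
    return [arr[i] if i < len(arr) else None for i in range(n)]


print(first_n_by_index(["a", "b", "c"], 2))  # ['a', 'b']
print(first_n_by_index(["a"], 2))            # ['a', None]
```

If some rows may hold fewer than N elements and you do not want the null padding, that is a reason to look at an alternative that truncates.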

For PySpark 2.4+, you can also use pyspark.sql.functions.slice():

df.withColumn("first_two", f.slice("letters", start=1, length=2)).show()
#+---+---------+---------+
#| id|  letters|first_two|
#+---+---------+---------+
#|  1|[a, b, c]|   [a, b]|
#|  2|[d, e, f]|   [d, e]|
#|  3|[g, h, i]|   [g, h]|
#+---+---------+---------+

slice may perform better for large arrays. Note that its start index is 1-based, not 0-based.
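
To make the 1-based start concrete, here is a small plain-Python model of what slice does to a single row. It is only a sketch, not Spark code: the name slice_model is hypothetical, it covers positive start values only, and it assumes (as I understand the function) that slice returns fewer elements when the array runs out, without padding.

```python
def slice_model(arr, start, length):
    """Plain-Python model of f.slice(col, start, length) for one row.

    Assumptions: start is 1-based and positive; negative starts are not
    modeled here. Fewer than `length` elements are returned if the array
    is too short.
    """
    if start < 1:
        raise ValueError("this sketch only models start >= 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    begin = start - 1  # convert the 1-based start to a 0-based offset
    return arr[begin:begin + length]


print(slice_model(["a", "b", "c"], 1, 2))  # ['a', 'b']
print(slice_model(["a"], 1, 2))            # ['a']
```

So start=1, length=N is the call that corresponds to "the first N elements".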
