I have a PySpark DataFrame column with string data that looks like this:
[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], [12:30pm-1:30pm, 6:00pm-7:00pm, 7:30pm-8:30pm], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], [12:30pm-1:30pm, 6:00pm-7:00pm], [12:30pm-1:30pm, 7:00pm-8:00pm], [10:00am-11:00am, 12:00pm-1:00pm]]
It is a string, but should ideally be an array with 7 elements (Sunday-Saturday).
I wanted to convert this column to an array, so I could access the 0th element:
[Closed]
1st element:
[10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm]
2nd element:
[12:30pm-1:30pm, 6:00pm-7:00pm, 7:30pm-8:30pm]
etc.
I tried to cast the column to array using this code:
from pyspark.sql.functions import col, split
from pyspark.sql.types import ArrayType, StringType

hours = hours.withColumn(
    "hoursArray",
    split(col("hours"), r",\s*").cast(ArrayType(StringType())).alias("ev"))
However, when I try to access element 1 of the resulting array, I get:
[10:30am-11:30am
instead of:
[10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm]
What gives? I assume I'm splitting on commas when I should be splitting on the closing brackets, but my attempts to change the pattern haven't worked. Is there something simple I'm doing wrong? Thank you
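The suspicion about the commas is right: the pattern is applied to the whole string, so `,\s*` matches the commas *inside* each day's list as well as the commas between days. The effect can be reproduced outside Spark with Python's `re` module (a plain-Python sketch of the same regex behavior, not the Spark call itself):

```python
import re

# A shortened version of the sample string from the question.
raw = "[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], [12:30pm-1:30pm, 6:00pm-7:00pm]]"

# Splitting on every comma also breaks apart the times *within* a day,
# so element 1 is only the first fragment of the second day's list.
parts = re.split(r",\s*", raw)
print(parts[1])  # -> [10:30am-11:30am
```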
Comments:
numpy.squeeze might give you the expected result. Haven't tried it, but that should work, I guess. – Aditya
Try splitting on "\],\s*\[". – pansen
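pansen's pattern points at the fix: split on the "], [" boundaries between days rather than on every comma, after stripping the outer brackets. Sketched in plain Python (the same regex should work in PySpark, e.g. something like split(regexp_replace(col("hours"), r"^\[\[|\]\]$", ""), r"\],\s*\[") — untested, but both functions accept these patterns):

```python
import re

# A shortened version of the sample string from the question.
raw = ("[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], "
       "[12:30pm-1:30pm, 6:00pm-7:00pm]]")

# Strip the outer "[[" and "]]", then split only where one day's
# bracketed list ends and the next begins: "], [".
days = re.split(r"\],\s*\[", raw[2:-2])
print(days[0])  # -> Closed
print(days[1])  # -> 10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm
```

Each element is now one day's full schedule, with the commas inside it left intact.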