0 votes

I have a PySpark DataFrame column with string data that looks like this:

[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], [12:30pm-1:30pm, 6:00pm-7:00pm, 7:30pm-8:30pm], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], [12:30pm-1:30pm, 6:00pm-7:00pm], [12:30pm-1:30pm, 7:00pm-8:00pm], [10:00am-11:00am, 12:00pm-1:00pm]]

It is a string, but should ideally be an array with 7 elements (Sunday-Saturday).

I wanted to convert this column to an array, so I could access the 0th element:

[Closed]

1st element:

[10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm]

2nd element:

[12:30pm-1:30pm, 6:00pm-7:00pm, 7:30pm-8:30pm]

etc.

I tried to cast the column to array using this code:

hours = hours.withColumn(
    "hoursArray",
    split(col("hours"), ",\s*").cast(ArrayType(StringType())).alias("ev"))

However, when I try to access the 1st element of the resulting array, I get:

[10:30am-11:30am

instead of:

[10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm]

What gives? I'm assuming I'm splitting on commas instead of splitting on closing brackets, but I've tried changing that with no luck. Is there something simple I'm doing wrong? Thank you.
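The truncation can be reproduced outside Spark with Python's re.split on a shortened sample of the same string format (the two-day sample below is illustrative, not from my actual data), which shows that the pattern matches every comma, including the ones inside a single day's list:

```python
import re

# Shortened sample of the same string format (two days only).
sample = "[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm]]"

# ",\s*" matches every comma, including the ones separating
# time ranges within a single day's bracketed list.
parts = re.split(r",\s*", sample)
print(parts[1])  # [10:30am-11:30am
```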

I'd break it all the way down into each element you'll need and then build the lists back up again, if that makes sense. (SuperStew)
I think you can convert this into a numpy array and then use numpy.squeeze. That might give you the expected result. I haven't tried it, but I'd guess it works. (Aditya)
Your solution works for me when I use "\],\s*\[". (pansen)
I apologize, but I don't do Python, so I can only guess here: I think you should use a regex match, if that exists, rather than split. Split will divide the string but remove the matching parts, and you want to keep a portion of the matching parts; a proper pattern match will do this. (Jeff)

1 Answer

0 votes

@pansen's comment fixed this. It was a regex fix, super clean: split on the "], [" boundaries between days instead of on every comma. The corrected code is:

from pyspark.sql.functions import col, split

hours = hours.withColumn(
    "hoursArray",
    split(col("hours"), r"\],\s*\["))

(split already returns an array of strings, so the cast to ArrayType(StringType()) and the alias were unnecessary. Note the first and last elements will still carry the outer [[ and ]]; those can be stripped with regexp_replace before splitting if needed.)
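As a sanity check outside Spark, the same "\],\s*\[" pattern can be exercised with Python's re.split on the sample string from the question; trimming the outer "[[" and "]]" first (Spark's split would otherwise leave them on the first and last elements) gives one element per day:

```python
import re

hours_str = ("[[Closed], [10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], "
             "[12:30pm-1:30pm, 6:00pm-7:00pm, 7:30pm-8:30pm], "
             "[10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm], "
             "[12:30pm-1:30pm, 6:00pm-7:00pm], "
             "[12:30pm-1:30pm, 7:00pm-8:00pm], "
             "[10:00am-11:00am, 12:00pm-1:00pm]]")

# Trim the outer "[[" and "]]", then split on the "], [" day boundaries.
inner = hours_str[2:-2]
days = re.split(r"\],\s*\[", inner)

print(len(days))  # 7 -- one element per day, Sunday-Saturday
print(days[0])    # Closed
print(days[1])    # 10:30am-11:30am, 12:30pm-1:30pm, 7:00pm-8:00pm
```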