I want col4 and col5 to come out as ArrayType, but they are coming out as StringType. This is in PySpark. How can I do this?
The schema I want for the two new columns:
 |-- col4: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- col5: array (nullable = true)
 |    |-- element: string (containsNull = true)
Input DataFrame:
+---+-----------+
| id|      value|
+---+-----------+
|  1| [foo, foo]|
|  2|[bar, tooo]|
+---+-----------+
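After applying the UDF:
+---+-----------+---------------------+
|id |value      |TF_CUS(value)        |
+---+-----------+---------------------+
|1  |[foo, foo] |[[foo], [2]]         |
|2  |[bar, tooo]|[[bar, tooo], [1, 1]]|
+---+-----------+---------------------+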
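After adding col4 and col5:
+---+-----------+---------------------+------+-----------+
|id |value      |TF_CUS               |col4  |col5       |
+---+-----------+---------------------+------+-----------+
|1  |[foo, foo] |[[foo], [2]]         |[2]   |[foo]      |
|2  |[bar, tooo]|[[bar, tooo], [1, 1]]|[1, 1]|[bar, tooo]|
+---+-----------+---------------------+------+-----------+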
Looking forward to seeing solutions.
The schema I currently get:
root
 |-- id: long (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TF_CUS: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col4: string (nullable = true)
 |-- col5: string (nullable = true)
The code:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
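
def TF_CUS(lista):
    from collections import Counter
    # Count how often each distinct element occurs in the list.
    counts = Counter(lista)
    # Return a tuple of two lists: (distinct elements, their counts).
    return (list(counts.keys()), list(counts.values()))

# The UDF's return type is declared as an array of strings.
TF_CUS_cols = udf(TF_CUS, ArrayType(StringType()))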

# `sc` is assumed to be the SparkContext provided by the PySpark shell.
df = sc.parallelize([(1, ["foo", "foo"]), (2, ["bar", "tooo"])]).toDF(["id", "value"])
df.select("*", TF_CUS_cols(df["value"])).show(2, False)

df = df.select("*", TF_CUS_cols(df["value"]).alias("TF_CUS"))
# Split the UDF result into two columns: counts (col4) and distinct elements (col5).
df.withColumn("col4", df["TF_CUS"].getItem(1)).withColumn("col5", df["TF_CUS"].getItem(0)).show(2, False)
df = df.withColumn("col4", df["TF_CUS"].getItem(1)).withColumn("col5", df["TF_CUS"].getItem(0))
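
To make the problem easier to reproduce, here is the same counting logic as plain Python, with no Spark needed. The name `tf_cus_plain` is only illustrative and is not part of my actual code. It shows that the function returns a tuple of two lists, one of strings and one of ints, while the UDF above is declared as ArrayType(StringType()). My guess is that this mismatch is why both items end up as strings, and that the declared return type may need to describe both lists separately (for example a struct with two array fields), but I have not confirmed that.

```python
from collections import Counter


def tf_cus_plain(items):
    """Return (distinct items, their counts) for a list of strings."""
    counts = Counter(items)
    # Counter keeps first-seen order (Python 3.7+), so keys and values line up.
    return list(counts.keys()), list(counts.values())


for row in (["foo", "foo"], ["bar", "tooo"]):
    keys, freqs = tf_cus_plain(row)
    # keys is a list of str, freqs is a list of int: two different element types.
    print(keys, freqs)
```

So the raw Python result already has the shape I want for col5 (strings) and col4 (integers). The types seem to get lost only once the result passes through the UDF's declared return type.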