
I want to put aggregated data into memory, but I'm getting an error. Any suggestions?

from pyspark import StorageLevel

orders = spark.read.json("/user/order_items_json")

df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id")

df_2.persist(StorageLevel.MEMORY_ONLY)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'persist'


1 Answer


Spark requires an aggregation expression on grouped data: groupBy returns a GroupedData object, not a DataFrame, so it has no persist method.

If you don't need any aggregation on the grouped data, you can apply a dummy aggregation such as first or count and then drop the aggregated column with .select, like below:

import pyspark
from pyspark.sql.functions import first, lit

# dummy aggregation with first(), then drop the aggregated column
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").agg(first(lit("1"))).select("order_item_order_id")
# or use count() as the dummy aggregation
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").count().select("order_item_order_id")

# df_2 is now a DataFrame, so persist works
df_2.persist(pyspark.StorageLevel.MEMORY_ONLY)
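
Note that persist is lazy, so nothing is cached until an action runs. As a quick sanity check (a minimal sketch, not part of the original answer), you can trigger an action and read back the storage level:

# persist() only marks the DataFrame; the first action materializes the cache
df_2.count()              # populates the in-memory cache
print(df_2.storageLevel)  # confirms the MEMORY_ONLY level was applied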