110
votes

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In Python I can do

data.shape()

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...

6
Put this in a function? - GwydionFR
You mean data.shape for NumPy and Pandas? shape is not a function. - flow2k
What is not ideal? I am not sure what else you would like to accomplish beyond what you already have (except for replacing data.dtypes with data.columns, but it makes little difference). - Melkor.cz

6 Answers

176
votes

You can get its shape with:

print((df.count(), len(df.columns)))
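
For illustration, a minimal runnable sketch; the toy DataFrame and its column names are made up for the example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy DataFrame for demonstration
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

# count() runs a Spark job over the rows; len(df.columns) only reads metadata
print((df.count(), len(df.columns)))  # (3, 2)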
70
votes

Use df.count() to get the number of rows.

35
votes

Add this to your code:

import pyspark

# Monkey-patch a shape() method onto the PySpark DataFrame class
def spark_shape(self):
    return (self.count(), len(self.columns))
pyspark.sql.dataframe.DataFrame.shape = spark_shape

Then you can do

>>> df.shape()
(10000, 10)

But just a reminder that .count() can be very slow for very large tables that have not been persisted.
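
If the DataFrame will be reused afterwards, caching it before the count avoids recomputing its lineage on later actions. A small sketch, assuming the data fits the available storage:

df.cache()         # mark the DataFrame for caching; materialized by the next action
print(df.shape())  # this count() also fills the cache, so later actions are cheaper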

9
votes
print((df.count(), len(df.columns)))

is easier for smaller datasets.

However, if the dataset is huge, an alternative approach would be to use pandas and Arrow to convert the DataFrame to a pandas DataFrame and call shape:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)
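
Note that on Spark 3.x the Arrow flag was renamed; a variant of the same idea using the newer key (the old key still works but is deprecated):

# Spark 3.x name of the Arrow flag
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(df.toPandas().shape)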
3
votes

I don't think there is a function like data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
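
For context, both attributes have one entry per column, so either works for counting; a quick sketch of the difference (the column names here are hypothetical):

data.columns  # e.g. ['id', 'letter'] -- just the column names
data.dtypes   # e.g. [('id', 'bigint'), ('letter', 'string')] -- (name, type) pairs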

0
votes

I have solved this problem using the code block below. Please try it; it works.

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape

print(<the DataFrame whose shape you want>.shape())