Actions vs Transformations
- Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or
other operation that returns a sufficiently small subset of the data.
spark-sql doc
select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.
Parameters: cols – list of column names (string) or expressions
(Column). If one of the column names is ‘*’, that column is expanded
to include all columns in the current DataFrame.
df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
Executing the select(column-name1, column-name2, etc) method on a dataframe returns a new dataframe which holds only the columns that were named in the select() call.
e.g. assuming df has several columns, including "name" and "value" and some others:
df2 = df.select("name","value")
df2 will hold only two columns ("name" and "value") out of all the columns of df.
df2, as the result of select, will stay on the executors and will not be brought to the driver (as happens when using collect()).
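The difference can be illustrated without a cluster. Below is a minimal pure-Python sketch, not Spark code: ToyFrame is a hypothetical class invented for this illustration, and rows_read is a made-up counter. It only mimics the idea that select() describes a new dataset and returns immediately, while collect() is the step that actually materializes rows and hands them to the caller.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; this is NOT Spark's API."""

    def __init__(self, rows, columns=None):
        self._rows = rows          # source rows, a list of dicts
        self._columns = columns    # pending projection; None means all columns
        self.rows_read = 0         # how many rows this frame has materialized

    def select(self, *cols):
        # Transformation: no row is touched, we only record which columns to keep.
        return ToyFrame(self._rows, list(cols))

    def collect(self):
        # Action: the projection is applied now and the rows are returned.
        out = []
        for row in self._rows:
            self.rows_read += 1
            keys = self._columns if self._columns is not None else list(row)
            out.append({k: row[k] for k in keys})
        return out


df = ToyFrame([
    {"name": "Alice", "value": 1, "age": 2},
    {"name": "Bob", "value": 7, "age": 5},
])

df2 = df.select("name", "value")   # another ToyFrame; nothing computed yet
assert df2.rows_read == 0

result = df2.collect()             # rows are materialized only here
print(result)
```

In real Spark the analogous split is that df.select(...) gives back a DataFrame whose data remains distributed, and only an action such as collect() or show() triggers the work.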
sql-programming-guide
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+
You can run collect() on a dataframe (spark docs):
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
spark docs
To print all elements on the driver, one can use the collect() method
to first bring the RDD to the driver node thus:
rdd.collect().foreach(println). This can cause the driver to run out
of memory, though, because collect() fetches the entire RDD to a
single machine; if you only need to print a few elements of the RDD, a
safer approach is to use the take(): rdd.take(100).foreach(println).
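The memory argument in that quote can be sketched in plain Python. This is only an analogy under stated assumptions, not how Spark implements either call: partition_stream, take, collect, and the pulled list are all hypothetical helpers written for this example. The point is that a take-style call stops after n elements, while a collect-style call pulls every element into one place.

```python
from itertools import islice

pulled = []  # records every element that was actually fetched


def partition_stream(n):
    # Hypothetical stand-in for a large dataset that lives elsewhere.
    for i in range(n):
        pulled.append(i)
        yield i


def take(stream, n):
    # Fetches at most n elements and then stops reading.
    return list(islice(stream, n))


def collect(stream):
    # Fetches every element into a single in-memory list.
    return list(stream)


first = take(partition_stream(1_000_000), 3)
fetched_by_take = len(pulled)

pulled.clear()
everything = collect(partition_stream(10))
fetched_by_collect = len(pulled)

print(first, fetched_by_take)          # [0, 1, 2] 3
print(len(everything), fetched_by_collect)  # 10 10
```

In PySpark the corresponding calls are rdd.take(100) and rdd.collect(), both of which return a Python list to the driver.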