14
votes

I have a concept I hope you can help to clarify:

What's the difference between the following three ways of referring to a column in PySpark dataframe. I know different situations need different forms, but not sure why.

  1. df.col: e.g. F.count(df.col)
  2. df['col']: e.g. df['col'] == 0
  3. F.col('col'): e.g. df.filter(F.col('col').isNull())

Thanks a lot!

1

1 Answers

21
votes

In most practical applictions, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.

We can illustrate with a small example:

df = spark.createDataFrame(
    [(1,'a', 0), (2,'b',None), (None,'c',3)], 
    ['col', '2col', 'third col']
)

df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|   2|   b|     null|
#|null|   c|        3|
#+----+----+---------+

1. df.col

This is the least flexible. You can only reference columns that are valid to be accessed using the . operator. This rules out column names containing spaces or special characters and column names that start with an integer.

This syntax makes a call to df.__getattr__("col").

print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
#        >>> df.select(df.age).collect()
#        [Row(age=2), Row(age=5)]
#
#        .. versionadded:: 1.3

Using the . syntax, you can only access the first column of this example dataframe.

>>> df.2col
  File "<ipython-input-39-8e82c2dd5b7c>", line 1
    df.2col
       ^
SyntaxError: invalid syntax

Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.

2. df["col"]

This makes a call to df.__getitem__. You have some more flexibility in that you can do everything that __getattr__ can do, plus you can specify any column name.

df["2col"]
#Column<2col> 

Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column specified by the input string is returned.

In addition, you can as pass in multiple columns (as a list or tuple) or column expressions.

from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#|   1|              false|
#|   2|               true|
#|null|              false|
#+----+-------------------+

Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.

Finally, you can also access columns by index:

df[2]
#Column<third col>

3. pyspark.sql.functions.col

This is the Spark native way of selecting a column and returns a expression (this is the case for all column functions) which selects the column on based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

For example, supposed we wanted to make a new column that would take on the either the value from "col" or "third col" based on the value of "2col":

from pyspark.sql.functions import when

df.withColumn(
    'new', 
    f.when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col|      new|
#+----+----+---------+---------+
#|   1|   a|        0|third col|
#|   2|   b|     null|      col|
#|null|   c|        3|third col|
#+----+----+---------+---------+

Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

from pyspark.sql.functions import col
df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#|   1|   a|        0|  0|
#|   2|   b|     null|  2|
#|null|   c|        3|  3|
#+----+----+---------+---+

Because is col() creates the column expression without checking there's two interesting side effects of this.

  1. It can be re-used as it's not df specific
  2. It can be used before the df is assigned
age = col('dob') / 365
if_expr = when(age < 18, 'underage').otherwise('adult')

df1 = df.read.csv(path).withColumn('age_category', if_expr)

df2 = df.read.parquet(path)\
    .select('*', age.alias('age'), if_expr.alias('age_category'))

age generates Column<b'(dob / 365)'>
if_expr generates Column<b'CASE WHEN ((dob / 365) < 18) THEN underage ELSE adult END'>