0 votes

I'm trying to manually create a Spark DataFrame with a single column DT and one row containing the date 2020-01-01:

DT
=======
2020-01-01

However, it fails with a list index out of range error:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder\
        .master(f'spark://{IP}:7077')\
        .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
        .appName('g data')\
        .getOrCreate()

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

Traceback:

 in brand_tagging_since_until(spark, since, until)
---> 81         dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

/usr/local/bin/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/bin/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    419             if isinstance(schema, (list, tuple)):
    420                 for i, name in enumerate(schema):
--> 421                     struct.fields[i].name = name
    422                     struct.names[i] = name
    423             schema = struct
Comments:

What are the row and columns you're trying to create? Is DT the column name in a single-column dataframe, or a value in the row? – Nick Becker
DT is the column name with type datetime; it should have a single row of 2020-01-01. – ca9163d9
Thanks. I'll add an answer. There are two separate wrinkles here. – Nick Becker

2 Answers

2 votes

There are two issues here, though only one surfaces in your example. The immediate issue is that the constructor expects a trailing comma after the value, so that each element of the list is a one-element tuple; without the comma, the parentheses are just grouping. But naively adding the comma will then silently fail, because the constructor doesn't know what to do with a pandas Timestamp object.
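
To see the first wrinkle in isolation, here's a minimal plain-Python sketch (the val name just mirrors the snippet below) showing that parentheses without a trailing comma don't create a tuple:

import pandas as pd

val = pd.to_datetime('2020-1-1')
print(type((val)))   # parentheses alone are just grouping: still a Timestamp, not a tuple
print(type((val,)))  # <class 'tuple'>: the trailing comma makes a one-element tuple

With the comma added, the second wrinkle shows up: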

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val,)],
    schema=["DT"]
).show()
+---+
| DT|
+---+
| []|
+---+

You'll want to convert this to a raw Python datetime object beforehand if you want to use the constructor like this.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

With that said, it's not clear to me where this is most cleanly documented. If you're curious, you can see this requirement in the Spark codebase, or in the source code docs.

If you pass a pandas DataFrame to the constructor, this is handled under the hood.

df = pd.DataFrame({"DT": [val]})
spark.createDataFrame(
    data=df
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
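
Either way, it's worth confirming the inferred column type with printSchema. A quick check (assuming the df defined just above) should report DT as a timestamp:

spark.createDataFrame(data=df).printSchema()
root
 |-- DT: timestamp (nullable = true)
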
1 vote

A more straightforward way to create the dataframe without relying on pandas:

import pyspark.sql.functions as F

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
             .withColumn('DT', F.col('DT').cast('timestamp'))

dates.show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
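
Equivalently, if your Spark version has to_timestamp (available since Spark 2.2, and with no format argument it behaves like a cast to timestamp), the conversion can be written in one step; a minimal variant of the same approach:

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
             .withColumn('DT', F.to_timestamp('DT'))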