Hi,
I am having trouble using non-StringType as a part of the schema that I use in loading a csv file to create a dataframe.
I was expecting the given schema to be used to convert each field of each record to corresponding data type on the fly while loading.
Instead, all I get is null values.
Here is a simplified way of how to reproduce my problem. In this example, there is a small csv file with four columns that I want to treat, correspondingly, as str, date, int, and bool:
python
Python 3.6.5 (default, Jun 17 2018, 12:13:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import *
>>>
>>> data_flnm = 'four_cols.csv'
>>> lines = [ln.rstrip() for ln in open(data_flnm).readlines()[:3]]
>>> lines
['zzzc7c09:66d7:47d6:9415:87e5010fe282|2019-04-08|0|f', 'zzz304fa:6fc0:4337:91d0:05ef4657a6db|2019-07-08|1|f', 'yy251cf0:aa11:44e9:88f4:f6f9c1899cee|2019-05-13|0|t']
>>> parts = [ln.split("|") for ln in lines]
>>> parts
[['zzzc7c09:66d7:47d6:9415:87e5010fe282', '2019-04-08', '0', 'f'], ['zzz304fa:6fc0:4337:91d0:05ef4657a6db', '2019-07-08', '1', 'f'], ['yy251cf0:aa11:44e9:88f4:f6f9c1899cee', '2019-05-13', '0', 't']]
>>> cols1 = [StructField('u_id', StringType(), True), StructField('week', StringType(), True), StructField('flag_0_1', StringType(), True), StructField('flag_t_f', StringType(), True)]
>>> cols2 = [StructField('u_id', StringType(), True), StructField('week', DateType(), True), StructField('flag_0_1', IntegerType(), True), StructField('flag_t_f', BooleanType(), True)]
>>> sch1 = StructType(cols1)
>>> sch2 = StructType(cols2)
>>> sch1
StructType(List(StructField(u_id,StringType,true),StructField(week,StringType,true),StructField(flag_0_1,StringType,true),StructField(flag_t_f,StringType,true)))
>>> sch2
StructType(List(StructField(u_id,StringType,true),StructField(week,DateType,true),StructField(flag_0_1,IntegerType,true),StructField(flag_t_f,BooleanType,true)))
>>> spark_sess = SparkSession.builder.appName("xyz").getOrCreate()
19/09/10 19:32:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> df1 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch1)
>>> df2 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch2)
>>> df1.show(5)
+--------------------+----------+--------+--------+
| u_id| week|flag_0_1|flag_t_f|
+--------------------+----------+--------+--------+
|zzzc7c09:66d7:47d...|2019-04-08| 0| f|
|zzz304fa:6fc0:433...|2019-07-08| 1| f|
|yy251cf0:aa11:44e...|2019-05-13| 0| t|
|yy1d2f8e:d8f0:4db...|2019-07-08| 1| f|
|zzz5ccad:2cf6:44e...|2019-05-20| 1| f|
+--------------------+----------+--------+--------+
only showing top 5 rows
>>> df2.show(5)
+----+----+--------+--------+
|u_id|week|flag_0_1|flag_t_f|
+----+----+--------+--------+
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
+----+----+--------+--------+
only showing top 5 rows
>>>
I tried a few different versions of .read(...)....load(...) code. None produce the expected result. Please advice. Thank you!
PS: could not add tags "structfield" and "structtype" : not enough reputation (__.