I have the following data as a .txt file in tab-separated format, stored in my blob storage. I'm using pyspark.sql to load the data into Databricks as a pyspark.sql.DataFrame.
This is the shape of the data:

df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'Input']  # column labels
)
display(df)
First, I created a schema for the data before loading it:
from pyspark.sql.types import *

data_schema = [
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
]
mySchema = StructType(fields=data_schema)
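(As an aside, not needed for the load below: since LU holds numeric values, a variant schema could type it as an integer instead of a string. This is purely illustrative, using a hypothetical alt_schema name; the read below keeps mySchema as defined above.)

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical variant schema: reads LU as an integer rather than a string.
alt_schema = StructType([
    StructField('LU', IntegerType(), True),
    StructField('Input', StringType(), True),
])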
I then use the following code to read in the data:
df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True)
df.show()
However, when I look at the data, the first column looks fine, but the second column's values show as null.
+----------+-----+
|        LU|Input|
+----------+-----+
|       302| null|
|       203| null|
|       202| null|
|       202| null|
|       172| null|
|       172| null|
+----------+-----+
Does anyone know why the 'Input' variable shows as null? This is just dummy data; when using real data that has 30+ variables, only the first variable's values ever load, and everything else is null.
Thanks
Can you show a few lines of the /filepath/filename.txt file? – notNull

Try df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t'). You can just add some lines as plain text to your question. – cronoik
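For anyone hitting the same issue: spark.read.csv splits on a comma by default, so a tab-separated line is never split; the whole line lands in the first schema column and every remaining column comes back null. Passing sep='\t', as suggested in the comments, fixes it. Here is a minimal end-to-end sketch, assuming the same placeholder path and schema from the question and the spark session Databricks provides:

from pyspark.sql.types import StructType, StructField, StringType

# Same schema as in the question.
mySchema = StructType([
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
])

# sep='\t' makes the CSV reader split each line on tabs instead of the
# default comma, so every column defined in the schema gets populated.
df = spark.read.csv(
    "/filepath/filename.txt",  # placeholder path from the question
    schema=mySchema,
    header=True,
    sep='\t',
)
df.show()

The same reasoning covers the real 30+ variable file: with the default comma delimiter, each whole line is parsed as a single field, which fills the first column and leaves every other schema column null.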