1 vote

I have the following data as a tab-separated .txt file stored in my blob storage. I'm using pyspark.sql to load the data into Databricks as a pyspark.sql DataFrame.

This is the shape of the data.

df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'Input']  # column labels
)

display(df)

First, I created a schema for the data before loading it:

from pyspark.sql.types import StructType, StructField, StringType

data_schema = [
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
]

mySchema = StructType(fields=data_schema)

I then use the following code to read in the data:

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True)
df.show()

However, when I look at the data, the first column looks fine, but the second column's values show as null.

+----------+-----+
|        LU|Input|
+----------+-----+
|302       | null|
|203       | null|
|202       | null|
|202       | null|
|172       | null|
|172       | null|
+----------+-----+

Does anyone know why the 'Input' variable shows as null? This is just dummy data; with real data that has 30+ variables, only the first variable's values ever load, and everything else is null.

Thanks

could you add sample records from the file? – notNull
@Shu I'm not sure what you mean? – Mrmoleje
add some sample data from the /filepath/filename.txt file. – notNull
@Shu I didn't think you could add a data file to Stack Overflow? How do you do that? – Mrmoleje
@Mrmoleje: Try: df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t'). You can just add some lines as plain text to your question. – cronoik

3 Answers

1 vote

To avoid this issue in the future, consider letting Spark infer the schema once and saving it as JSON; future reads can then load that schema back. This avoids the mistakes that come with writing the schema by hand.

df.schema.json()
0 votes

I worked out what the issue was with my data. In my schema I had:

StructField('Date', DateType())

on a date value like 13/03/2000. Spark's CSV reader expects dates in yyyy-MM-dd format by default, so a value written as 13/03/2000 can't be parsed into a DateType field unless you tell the reader the format via the dateFormat option.

This meant that when the schema was applied, every column came back null, even though only one column had failed to parse.

Hope that helps someone.

0 votes

Since you already have a header in the file, why don't you let Spark infer the schema? I tried it with your sample data and it gave the correct result.

>>> df = spark.read.csv("file:////Users/sam/Downloads/file.txt",  inferSchema=True, header=True, sep='\t')
>>> df.show()
+---+-----+
| LU|Input|
+---+-----+
|302| foo |
|203|  bar|
|202|  foo|
|202|  bar|
|172|  xxx|
|172|  yyy|
+---+-----+

>>> df.printSchema()
root
 |-- LU: integer (nullable = true)
 |-- Input: string (nullable = true)

Databricks sometimes does not show the correct result. So let Spark infer the schema first, then work out what the issue with your hand-written schema is and take corrective action.

I suspect you need LongType (or at least IntegerType) for the first field, but compare it against what Spark inferred. Since I am not sure about the actual file, I am just pointing you in that direction.