1 vote

I have the following data as a tab-separated .txt file stored in my blob storage. I'm using pyspark.sql to load the data into Databricks as a pyspark.sql DataFrame.

This is the shape of the data.

df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'Input']  # column labels
)

display(df)

First, I created a schema for the data before loading it:

from pyspark.sql.types import StructType, StructField, StringType

data_schema = [
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
]

mySchema = StructType(fields=data_schema)

I then use the following code to read in the data:

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True)
df.show()

However, when I look at the data, the first column looks fine, but the second column's values show as null.

+----------+-----+
|        LU|Input|
+----------+-----+
|302       | null|
|203       | null|
|202       | null|
|202       | null|
|172       | null|
|172       | null|
+----------+-----+

Does anyone know why the 'Input' variable shows as null? This is just dummy data; with real data that has 30+ variables, only the first variable's values ever load, and everything else is null.

Thanks

could you add sample records from the file? – notNull
@Shu I'm not sure what you mean? – Mrmoleje
add some sample data from the /filepath/filename.txt file. – notNull
@Shu I didn't think you could add a data file to Stack Overflow? How do you do that? – Mrmoleje
@Mrmoleje: Try: df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t'). You can just add some lines as plain text to your question. – cronoik

3 Answers

1 vote

To avoid this issue in the future, consider letting Spark infer the schema once and saving it as JSON; future reads can then load that schema back. This avoids the mistakes that come with writing the schema by hand.

df.schema.json()
0 votes

I worked out what the issue was with my data. In my schema I had:

StructField('Date', DateType())

on a date value like 13/03/2000. Spark's CSV reader expects dates in yyyy-MM-dd format by default, so a value written as 13/03/2000 can't be parsed into a DateType field unless you tell the reader the format via the dateFormat option.

This meant that when the schema was applied, every column came back null, even though only one column had failed to parse.

Hope that helps someone.

0 votes

Since you already have a header in the file, why don't you let Spark infer the schema? I tried it with your sample data and it gave the correct result.

>>> df = spark.read.csv("file:////Users/sam/Downloads/file.txt",  inferSchema=True, header=True, sep='\t')
>>> df.show()
+---+-----+
| LU|Input|
+---+-----+
|302| foo |
|203|  bar|
|202|  foo|
|202|  bar|
|172|  xxx|
|172|  yyy|
+---+-----+

>>> df.printSchema()
root
 |-- LU: integer (nullable = true)
 |-- Input: string (nullable = true)

Databricks sometimes does not show the correct result. So let Spark infer the schema first, then work out what the issue with your hand-written schema is and take corrective action.

I suspect you need LongType (or at least IntegerType) for the first field, but compare it against what Spark inferred. Since I am not sure about the actual file, I am just pointing you in that direction.