
While loading a CSV via the Databricks CSV reader, the 4th column of the 2nd row is not loaded. The number of columns in the CSV varies per row.

In test_01.csv:

a,b,c
s,d,a,d
f,s

I loaded the above CSV file via Databricks as below:

>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
  1. Tried loading with textFile:

rdd = sc.textFile("sample_files/test_01.csv")

rdd.collect()

[u'a,b,c', u's,d,a,d', u'f,s']

But conversion of the above RDD to a DataFrame causes an error, because the rows have unequal lengths. One possible workaround is sketched below.
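
A minimal sketch of such a workaround, assuming all values can be read as strings (the names split_rdd, max_cols, and the C0..Cn column labels are my own): pad every row with None up to the widest width, then convert with an explicit schema.

from pyspark.sql.types import StructType, StructField, StringType

# Split each line, pad shorter rows with None to the widest width,
# then convert with an explicit all-string schema.
split_rdd = rdd.map(lambda line: line.split(","))
max_cols = split_rdd.map(len).max()

padded = split_rdd.map(lambda f: f + [None] * (max_cols - len(f)))
schema = StructType([StructField("C%d" % i, StringType(), True)
                     for i in range(max_cols)])
df = sqlContext.createDataFrame(padded, schema)
df.show()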

  2. Was able to solve it by specifying the schema, as below (the schema definition itself is sketched after the output):

df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")

df2.show()

+---+---+----+----+----+
| e1| e2|  e3|  e4|  e5|
+---+---+----+----+----+
|  a|  b|   c|null|null|
|  s|  d|   a|   d|null|
|  f|  s|null|null|null|
+---+---+----+----+----+
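
For reference, the schema variable used above could have been defined like this (a minimal sketch inferred from the output: five nullable string fields named e1..e5):

from pyspark.sql.types import StructType, StructField, StringType

# Five nullable string columns, matching the widest row in the file.
schema = StructType([
    StructField("e1", StringType(), True),
    StructField("e2", StringType(), True),
    StructField("e3", StringType(), True),
    StructField("e4", StringType(), True),
    StructField("e5", StringType(), True),
])
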
  3. Tried with inferSchema; still not working (inferSchema only infers column types; the number of columns is still taken from the first row):

df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")

df2.show()

+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+

But is there any other way to do this without specifying the schema, since the number of columns varies?

Can you try .option("inferSchema", "true"), i.e. sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")? – Artem Trunov
Already tried that, not working. – Eva Mariam

1 Answer


Ensure you have fixed headers, i.e. rows can have data missing, but the column names should be fixed.

If you don't specify column names, you can still create the schema while reading the CSV:

import org.apache.spark.sql.types._

val schema = new StructType()
    .add(StructField("keyname", StringType, true))  // repeat .add(...) for each expected column
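
Since the question uses PySpark, here is a minimal sketch of the same idea that also avoids hard-coding the column count: scan the file once for the widest row, generate an all-string schema of that width (the C0..Cn names are placeholders of my choosing), and pass it to the reader.

from pyspark.sql.types import StructType, StructField, StringType

# Find the widest row in the file, then generate a schema to match.
max_cols = sc.textFile("sample_files/test_01.csv") \
             .map(lambda line: len(line.split(","))) \
             .max()

schema = StructType([StructField("C%d" % i, StringType(), True)
                     for i in range(max_cols)])

df2 = sqlContext.read.format("com.databricks.spark.csv") \
                .schema(schema) \
                .load("sample_files/test_01.csv")
df2.show()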