
While loading a CSV via the Databricks CSV reader, the 4th column of the 2nd row is not loaded. The number of columns in the CSV varies per row.

In test_01.csv:

a,b,c
s,d,a,d
f,s

I loaded the above CSV file via Databricks as below:

>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
  1. Tried loading with textFile:

rdd = sc.textFile("sample_files/test_01.csv")

rdd.collect()

[u'a,b,c', u's,d,a,d', u'f,s']

But conversion of the above RDD to a DataFrame causes an error, because the rows have unequal lengths. One possible workaround is sketched below.
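
A minimal sketch of such a workaround, assuming all values can be read as strings (the names split_rdd, max_cols, and the C0..Cn column labels are my own): pad every row with None up to the widest width, then convert with an explicit schema.

from pyspark.sql.types import StructType, StructField, StringType

# Split each line, pad shorter rows with None to the widest width,
# then convert with an explicit all-string schema.
split_rdd = rdd.map(lambda line: line.split(","))
max_cols = split_rdd.map(len).max()

padded = split_rdd.map(lambda f: f + [None] * (max_cols - len(f)))
schema = StructType([StructField("C%d" % i, StringType(), True)
                     for i in range(max_cols)])
df = sqlContext.createDataFrame(padded, schema)
df.show()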

  2. Was able to solve it by specifying the schema, as below (the schema definition itself is sketched after the output):

df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")

df2.show()

+---+---+----+----+----+
| e1| e2|  e3|  e4|  e5|
+---+---+----+----+----+
|  a|  b|   c|null|null|
|  s|  d|   a|   d|null|
|  f|  s|null|null|null|
+---+---+----+----+----+
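
For reference, the schema variable used above could have been defined like this (a minimal sketch inferred from the output: five nullable string fields named e1..e5):

from pyspark.sql.types import StructType, StructField, StringType

# Five nullable string columns, matching the widest row in the file.
schema = StructType([
    StructField("e1", StringType(), True),
    StructField("e2", StringType(), True),
    StructField("e3", StringType(), True),
    StructField("e4", StringType(), True),
    StructField("e5", StringType(), True),
])
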
  3. Tried with inferSchema; still not working (inferSchema only infers column types; the number of columns is still taken from the first row):

df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")

df2.show()

+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+

But is there any other way to do this without specifying the schema, since the number of columns varies?

Can you try .option("inferSchema", "true"), i.e. sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")? – Artem Trunov
Already tried that, not working. – Eva Mariam

1 Answer


Ensure you have fixed headers, i.e. rows can have data missing, but the column names should be fixed.

If you don't specify column names, you can still create the schema while reading the CSV:

import org.apache.spark.sql.types._

val schema = new StructType()
    .add(StructField("keyname", StringType, true))  // repeat .add(...) for each expected column
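
Since the question uses PySpark, here is a minimal sketch of the same idea that also avoids hard-coding the column count: scan the file once for the widest row, generate an all-string schema of that width (the C0..Cn names are placeholders of my choosing), and pass it to the reader.

from pyspark.sql.types import StructType, StructField, StringType

# Find the widest row in the file, then generate a schema to match.
max_cols = sc.textFile("sample_files/test_01.csv") \
             .map(lambda line: len(line.split(","))) \
             .max()

schema = StructType([StructField("C%d" % i, StringType(), True)
                     for i in range(max_cols)])

df2 = sqlContext.read.format("com.databricks.spark.csv") \
                .schema(schema) \
                .load("sample_files/test_01.csv")
df2.show()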