When loading a CSV file with the Databricks spark-csv package, the 4th column of the 2nd row is not loaded. The number of columns in the CSV varies per row.
Contents of test_01.csv:
a,b,c
s,d,a,d
f,s
I loaded the above CSV file via spark-csv as below:
>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
- Tried loading with textFile:
rdd = sc.textFile("sample_files/test_01.csv")
rdd.collect()
[u'a,b,c', u's,d,a,d', u'f,s']
But converting the above RDD to a DataFrame causes an error.
- Was able to solve it by specifying the schema as below:
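The conversion most likely fails because splitting each line yields rows of different lengths, while a DataFrame needs the same number of fields in every row. A minimal sketch of one workaround, assuming it is acceptable to pad short rows with None; `pad_rows` and the generated column names are illustrative helpers, not part of Spark or spark-csv:

```python
def pad_rows(lines, sep=","):
    """Split each line on `sep` and pad short rows with None to a common width."""
    rows = [line.split(sep) for line in lines]
    width = max(len(row) for row in rows) if rows else 0
    return [row + [None] * (width - len(row)) for row in rows]


lines = ["a,b,c", "s,d,a,d", "f,s"]  # same content as test_01.csv
rows = pad_rows(lines)
names = ["C%d" % i for i in range(len(rows[0]))]

# With a live SQLContext (not created here), the padded rows and names could
# then be handed to sqlContext.createDataFrame, ideally with an explicit
# all-string schema so nothing depends on type inference over None values.
```

This naive split ignores quoted fields that contain commas, and it works on lines already collected to the driver, so it only suits small files.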
df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
df2.show()
+---+---+----+----+----+
| e1| e2|  e3|  e4|  e5|
+---+---+----+----+----+
|  a|  b|   c|null|null|
|  s|  d|   a|   d|null|
|  f|  s|null|null|null|
+---+---+----+----+----+
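The `schema` variable is not shown in the snippet. A sketch of how it might have been built, assuming five nullable string columns named e1 to e5 as in the output; `column_names` and `build_schema` are hypothetical helpers:

```python
def column_names(n, prefix="e"):
    """Return n column names: e1, e2, ..., en."""
    return ["%s%d" % (prefix, i) for i in range(1, n + 1)]


def build_schema(n):
    """Build an all-string, nullable schema with n columns."""
    # Imported lazily so the naming helper works without pyspark installed.
    from pyspark.sql.types import StructType, StructField, StringType
    return StructType([StructField(name, StringType(), True)
                       for name in column_names(n)])


names = column_names(5)
# schema = build_schema(5)  # then pass it to .schema(schema) before .load(...)
```

The column count passed to `build_schema` has to be at least the widest row in the file, so it still needs to be known or computed in advance.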
- Tried with inferSchema; still not working:
df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
But is there any other way that does not require specifying a schema, given that the number of columns varies?
.option("inferSchema", "true")
i.e. sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
– Artem Trunov