When loading a CSV with the Databricks spark-csv package, the 4th column of the 2nd row is not loaded. The number of columns in the CSV varies per row.
In test_01.csv:
a,b,c
s,d,a,d
f,s
I loaded the above CSV file via spark-csv as below:
>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
| C0| C1| C2|
| a| b| c|
| s| d| a|
| f| s|null|
- Tried loading with sc.textFile:
rdd = sc.textFile("sample_files/test_01.csv")
rdd.collect()
[u'a,b,c', u's,d,a,d', u'f,s']
But converting the above RDD to a DataFrame causes an error.
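The error itself is not shown, but a likely cause is that the split rows have different lengths, so they cannot be mapped onto one fixed set of columns. A minimal sketch of one workaround, assuming a plain comma separator with no quoted fields: pad every row with None up to the widest row before building the DataFrame. The helper name pad_rows is hypothetical, not part of Spark.

```python
def pad_rows(lines, sep=","):
    """Split each line on `sep` and right-pad short rows with None.

    Hypothetical helper for illustration; assumes no quoted separators.
    """
    rows = [line.split(sep) for line in lines]
    # Width of the widest row; 0 when there are no lines at all.
    width = max(len(row) for row in rows) if rows else 0
    return [row + [None] * (width - len(row)) for row in rows]


# Same three lines as the textFile output above.
lines = ["a,b,c", "s,d,a,d", "f,s"]
padded = pad_rows(lines)
```

For a file small enough to collect, the padded rows could then be handed to sqlContext.createDataFrame together with a list of column names or a schema; for larger files the same padding would have to run inside rdd.map with a width computed beforehand.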
- I was able to solve it by specifying the schema as below:
df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
| e1| e2| e3| e4| e5|
| a| b| c|null|null|
| s| d| a| d|null|
| f| s|null|null|null|
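The definition of schema is not shown above. A sketch of what it might look like, assuming five nullable string columns named e1 to e5 to match the output; the helpers column_names and build_schema are hypothetical names used only for illustration.

```python
def column_names(width, prefix="e"):
    """Generate names such as e1, e2, ... for `width` columns."""
    return ["%s%d" % (prefix, i) for i in range(1, width + 1)]


def build_schema(width):
    """Build a schema of `width` nullable string columns.

    pyspark is imported lazily so column_names() works without Spark.
    """
    from pyspark.sql.types import StructType, StructField, StringType
    return StructType(
        [StructField(name, StringType(), True) for name in column_names(width)]
    )


# Five columns, assumed from the e1..e5 header in the output above.
names = column_names(5)
```

With Spark available, schema = build_schema(5) could be passed to .schema(schema) as in the load call above. Declaring every column as a nullable string is the assumption that lets short rows fill up with null.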
- Tried with inferSchema; still not working:
df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
| C0| C1| C2|
| a| b| c|
| s| d| a|
| f| s|null|
But is there any other way that does not require specifying a schema by hand, since the number of columns varies?
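One possible direction, sketched under assumptions rather than offered as a confirmed spark-csv feature: make a first pass over the raw lines to find the widest row, then use that number to decide how many columns to declare. The function names count_fields and widest_row are hypothetical; Python's csv module is used so that quoted commas are not counted as separators.

```python
import csv


def count_fields(line):
    """Count the fields in one CSV line, respecting quoted separators."""
    return len(next(csv.reader([line])))


def widest_row(lines):
    """Return the largest field count across all lines, or 0 if empty."""
    widths = [count_fields(line) for line in lines]
    return max(widths) if widths else 0


# Same three lines as in test_01.csv.
sample = ["a,b,c", "s,d,a,d", "f,s"]
width = widest_row(sample)
```

On an RDD the same count could be taken with sc.textFile(path).map(count_fields).max(), at the cost of reading the file twice. The resulting width would then set the number of columns in a generated schema, so nothing has to be hard-coded per file.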
.option("inferSchema", "true")
i.e. sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
– Artem Trunov