I’ve got a pipe-delimited text file without a header, and the rows have different numbers of columns (some rows are type A with 400 columns, others are type B with 200), so I need to separate them first:
val textFileRaw = sc.textFile("./data.txt")
val textFile = textFileRaw.map(line => line.split("\\|", -1))
val dataA = textFile.filter(line => line(0) == "A")
val dataB = textFile.filter(line => line(0) == "B")
Now I'd like to convert these RDDs into Spark DataFrames, but after the split each row is a single Array[String], so toDF sees just one column rather than 400 or 200 distinct ones, and I get the following error:
// ANames are my column names, length = 400
val ANames = Array("Row ID", "City", "State", ...)
val dataADF = dataA.toDF(ANames: _*)
Name: java.lang.IllegalArgumentException
Message: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (400): Row ID, City, State ...
This question runs into the same problem, but all the answers there suggest manually specifying a mapping from array to tuple, which isn't practical with hundreds of columns (sketched below).
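To illustrate, that tuple-mapping approach would look something like this, cut down to the first three fields (the truncation and the variable name are mine; it relies on the same implicits that make .toDF available above):

val dataASmallDF = dataA
  .map(a => (a(0), a(1), a(2)))   // one tuple element per column
  .toDF("Row ID", "City", "State")

With 400 columns that means writing a(0) through a(399) out by hand, and Scala tuples top out at 22 elements anyway, so a tuple can't even express a full type-A row.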
I think I could get this to work with Spark's CSV loader, but that won't accept my data as-is, since the rows have different numbers of fields (it's not a well-formed CSV file). A work-around would be to split the data first, write each row type back out as a well-formed delimited file, and then use the CSV loader, but I'd like to avoid that extra write-and-read round trip if possible. How can I convert these RDDs into DataFrames with named columns?
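For reference, the write-and-reload work-around would look roughly like this; the output path is made up, and I'm assuming Spark 2.x with a SparkSession named spark (on 1.x it would be the spark-csv package instead):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Re-join each type-A row and write it back out as a well-formed pipe-delimited file
dataA.map(_.mkString("|")).saveAsTextFile("./dataA_clean")

// Read everything back as strings; a real schema would carry proper types per column
val schemaA = StructType(ANames.map(name => StructField(name, StringType, nullable = true)))
val dataADF = spark.read
  .option("delimiter", "|")
  .schema(schemaA)
  .csv("./dataA_clean")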