Was parsing a fixed width file and was running into some issues. Figured, I would just drop the first row which contains column names and then cast my own schema while setting the lengths for each column.
rdd = sc.textFile("file.txt")
data = rdd.filter(lambda x: not x.startswith("ID"))
schema = StructType([StructField(str(i), StringType(), True) for i in range(22)])
df = sqlContext.createDataFrame(data, schema)
df.printSchema()
root
|-- 0: string (nullable == true)
|-- 1: string (nullable == true)
etc
When I try to do something like:
df2 = df.select(
df.0.(1,500).alias("ID"),
df.1.(501, 510).alias("LOGS")
)
or
df2 = df.select(
df.value.substr(1,500).alias("ID"),
df.value.substr(501, 510).alias("LOGS")
)
It won't work because it doesn't seem to like the numerical columnn names which are automatically assigned after stripping out the first row. Typically I take care of these types of parsing issuesusing the second method however, being a fixed-width file seems to be causing an problems with any method I have tried in the past.
Essentially I need to be able to overlay the schema, set the lengths of each field for all 22 columns and ignore the first row with contains the file header and I'm unsure of the best way to tackle this.