0
votes

Was parsing a fixed width file and was running into some issues. Figured, I would just drop the first row which contains column names and then cast my own schema while setting the lengths for each column.

rdd = sc.textFile("file.txt")
data = rdd.filter(lambda x: not x.startswith("ID"))
schema = StructType([StructField(str(i), StringType(), True) for i in range(22)])
df = sqlContext.createDataFrame(data, schema)
df.printSchema()

root
|-- 0: string (nullable == true)
|-- 1: string (nullable == true)

etc

When I try to do something like:

df2 = df.select(
df.0.(1,500).alias("ID"),
df.1.(501, 510).alias("LOGS")
)

or

df2 = df.select(
df.value.substr(1,500).alias("ID"),
df.value.substr(501, 510).alias("LOGS")
)

It won't work because it doesn't seem to like the numerical columnn names which are automatically assigned after stripping out the first row. Typically I take care of these types of parsing issuesusing the second method however, being a fixed-width file seems to be causing an problems with any method I have tried in the past.

Essentially I need to be able to overlay the schema, set the lengths of each field for all 22 columns and ignore the first row with contains the file header and I'm unsure of the best way to tackle this.

1
You will have to use df['0'] and df['1'] to refer to columns - sramalingam24
On the other hand, what does your data look like, maybe post a sample - sramalingam24

1 Answers

0
votes

What is the error you get? i see no problem in column names because you are casting them to string with str function. The problem is with this code.

df2 = df.select(
df.value.(1,500).alias("ID"),
df.value.(501, 510).alias("LOGS")
)

In order to get a string from a string use substr function. So, your code will be

df2 = df.select(
df.value.substr(1,500).alias("ID"),
df.value.substr(501, 510).alias("LOGS")
)