When I try to convert my RDD to a DataFrame in Spark, I get the following exception: "Can not infer schema for type: <type 'unicode'>"

Example:

>> rangeRDD.take(1).foreach(println)
(301,301,10)
>> sqlContext.inferSchema(rangeRDD)
Can not infer schema for type: <type 'unicode'>

Any pointers on how to fix it? I even tried supplying the schema myself via sqlContext.createDataFrame(rdd, schema):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("x", IntegerType(), True),
    StructField("y", IntegerType(), True),
    StructField("z", IntegerType(), True)])
df = sqlContext.createDataFrame(rangeRDD, schema)
print(df.first())

but that ended in a runtime error: ValueError: Unexpected tuple u'(301,301,10)' with StructType

1 Answer

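Try parsing the data first. Each element of rangeRDD is a unicode string such as u'(301,301,10)', not a tuple of integers, so Spark can neither infer a schema from it nor match it against the StructType: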

>>> rangeRDD = sc.parallelize([ u'(301,301,10)'])
>>> tupleRangeRDD = rangeRDD.map(lambda x: x[1:-1]) \
...                        .map(lambda x: x.split(",")) \
...                        .map(lambda x: [int(y) for y in x])
>>> df = sqlContext.createDataFrame(tupleRangeRDD, schema)
>>> df.first()
Row(x=301, y=301, z=10)
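
The three chained map calls can also be folded into one small helper, which is easier to test outside Spark. This is only a sketch: parse_triple is a made-up name, and it assumes every record looks like "(a,b,c)" with integer fields and no nested parentheses.

```python
def parse_triple(line):
    """Turn a string such as u'(301,301,10)' into a tuple of ints.

    Hypothetical helper, not part of Spark. Assumes the record is a
    parenthesised, comma-separated list of integers.
    """
    # Drop surrounding whitespace, then the enclosing parentheses.
    fields = line.strip().strip("()").split(",")
    # int() tolerates spaces around each number.
    return tuple(int(field) for field in fields)


if __name__ == "__main__":
    print(parse_triple(u"(301,301,10)"))
    # With a SparkContext available, the helper would be applied per record:
    #   tupleRangeRDD = rangeRDD.map(parse_triple)
    #   df = sqlContext.createDataFrame(tupleRangeRDD, schema)
```

A malformed record makes int() raise ValueError inside the Spark task, so dirty input may need to be filtered or wrapped in a try/except before the DataFrame is built.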