
This was supposed to be a simple test: move the first row of my dataframe into a new dataframe.

First issue: df.first() returns a Row, not a dataframe. Next problem: when I try to use spark.createDataFrame(df.first()), it tells me it cannot infer the schema.

Next problem: spark.createDataFrame(df.first(), df.schema) does not work either.
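
Here is a minimal sketch of what I mean (a toy dataframe, not my real data; the calls mirror what I tried):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in for my real dataframe
df = spark.createDataFrame([('a', 1), ('b', 2)], ['col1', 'col2'])

row = df.first()                          # returns a Row, not a DataFrame
# spark.createDataFrame(row)              # fails: can not infer schema
# spark.createDataFrame(row, df.schema)   # fails: StructType can not accept object ... in type <class 'str'>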

So, for the original schema below:

root
 |-- entity_name: string (nullable = true)
 |-- field_name: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data_row: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data_schema: array (nullable = true)
 |    |-- element: string (containsNull = true)

I defined the schema in code thus:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

xyz_schema = StructType([
    StructField('entity_name', StringType(), True),
    StructField('field_name', ArrayType(StringType(), True), True),
    StructField('data_row', ArrayType(StringType(), True), True),
    StructField('data_schema', ArrayType(StringType(), True), True)
])

print(xyz.first())
xyz_1stRow = spark.createDataFrame(xyz.first(), xyz_schema)

The above does not work! I get the following error:

"TypeError: StructType can not accept object 'parquet/assignment/v1' in type <class 'str'>"

This is what the print shows me:

Row(entity_name='parquet/assignment/v1', field_name=['Contract_ItemNumber', 'UPC', 'DC_ID', 'AssignDate', 'AssignID', 'AssignmentQuantity', 'ContractNumber', 'MaterialNumber', 'OrderReason', 'RequirementCategory', 'MSKU'], data_row=['\n
350,192660436296,2001,10/1/2019,84009248020191000,5,840092480,1862291010,711,V1\n\t\t\t\t\t', '\n
180,191454773838,2001,10/1/2019,84009248020191000,6,840092480,1791301010,711,V1\n\t\t\t\t\t'], data_schema=['StringType', 'StringType', 'StringType', None, 'StringType', 'IntegerType', 'StringType', 'StringType', 'StringType', 'StringType', 'StringType'])

What am I doing wrong? Why does a StringType not accept a string?

I'm working in pyspark (current version) with Azure Databricks. I'd prefer to stay with pyspark (not R, not Scala) and not have to convert to pandas and risk corrupting my data converting between all these languages.


1 Answer


According to the documentation, the createDataFrame function takes an RDD, a list or a pandas.DataFrame and creates a dataframe from it. Therefore you have to wrap the result of df.first() in square brackets to make it a list containing a single Row. Have a look at the example below:

df = spark.createDataFrame(
    [('Galaxy', 2017, 27841, 17529),
     ('Galaxy', 2017, 29395, 11892),
     ('Novato', 2018, 35644, 22876),
     ('Novato', 2018, 8765,  54817)],
    ['model','year','price','mileage']
)

bla = spark.createDataFrame([df.first()])
bla.show()

Output:

+------+----+-----+-------+ 
| model|year|price|mileage| 
+------+----+-----+-------+ 
|Galaxy|2017|27841|  17529| 
+------+----+-----+-------+
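
Applied to your case (a sketch, assuming xyz and xyz_schema are defined as in your question), the same idea should also work together with your explicit schema:

xyz_1stRow = spark.createDataFrame([xyz.first()], xyz_schema)
xyz_1stRow.show(truncate=False)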