2 votes

I have my data in HDFS and its schema in MySQL. I'm able to fetch the schema into a DataFrame, and it looks like this:

col1,string
col2,date
col3,int
col4,string

How to read this schema and assign it to data while reading from HDFS?

I will be reading the schema from MySQL. It will differ between datasets, so I need a dynamic approach: for any dataset, fetch the schema details from MySQL -> convert them into a schema -> apply it to the dataset.
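One way to make this dynamic is to turn the fetched (column, type) pairs into a DDL-style schema string, which Spark can consume directly. A minimal sketch, assuming the schema rows arrive as plain (name, type) tuples (e.g. collected from a JDBC read of the MySQL table; the names and path below are hypothetical):

```python
# Hypothetical schema rows fetched from MySQL for one dataset.
rows = [("col1", "string"), ("col2", "date"), ("col3", "int"), ("col4", "string")]

# Build a DDL-style schema string ("name type, name type, ...").
ddl = ", ".join(f"{name} {dtype}" for name, dtype in rows)
print(ddl)  # col1 string, col2 date, col3 int, col4 string

# With a SparkSession, this string can be passed to the reader directly
# (DataFrameReader.schema accepts a DDL string since Spark 2.3):
# df = spark.read.schema(ddl).csv("hdfs://path/to/dataset")
```

Because the string is rebuilt from whatever rows MySQL returns, the same code works for any dataset's schema.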

I guess this is relevant to this stackoverflow.com/questions/39355149/… - Mohamed El-Touny
You can create a dataframe with the data of which schema you need. Create a variable with the schema, like schemadf = df.schema, and while creating a new one just pass this variable, like newDf = spark.createDataFrame(data, schemadf) - whatsinthename
Hi, as I will be reading the schema from MySQL, it will differ between datasets. I need a dynamic approach where, for any dataset, I can fetch schema details from MySQL -> convert them into a schema -> and then apply it to the dataset. - isha

1 Answer

0 votes

You can use PySpark's internal helper function _parse_datatype_string (note the leading underscore: it is not part of the public API, so it could change between versions):

from pyspark.sql.types import _parse_datatype_string

df = spark.createDataFrame([
  ["col1,string"],
  ["col2,date"],
  ["col3,int"]
], ["schema"])

str_schema = ",".join(map(lambda c: c["schema"].replace(",", ":"), df.collect()))
# col1:string,col2:date,col3:int

final_schema = _parse_datatype_string(str_schema)

# StructType(List(StructField(col1,StringType,true),StructField(col2,DateType,true),StructField(col3,IntegerType,true)))

_parse_datatype_string expects a DDL-formatted string, e.g. col1:string,col2:int, hence we first need to replace , with : in each row and then join the rows together separated by commas. The function returns an instance of StructType, which will be your final schema.
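The string-massaging step above doesn't actually need Spark; it can be sketched (and checked) with plain Python strings, assuming each row is a "name,type" string as in the question:

```python
# Each row is a "name,type" string, as fetched from the MySQL schema table.
rows = ["col1,string", "col2,date", "col3,int"]

# Replace the comma inside each row with a colon, then join rows with commas
# to get the DDL-style string that _parse_datatype_string accepts.
str_schema = ",".join(r.replace(",", ":") for r in rows)
print(str_schema)  # col1:string,col2:date,col3:int
```

Passing the result to _parse_datatype_string (or to spark.read.schema, which also accepts DDL strings) then yields the StructType to apply while reading from HDFS.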