I am reading parquet files in bulk from an S3 bucket in PySpark. Some of the parquet files have a different schema, and this is causing the job to fail. I want to pass a pre-defined schema so that the Spark job reads only the files matching that schema.
data = spark.read.parquet(*path_list)
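For reference, this is how I would attach a pre-defined schema to the read (a minimal sketch; the fields in predefined_schema are placeholders for my actual columns):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical pre-defined schema; the real one matches my "good" files.
predefined_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# .schema() forces this schema onto the read, but as far as I can tell it
# does not skip files whose own schema differs -- incompatible files still
# make the job fail (or yield null columns), which is exactly my problem.
data = spark.read.schema(predefined_schema).parquet(*path_list)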
The spark read command above reads the files in bulk. How can I pass a pre-defined schema so that only the parquet files matching that schema are read? The restriction is that I need to achieve this as a bulk load, i.e. by passing the list of files (path_list) to the spark read parquet command.
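One workaround I am considering (a sketch, untested at scale; expected_schema is a placeholder for my pre-defined schema): pre-filter path_list by comparing each file's footer schema to the pre-defined one, then do a single bulk read over only the matching paths.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical pre-defined schema for illustration.
expected_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

matching_paths = []
for path in path_list:
    # Accessing .schema only inspects the parquet footer; no data rows
    # are materialized at this point.
    file_schema = spark.read.parquet(path).schema
    # Strict equality; in practice I may need to relax nullability or
    # field order when comparing.
    if file_schema == expected_schema:
        matching_paths.append(path)

# Single bulk load restricted to the files whose schema matched.
data = spark.read.parquet(*matching_paths)

This still touches each S3 object once for its footer, which could be slow with many files, so I would prefer something built into the bulk read itself if that exists.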