
In this documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-parquet

it mentions: "any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter."

However, how can I find out what those options are? There's no clear mapping between the Glue code and the SparkSQL code.

(Specifically, I want to figure out how to control the size of the resulting parquet files)

Unfortunately there is no such option to control the size of parquet files. There is a trick using coalesce, though. - Yuriy Bondaruk
Yeah :/ Apparently the closest we can get is to call repartition(n) prior to writing out, which will then produce n files (per partition-key combination, if you're also using partition keys). - Narfanator
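
A minimal sketch of that repartition approach, assuming a Glue PySpark job where `glueContext` and a DynamicFrame named `dynamic_frame` already exist (both names, and the partition count, are placeholders):

```python
from awsglue.dynamicframe import DynamicFrame

# Round-trip through a Spark DataFrame to repartition before writing.
# With repartition(10), the subsequent write produces roughly 10 output
# files (per partition-key combination, if partition keys are used).
df = dynamic_frame.toDF().repartition(10)
repartitioned = DynamicFrame.fromDF(df, glueContext, "repartitioned")
```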

1 Answer


SparkSQL options for the various data sources can be looked up in the DataFrameWriter documentation (in the Scala or PySpark docs). The parquet data source appears to accept only the compression parameter when writing. For SparkSQL options when reading data, have a look at the DataFrameReader class.
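
For instance, in plain PySpark the compression option is passed through DataFrameWriter; the input and output paths here are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://my-bucket/input/")  # placeholder input

# compression is the main parquet-specific option on DataFrameWriter.
df.write.option("compression", "snappy").parquet("s3://my-bucket/output/")
```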

To control the size of your output files, you should adjust the parallelism, as @Yuriy Bondaruk commented, using for example the coalesce function.
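
A sketch of how that might look in a Glue job, with placeholder paths and names; it reads a DynamicFrame, coalesces the underlying DataFrame, and converts back before writing:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Placeholder read step; in a real job this would match your source.
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="json",
)

# coalesce(1) merges everything into one partition, so the write
# produces a single parquet file. Unlike repartition(n), coalesce
# avoids a full shuffle, but it can leave most executors idle.
coalesced = DynamicFrame.fromDF(dynamic_frame.toDF().coalesce(1),
                                glueContext, "coalesced")

glueContext.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)
```

Note that coalesce(1) forces all data through a single partition, so for large datasets a larger partition count (or repartition) is usually the safer choice.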