I have a list of files in Parquet format that I load and merge into a single DataFrame in PySpark:

from functools import reduce

paths = ['file1', 'file2', 'file3']
df_list = map(lambda x: spark.read.parquet(x), paths)
df = reduce(lambda df1, df2: df1.unionAll(df2), df_list)

I would like to do the same operation in Scala. However, when I use a map operation on a Scala list of paths

val df_list = map(x = > (spark.read.parquet(x)), paths)

I am getting the following error:

:139: error: overloaded method value parquet with alternatives:
  (paths: String*)org.apache.spark.sql.DataFrame
  (path: String)org.apache.spark.sql.DataFrame
cannot be applied to (List[String])
       val df_list = map(x = > (spark.read.parquet(x)), paths)

Any suggestions to resolve the issue would be appreciated.

2 Answers

1 vote

Try this:

val df_list = paths.map(x => spark.read.parquet(x))
val df = df_list.reduce(_.union(_))

The issue is that in Scala, map and reduce are methods on collections, so you call them on the list itself rather than passing the list as an argument, as in Python's built-in map and functools.reduce.
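
For completeness, here is a minimal self-contained sketch, assuming a standard SparkSession named spark and the same hypothetical paths as the question. Note that union matches columns by position, so this only makes sense when all the files share the same schema:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("merge-parquet").getOrCreate()

// hypothetical paths from the question
val paths = List("file1", "file2", "file3")

// map over the collection itself, then fold the resulting DataFrames together
val df_list: List[DataFrame] = paths.map(path => spark.read.parquet(path))
val df: DataFrame = df_list.reduce(_.union(_))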

2 votes

The preferred approach here is to skip the union entirely and load the data directly with varargs:

spark.read.parquet(paths: _*)
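
This works because, as the error message itself shows, parquet has a (paths: String*) overload; the : _* ascription expands a Scala collection into varargs so the overload applies. A quick sketch with the hypothetical paths from the question:

val paths = List("file1", "file2", "file3")

// ": _*" expands the list into the (paths: String*) overload of parquet,
// so Spark reads all files into one DataFrame in a single call
val df = spark.read.parquet(paths: _*)

Besides being shorter, this lets Spark plan a single read over all the files instead of unioning one DataFrame per file.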