I want to read multiple parquet files from a folder that also contains some other file types (csv, avro) into a DataFrame. I want to read a file only if it is parquet, and skip to the next one otherwise. The problem is that a parquet file might not have an extension, and the codec might also vary from file to file. Is there a way to do this in Spark/Scala?
1 Answer
You can get the filenames beforehand in the following way:
import org.apache.spark.sql.DataFrame
import scala.sys.process._

// Shell out to `hdfs dfs -ls`, keep only the lines ending in ".parquet",
// and take the last whitespace-separated token of each line (the file path).
val fileNames: List[String] = "hdfs dfs -ls /path/to/files/on/hdfs".!!
  .split("\n")
  .filter(_.endsWith(".parquet"))
  .map(_.split("\\s+").last)
  .toList

val df: DataFrame = spark.read.parquet(fileNames: _*)
spark in the above code is the SparkSession object. This should work for Spark 1.x versions as well, since the method signature for parquet() is the same in Spark 1.x and Spark 2.x; on 1.x, where there is no SparkSession, use sqlContext.read.parquet instead.
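If you would rather not parse shell output, the same listing can be done with Hadoop's FileSystem API. A minimal sketch, assuming the same placeholder directory as above:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Ask Hadoop for the directory listing directly instead of shelling out.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parquetPaths: List[String] = fs
  .listStatus(new Path("/path/to/files/on/hdfs"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
  .toList

val df: DataFrame = spark.read.parquet(parquetPaths: _*)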
If you point spark.read.parquet(path) at the whole folder, it will throw an exception when it encounters any other file type. The best way is to modify the flow so that every file has an extension, then iterate through the files and use filter to read only the parquet ones. If you can't do that, still iterate through the files, but use try-catch to skip the ones that throw an exception, as in my answer to this question: stackoverflow.com/a/51042091/3000244 – Shikkou
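Here is a minimal sketch of the try-catch approach Shikkou describes, assuming the same placeholder directory as above and that all parquet files share a compatible schema:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame
import scala.util.Try

// List every file in the folder, with or without an extension.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val allFiles: List[String] = fs
  .listStatus(new Path("/path/to/files/on/hdfs"))
  .map(_.getPath.toString)
  .toList

// spark.read.parquet reads the footer eagerly to infer the schema,
// so a csv or avro file fails right here and Try lets us skip it.
val parquetDfs: List[DataFrame] = allFiles.flatMap { p =>
  Try(spark.read.parquet(p)).toOption
}

// Assumes at least one parquet file was found.
val df: DataFrame = parquetDfs.reduce(_ union _)

Varying codecs are not a problem here: parquet records the compression codec in each file's metadata, so the reader handles it per file.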
You can also read nested folders with a glob, like spark.read.parquet("/foldername/*/*/*.parquet"); each * matches one level of nested folders, so if you have more than two nested levels, add another *, like /*/*/*/*.parquet – Yogesh
Or pass an explicit list of paths: val path = List("path1", "path2", ...., "pathn"); spark.read.parquet(path: _*) – Yogesh