I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
9 Answers
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true)
And with Spark...
FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
Edit
It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme.
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
Here's a PySpark version, if anyone is interested:
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table.
Other methods of the FileStatus object, such as getLen() for the file size, are described in the Hadoop API documentation for FileStatus.
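Since the question is about iterating over directories rather than files, the listing can be narrowed down with FileStatus.isDirectory(). Here is a minimal sketch. The names subdirectories and FakeStatus are made up for this example, and FakeStatus exists only so that the snippet runs without a cluster:

```python
def subdirectories(statuses):
    """Yield the path of every entry that is a directory.

    `statuses` is any iterable of FileStatus-like objects, i.e. objects
    exposing isDirectory() and getPath(), such as the result of
    fs.get(conf).listStatus(path) in the snippet above.
    """
    for status in statuses:
        if status.isDirectory():
            yield status.getPath()


class FakeStatus:
    """Hypothetical stand-in for a Hadoop FileStatus (local testing only)."""

    def __init__(self, path, is_dir):
        self._path = path
        self._is_dir = is_dir

    def isDirectory(self):
        return self._is_dir

    def getPath(self):
        return self._path


# Local dry run with stand-in objects instead of a real HDFS listing
entries = [
    FakeStatus("/data/a", True),
    FakeStatus("/data/readme.txt", False),
    FakeStatus("/data/b", True),
]
for d in subdirectories(entries):
    print(d)
```

On a real cluster the call would look like subdirectories(fs.get(conf).listStatus(path)), with fs, conf and path as defined above.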
@Tagar didn't say how to connect to a remote HDFS, but this answer did:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
for fileStatus in status:
    print(fileStatus.getPath())
I had some issues with the other answers (such as 'JavaObject' object is not iterable), but this code works for me:
fs = spark_contex._jvm.org.apache.hadoop.fs.FileSystem.get(spark_contex._jsc.hadoopConfiguration())
i = fs.listFiles(spark_contex._jvm.org.apache.hadoop.fs.Path(path), False)
# listFiles returns a Java RemoteIterator, so step through it with hasNext()/next()
while i.hasNext():
    f = i.next()
    print(f.getPath())
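The "not iterable" error shows up because listFiles returns a Java RemoteIterator, which only offers hasNext() and next() and cannot be used in a Python for loop directly. If you need this in more than one place, a small generator can hide the while loop. This is only a sketch: iterate_remote and FakeRemoteIterator are names invented here, and the fake class is just a stand-in so the snippet runs without a JVM:

```python
def iterate_remote(remote_iterator):
    """Wrap a Java-style iterator (hasNext()/next()) in a Python generator."""
    while remote_iterator.hasNext():
        yield remote_iterator.next()


class FakeRemoteIterator:
    """Hypothetical stand-in for Hadoop's RemoteIterator (local testing only)."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._items)

    def next(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


# Local dry run with the stand-in iterator
for name in iterate_remote(FakeRemoteIterator(["part-00000", "part-00001"])):
    print(name)
```

With a real FileSystem object this would be used as: for f in iterate_remote(fs.listFiles(some_path, False)): print(f.getPath()).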
Scala FileSystem (Apache Hadoop Main 3.2.1 API)
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fileSystem: FileSystem = {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://to_file_path")
  FileSystem.get(conf)
}
// `path` is the directory to list, defined elsewhere as a String
val files = fileSystem.listFiles(new Path(path), false)
val filenames = ListBuffer[String]()
while (files.hasNext) filenames += files.next().getPath().toString()
filenames.foreach(println(_))
You can use the code below to iterate recursively through a parent HDFS directory, keeping only the sub-directories at the third level. This is useful if you need to list all the directories created by partitioning the data (in the code below, three columns were used for partitioning):
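import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Collect the immediate sub-directories of every path in the list
def rememberDirectories(fs: FileSystem, path: List[Path]): List[Path] = {
  val buff = new ListBuffer[LocatedFileStatus]()
  path.foreach(p => {
    val iter = fs.listLocatedStatus(p)
    while (iter.hasNext()) buff += iter.next()
  })
  buff.toList.filter(p => p.isDirectory).map(_.getPath)
}

// Descend one level per call and return the directories found at level three
@tailrec
def getRelevantDirs(fs: FileSystem, p: List[Path], counter: Int = 1): List[Path] = {
  val levelList = rememberDirectories(fs, p)
  if (counter == 3) levelList
  else getRelevantDirs(fs, levelList, counter + 1)
}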
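For PySpark users, the same level-by-level descent can be sketched in plain Python. This is an illustration of the idea and not a translation of the Hadoop calls: dirs_at_depth, list_subdirs and the dictionary-based tree are all made up for this example. In a real job, list_subdirs would be a function that calls listStatus on the Hadoop FileSystem and keeps only the entries where isDirectory() is true.

```python
def dirs_at_depth(list_subdirs, roots, depth=3):
    """Return the directories exactly `depth` levels below `roots`.

    `list_subdirs` is a caller-supplied function that takes one directory
    and returns its immediate sub-directories.
    """
    level = list(roots)
    for _ in range(depth):
        # Replace the current level with all of its sub-directories
        level = [child for parent in level for child in list_subdirs(parent)]
    return level


# Hypothetical in-memory directory tree, used only to exercise the function
tree = {
    "/t": ["/t/y=2020", "/t/y=2021"],
    "/t/y=2020": ["/t/y=2020/m=1"],
    "/t/y=2020/m=1": ["/t/y=2020/m=1/d=1", "/t/y=2020/m=1/d=2"],
    "/t/y=2021": [],
}


def list_from_tree(path):
    return tree.get(path, [])


for d in dirs_at_depth(list_from_tree, ["/t"]):
    print(d)
```

Passing the listing function in as an argument keeps the traversal independent of how the directories are actually listed, so the same loop works with a local stand-in or with a Py4J-backed lister.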