After reading the Spark source code, specifically AlterTableRecoverPartitionsCommand in org.apache.spark.sql.execution.command.ddl.scala (the Spark implementation of ALTER TABLE RECOVER PARTITIONS), I saw that it scans all the partition directories and then registers them.
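For reference, that built-in command can be run directly when the table is already registered in the metastore; a minimal sketch, where my_table is a hypothetical partitioned table name:

// Invokes AlterTableRecoverPartitionsCommand under the hood;
// "my_table" is a placeholder for an existing partitioned table.
sparkSession.sql("ALTER TABLE my_table RECOVER PARTITIONS");
// MSCK REPAIR TABLE my_table is the equivalent Hive-compatible syntax.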
So, the same idea applies here: scan all the partition directories under the location we just wrote to, get the directory names, and extract each partition's name and value from them (a sketch of that extraction follows the listing snippet below).
Here is a code snippet that lists the partition directories under that path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// To discover all partitions, point this at the table's root
// directory rather than at a single partition directory.
String location = "s3n://somebucket/somefolder/dateid=20171010/";
Path root = new Path(location);
Configuration hadoopConf = sparkSession.sessionState().newHadoopConf();
FileSystem fs = root.getFileSystem(hadoopConf);
JobConf jobConf = new JobConf(hadoopConf, this.getClass());
final PathFilter pathFilter = FileInputFormat.getInputPathFilter(jobConf);
// List the children of the directory, skipping Spark/Hadoop
// bookkeeping entries (_SUCCESS, _temporary, hidden files).
FileStatus[] fileStatuses = fs.listStatus(root, path -> {
    String name = path.getName();
    // Java string comparison must use equals(), not == / !=.
    if (!name.equals("_SUCCESS") && !name.equals("_temporary") && !name.startsWith(".")) {
        return pathFilter == null || pathFilter.accept(path);
    } else {
        return false;
    }
});
for (FileStatus fileStatus : fileStatuses) {
    System.out.println(fileStatus.getPath().getName());
}
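Each listed directory name is a key=value pair, so the partition column and its value can be split out on the first '='. A minimal sketch of that extraction, continuing from the snippet above and assuming root points at the table's root directory; my_table is a hypothetical table name, and escaping/validation of values is omitted:

// For each partition directory like "dateid=20171010",
// split on the first '=' to get the partition column and value.
for (FileStatus fileStatus : fileStatuses) {
    String dirName = fileStatus.getPath().getName();
    int eq = dirName.indexOf('=');
    if (eq < 0) {
        continue; // not a partition directory
    }
    String partitionKey = dirName.substring(0, eq);
    String partitionValue = dirName.substring(eq + 1);
    // Register the partition; "my_table" is a placeholder, and real
    // code should escape partitionValue before building the SQL string.
    sparkSession.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION ("
        + partitionKey + "='" + partitionValue + "')");
}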
Note: ds.write().partitionBy(...).save(...) runs first, so the data has already been written out to all the partition directories; I didn't figure out a way to get the partition list directly from that write, though (a sketch of the write itself is below). - Yang Jian
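For context, a minimal sketch of the write that lays out those key=value directories in the first place; the column name and output path are taken from the example above, and ds is assumed to be a Dataset&lt;Row&gt; containing a dateid column:

// partitionBy creates one subdirectory per distinct dateid value
// under the target location, e.g. .../dateid=20171010/.
ds.write()
  .partitionBy("dateid")
  .save("s3n://somebucket/somefolder/");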