7 votes

In my project, I have three input files and pass their names as args(0) to args(2); I also pass an output filename as args(3). In the source code, I use

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing else to the log, just save it as a text file using

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it still saves to three files, part-00000, part-00001, and part-00002. So is there any way I can save the three input files into a single output file?

2
Is this your complete program? It looks okay; you should end up with something like one file part-00000 and a _SUCCESS file in the output directory. Note that the argument to saveAsTextFile is actually a directory name, where the output is saved. – lpiepiora
Thanks a lot for your reply! In fact I do perform some actions on the log. I just tried it and found that it works, so maybe there is an error somewhere in my project; I will look into it! – kemiya
I just tried this myself and I end up with only a single output. Are you running Spark locally or on a cluster? – Mike Park

2 Answers

2 votes

Having multiple output files is standard behavior for multi-machine frameworks like Hadoop and Spark. The number of output files depends on the number of reducers (in Spark, on the number of partitions of the RDD being saved).

How to "solve" it in Hadoop: merge output files after reduce phase

How to "solve" it in Spark: how to make saveAsTextFile NOT split output into multiple file?

You can also find good information here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html

So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly, your code does work when you run it locally (as @climbage mentioned in his comment).
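As a sketch of that approach (assuming the same argument layout as in the question: input paths first, output path last), you could union all inputs into one RDD and coalesce before saving:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("merge-logs"))

// Read every input path and union them into a single RDD
val inputs = (0 until args.length - 1).map(i => sc.textFile(args(i)))
val all    = inputs.reduce(_ union _)

// shuffle = true lets upstream tasks push their data to the single
// output partition instead of one task pulling everything itself
all.coalesce(1, shuffle = true).saveAsTextFile(args(args.length - 1))
```

Note that coalesce(1) funnels all data through a single task, so this only makes sense for modest data sizes.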

What you might try is to read the files first and then save the output.

...
val sc = new SparkContext()
val sb = new StringBuilder
for(i <- 0 until args.size - 1){
   // collect() pulls every line to the driver, so this works for small files only
   val file = sc.textFile(args(i))
   file.collect().foreach(line => sb.append(line).append("\n"))
}
//and now you might save the content as a single file
sc.parallelize(Seq(sb.toString), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files; I would process the multiple output files instead.
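If you do leave the output split, merging the part files afterwards is straightforward. A minimal sketch (assuming the job wrote its part files into a local directory named out/; the setup lines just fake such a directory for the demo):

```shell
# Demo setup: fake a Spark output directory with two part files
mkdir -p out
printf 'line1\n' > out/part-00000
printf 'line2\n' > out/part-00001

# Concatenate the part files into one file, in partition order
cat out/part-* > merged.txt

# On HDFS, the standard Hadoop CLI offers the equivalent:
# hadoop fs -getmerge out/ merged.txt
```

This keeps the Spark job fully parallel and pays the single-file cost only once, after the fact.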

0 votes

As mentioned, your problem is somewhat unavoidable via the standard APIs, as the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

// here `data` is your RDD[String]
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))

What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would suggest not doing this in production; it works fine for local data analysis (I've used it with ~5 GB of local data).