I am running the Spark pipe function in the REPL on the EMR master node, just to test out the pipe functionality. I am using the following examples:
https://stackoverflow.com/a/32978183/8876462
http://blog.madhukaraphatak.com/pipe-in-spark/
http://hadoop-makeitsimple.blogspot.com/2016/05/pipe-in-spark.html
This is my code:
import org.apache.spark._
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
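For reference, the behavior I expect from pipe can be reproduced locally with scala.sys.process: each element goes to the external command's stdin as one line, and each line the command prints becomes one output element. Here tr is only a stand-in for PipeEx.sh, not my actual script:

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

// RDD.pipe semantics in miniature: each element of a partition is written
// as one line to the external command's stdin, and every line the command
// writes to stdout becomes one element of the result.
val input = List("asd", "xyz", "zxcz")
val stdin = new ByteArrayInputStream(input.mkString("\n").getBytes("UTF-8"))

// "tr a-z A-Z" stands in for the shell script being piped to.
val output = (Seq("tr", "a-z", "A-Z") #< stdin).lineStream_!.toList
println(output) // List(ASD, XYZ, ZXCZ)
```

This works fine locally, so the problem seems to be with distributing the script to the executors, not with the piping itself.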
I have tried different things, like making the file executable and placing the file in /usr/lib/spark/bin as suggested in another post. I also changed distScript to "file:///home/hadoop/PipeEx.sh".
I always get "No such file or directory" for the tmp/spark*/userFiles* location. I have tried to access and run the shell script from that tmp location, and it runs fine.
My shell script is the same as the one in http://blog.madhukaraphatak.com/pipe-in-spark/
Here is the first part of the log:
[Stage 9:> (0 + 2) / 2]
18/03/19 19:58:22 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 72, ip-172-31-42-11.ec2.internal, executor 9): java.io.IOException: Cannot run program "/mnt/tmp/spark-bdd582ec-a5ac-4bb1-874e-832cd5427b18/userFiles-497f6051-6f49-4268-b9c5-a28c2ad5edc6/PipeEx.sh": error=2, No such file or directory
Does anyone have any idea? I am using Spark 2.2.1 and Scala 2.11.8.
Thanks