4 votes

I am rather new to Spark and I am having trouble running a simple word count application on a standalone cluster. The cluster consists of one master node and one worker, launched on AWS using the spark-ec2 script. Everything works fine when I run the code locally with ./bin/spark-submit --class com.spark.SparkDataAnalysis --master local[*] ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount

This saves the output into the specified directory as it should.

When I try to run the application using ./bin/spark-submit --class com.spark.SparkDataAnalysis --master spark://server-ip:7077 ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount

it just keeps running and never produces a final result. The output directory gets created, but it only contains a temporary file of 0 bytes.

According to the Spark UI, it keeps running the mapToPair function indefinitely. Here is a picture of the Spark UI:

Does anyone know why this is happening and how to solve it?

Here is the code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkDataAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDataAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file (args[0]) and split each line into words
        JavaRDD<String> input = sc.textFile(args[0]);
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));

        // Count the occurrences of each word
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
                .reduceByKey((x, y) -> x + y);

        // Write the result to the output directory (args[1])
        counts.saveAsTextFile(args[1]);
    }
}
Is your bucket protected by an AWS access key and secret key? If so, how are you providing them? – Knight71
The bucket is protected. I am providing the keys by exporting them with export AWS_ACCESS_KEY_ID= and export AWS_SECRET_ACCESS_KEY= on each Spark node. – M. Persson
Can you try to explicitly set it in code? val hadoopConf = sc.hadoopConfiguration; hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"); hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey); hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey) – Knight71
I added the configuration to the code, but unfortunately the problem still remains. – M. Persson
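
For anyone trying the suggestion from the comments in Java rather than Scala, here is a minimal sketch of setting S3 credentials on the Hadoop configuration. The class name and the key/secret values are placeholders, and the fs.s3n.* property names assume the s3n:// scheme used in the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3CredentialsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("S3CredentialsSketch"));

        // Hadoop configuration used by textFile()/saveAsTextFile() for s3n:// paths
        Configuration hadoopConf = sc.hadoopConfiguration();
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        hadoopConf.set("fs.s3n.awsAccessKeyId", "<your-access-key>");       // placeholder
        hadoopConf.set("fs.s3n.awsSecretAccessKey", "<your-secret-key>");   // placeholder

        // ... run the word count as in the question ...

        sc.stop();
    }
}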

1 Answer

0 votes

I skipped using a standalone cluster via the spark-ec2 script and used Amazon EMR instead. Everything worked perfectly there.