
Please look at the attached screenshot.

I am trying to improve the performance of my Spark job, and it's taking almost 5 minutes to execute a take action on a DataFrame.

I am using take to make sure the DataFrame has some records in it; if it does, I want to proceed with further processing.

I tried both take and count and don't see much difference in execution time.
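
For reference, the check looks roughly like this (a simplified sketch; df stands in for my real, cached DataFrame):

if (df.take(1).nonEmpty) {
  // proceed with further processing
}

// count-based variant I also tried; both take roughly the same ~5 minutes
if (df.count() > 0) {
  // proceed with further processing
}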

Another scenario: it's taking around 10 minutes to write the DataFrame into a Hive table (it has at most 200 rows and 10 columns).

df.write.mode("append").partitionBy("date").insertInto(tablename)

Please suggest how we can minimize the time taken by take and by the insert into the Hive table.

[screenshot]

Updates:

Here is my spark-submit command: spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar

Code details:

It's a simple DataFrame with 8 columns and around 200 records, and I am using the following code to insert into the Hive table:

df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)

Thanks, Bab

Comments:

Are you sure there are only 200 rows? The total number of tasks is around 4k. – Rishi Saraf
Yes, it's only a maximum of 200 rows I am inserting into the Hive table. – Bab
Where are you loading the DataFrame from? Do you cache it before doing the take? – Luis Miguel Mejía Suárez
Yes, I am caching. – Bab
Could you provide more information, like the parameters of your spark-submit and part of your code? – Alex Ding

1 Answer


Don't use count before the write if it's not necessary. If your table is already created, use Spark SQL to insert the data into the Hive partitioned table:

spark.sql("INSERT INTO <tgt tbl> PARTITION (<col name>) SELECT <cols>, <partition col> FROM temp_tbl")
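
For example, something along these lines with the Spark 2.x SparkSession API (on Spark 1.6, use sqlContext and registerTempTable instead). This is only a sketch: tgt_db.tgt_tbl, partkey, the column names, and the temp_tbl view name are placeholders, not names from your job.

// Dynamic partitioning must be enabled so the partition value can come from the data
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// Register the DataFrame as a temporary view, then insert through Spark SQL;
// the partition column goes last in the SELECT list
df.createOrReplaceTempView("temp_tbl")
spark.sql(
  """INSERT INTO TABLE tgt_db.tgt_tbl PARTITION (partkey)
    |SELECT col1, col2, partkey FROM temp_tbl""".stripMargin)

Since the DataFrame has only ~200 rows, you could also try df.coalesce(1) before registering the view, so the insert runs as a handful of tasks instead of thousands.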