
Please look at the attached screenshot.

I am trying to improve the performance of my Spark job, and it's taking almost 5 minutes to execute a take action on a DataFrame.

I am using take to make sure the DataFrame has some records in it; if it does, I want to proceed with further processing.

I tried both take and count and don't see much difference in execution time.
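
For reference, the check looks roughly like this (a simplified sketch; df stands in for my real, cached DataFrame):

if (df.take(1).nonEmpty) {
  // proceed with further processing
}

// count-based variant I also tried; both take roughly the same ~5 minutes
if (df.count() > 0) {
  // proceed with further processing
}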

Another scenario: it's taking around 10 minutes to write the DataFrame into a Hive table (it has at most 200 rows and 10 columns).

df.write.mode("append").partitionBy("date").insertInto(tablename)

Please suggest how we can minimize the time taken by take and by the insert into the Hive table.

[screenshot]

Updates:

Here is my spark-submit command: spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar

Code details:

It's a simple DataFrame with 8 columns and around 200 records, and I am using the following code to insert into the Hive table:

df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)

Thanks, Bab

Comments:

Are you sure there are only 200 rows? The total number of tasks is around 4k. – Rishi Saraf
Yes, it's only a maximum of 200 rows I am inserting into the Hive table. – Bab
Where are you loading the DataFrame from? Do you cache it before doing the take? – Luis Miguel Mejía Suárez
Yes, I am caching. – Bab
Could you provide more information, like the parameters of your spark-submit and part of your code? – Alex Ding

1 Answer


Don't use count before the write if it's not necessary. If your table is already created, use Spark SQL to insert the data into the Hive partitioned table:

spark.sql("INSERT INTO <tgt tbl> PARTITION (<col name>) SELECT <cols>, <partition col> FROM temp_tbl")
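
For example, something along these lines with the Spark 2.x SparkSession API (on Spark 1.6, use sqlContext and registerTempTable instead). This is only a sketch: tgt_db.tgt_tbl, partkey, the column names, and the temp_tbl view name are placeholders, not names from your job.

// Dynamic partitioning must be enabled so the partition value can come from the data
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// Register the DataFrame as a temporary view, then insert through Spark SQL;
// the partition column goes last in the SELECT list
df.createOrReplaceTempView("temp_tbl")
spark.sql(
  """INSERT INTO TABLE tgt_db.tgt_tbl PARTITION (partkey)
    |SELECT col1, col2, partkey FROM temp_tbl""".stripMargin)

Since the DataFrame has only ~200 rows, you could also try df.coalesce(1) before registering the view, so the insert runs as a handful of tasks instead of thousands.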