Please look at the attached screenshot.
I am trying to improve the performance of my Spark job, and it is taking almost 5 minutes just to execute a take action on a DataFrame.
I use take to make sure the DataFrame has at least some records in it, and only if it does do I proceed with further processing.
I tried both take and count and don't see much difference in execution time.
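To illustrate, this is a minimal sketch of the kind of check I mean (df here is just a placeholder for my real DataFrame):

// Sketch of the existence check; "df" stands in for my real DataFrame.
// take(1) only has to materialize a single row, so in principle it should be
// cheaper than count(), yet in my job both take roughly the same time.
val hasRows = df.take(1).nonEmpty

if (hasRows) {
  // proceed with further processing of df
}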
The other scenario is that it takes around 10 minutes to write the DataFrame into a Hive table (it has at most 200 rows and 10 columns).
df.write.mode("append").partitionBy("date").insertInto(tablename)
Please suggest how I can minimize the time taken by take and by the insert into the Hive table.
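For reference, this is roughly the shape I am considering, caching the DataFrame once so the existence check and the Hive write do not each recompute the whole lineage (df, hiveDB and tableName are placeholders, and I have not verified that this actually helps):

// Hedged sketch: cache df so take(1) and the write reuse the same computed data
// instead of re-running the full lineage twice.
df.cache()

if (df.take(1).nonEmpty) {
  // The target Hive table is already partitioned, so insertInto writes into
  // the existing partitions by column position.
  df.coalesce(1)                              // ~200 rows fit easily in one partition
    .write
    .mode("append")
    .insertInto(hiveDB + "." + tableName)
}

df.unpersist()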
Updates:
Here is my spark submit : spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar
- Code details:
It is a simple DataFrame with 8 columns and around 200 records, and I am using the following code to insert it into the Hive table.
df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)
Thanks, Bab
Comments:
- Do you cache the Dataframe before doing the take? – Luis Miguel Mejía Suárez
- Could you share your spark-submit and part of your code? – Alex Ding