We have a Spark Streaming application which is reading the data from Pubsub and applying some transformation and then convert the JavaDStream to Dataset and then write the results into BigQuery normalize tables.
Below is the sample Code. All the normalize tables are partitioned on CurrentTimestamp column. Is there any parameter we can set to improve write performance?
pubSubMessageDStream
.foreachRDD(new VoidFunction2<JavaRDD<PubSubMessageSchema>, Time>() {
@Override
public void call(JavaRDD<PubSubMessageSchema> v1, Time v2) throws Exception {
Dataset<PubSubMessageSchema> pubSubDataSet = spark.createDataset(v1.rdd(), Encoders.bean(PubSubMessageSchema.class));
---
---
---
for (Row payloadName : payloadNameList) {
Dataset<Row> normalizedDS = null;
if(payloadNameAList.contains(payloadName) {
normalizedDS = dataSet.filter(col(colA.equalTo(<Value>)));
} else if(payloadNameBList.contains(payloadName) {
normalizedDS = dataSet.filter(col(colA.equalTo(<Value>)));
}
normalizedDS.selectExpr(columnsBigQuery).write().format("bigquery")
.option("temporaryGcsBucket", gcsBucketName)
.option("table", tableName)
.option("project", projectId)
.option("parentProject", parentProjectId)
.mode(SaveMode.Append)
.save();
}
}
}