I have data arriving in batches, and several of my columns come from pivoting the values of another column, so the number of columns varies between batches. One of those columns ('surprise') rarely receives any data, so it is sometimes missing after the pivot. I solved it like this, and it seems to work so far, but I'm looking for a better approach because this doesn't look like good code:
try:
    inner_join = agg_sentiment.join(
        agg_emotion,
        agg_sentiment.topic_agg == agg_emotion.topic) \
        .select('created_at', 'topic', 'counts', 'positivity_rate',
                'fear', 'joy', 'sadness', 'surprise', 'anger')
except Exception:
    inner_join = agg_sentiment.join(
        agg_emotion,
        agg_sentiment.topic_agg == agg_emotion.topic) \
        .select('created_at', 'topic', 'counts', 'positivity_rate',
                'fear', 'joy', 'sadness', 'anger')
As you can see, I removed 'surprise' from the select statement in the except branch. Is there a way in PySpark to handle this kind of situation?
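One direction I considered is inspecting `df.columns` and filling in whatever is missing, instead of catching the exception. A minimal sketch of that check (the column-splitting part is plain Python so it can run standalone; the PySpark usage, with the DataFrame names from above, is shown in the comments):

```python
from typing import List, Tuple

def split_present_missing(actual: List[str],
                          desired: List[str]) -> Tuple[List[str], List[str]]:
    """Partition the desired output columns into those the DataFrame
    actually has and those that need to be filled with null literals."""
    present = [c for c in desired if c in actual]
    missing = [c for c in desired if c not in actual]
    return present, missing

DESIRED = ['created_at', 'topic', 'counts', 'positivity_rate',
           'fear', 'joy', 'sadness', 'surprise', 'anger']

# A batch in which the pivot produced no 'surprise' column:
batch_cols = ['created_at', 'topic', 'counts', 'positivity_rate',
              'fear', 'joy', 'sadness', 'anger']

present, missing = split_present_missing(batch_cols, DESIRED)
print(missing)  # ['surprise']

# With PySpark this would plug in roughly like so:
#   from pyspark.sql import functions as F
#   joined = agg_sentiment.join(
#       agg_emotion, agg_sentiment.topic_agg == agg_emotion.topic)
#   present, missing = split_present_missing(joined.columns, DESIRED)
#   inner_join = joined.select(
#       *present, *[F.lit(None).alias(c) for c in missing])
```

This keeps a single select for every batch and avoids the try/except, but I'm not sure whether there is a more idiomatic built-in way to do it.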