
I have data coming in in batches, and several of my columns are produced by pivoting the values of another column, so the number of columns varies between batches. One of those columns ('surprise') rarely receives any data. I solved it like this, and it seems to work so far, but I'm looking for a better approach because this doesn't look like good code:

try:
    # Normal case: the batch contained at least one 'surprise' value,
    # so the pivot created a 'surprise' column.
    inner_join = agg_sentiment.join(agg_emotion,
        agg_sentiment.topic_agg == agg_emotion.topic) \
        .select('created_at', 'topic', 'counts', 'positivity_rate',
                'fear', 'joy', 'sadness', 'surprise', 'anger')
except Exception:
    # Fallback: the 'surprise' column is missing from this batch,
    # so select everything except it.
    inner_join = agg_sentiment.join(agg_emotion,
        agg_sentiment.topic_agg == agg_emotion.topic) \
        .select('created_at', 'topic', 'counts', 'positivity_rate',
                'fear', 'joy', 'sadness', 'anger')

As you can see, I removed 'surprise' from the select statement in the except branch. Is there a way in PySpark to handle this kind of situation more cleanly?
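
One idea I had is to check the DataFrame's columns up front instead of catching the exception. A minimal sketch, assuming agg_emotion is the pivoted DataFrame and that the full set of emotion columns is known in advance (the expected_emotions list and the lit(None) default are my assumptions):

from pyspark.sql import functions as F

# Hypothetical: the complete set of emotion columns the pivot can produce.
expected_emotions = ['fear', 'joy', 'sadness', 'surprise', 'anger']

# Add any emotion column missing from this batch as an all-null column,
# so a single fixed select works for every batch.
for col_name in expected_emotions:
    if col_name not in agg_emotion.columns:
        agg_emotion = agg_emotion.withColumn(col_name, F.lit(None).cast('long'))

inner_join = agg_sentiment.join(agg_emotion,
        agg_sentiment.topic_agg == agg_emotion.topic) \
    .select('created_at', 'topic', 'counts', 'positivity_rate',
            *expected_emotions)

This avoids using exceptions for control flow, but it hard-codes the column list, so I'm not sure it's much better.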

What exact behavior are you looking for? Only join when there is data? – pltc
@pltc I just edited my post and rephrased it to make it clearer. Yes, only join when the data is there, but the number of columns varies depending on the values another column receives: I pivot that column's values into new columns, and one of them ('surprise' in my code) is rarely created because that value doesn't occur often. – Doraemon
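
To illustrate the pivot described in that comment, here is a minimal sketch; the source DataFrame tweets_df and its emotion column are hypothetical names, not from the original code:

agg_emotion = (tweets_df
    .groupBy('topic')
    .pivot('emotion')   # creates one column per distinct value seen in this batch
    .count())
# If no row in the batch has emotion == 'surprise', the pivot simply
# never creates a 'surprise' column, which is what breaks a fixed select.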