
I am trying to experiment with a credit card fraud detection dataset using Spark MLlib. The dataset I have contains many more 0's (non-fraud) than 1's (fraud). To handle a class imbalance problem like this, is there an algorithm available in Spark, such as SMOTE? I am using logistic regression as the model.

I have not tried it, but I was searching for the answer to the same question. I found an implementation (not tested/validated) of SMOTE in Spark: gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b. There is also a discussion about the same problem where the suggested solution is to use weights (stackoverflow.com/questions/33372838/…), but in that example the classes are not as unbalanced as they would be in a fraud data set. - waltersantosf
@waltersantosf thanks a lot!! - Ayan Biswas
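For reference, since the SMOTE gist linked above is untested, one simple alternative is naive random oversampling of the minority (fraud) class before training. A rough sketch, assuming a DataFrame named train with a label column named LabelCol (both names are assumptions, not from the question):

    # Split the data by class; "train" and "LabelCol" are assumed names
    fraud = train.filter(train["LabelCol"] == 1)
    non_fraud = train.filter(train["LabelCol"] == 0)
    ratio = non_fraud.count() / fraud.count()
    # Sample the minority class with replacement until it roughly matches the majority
    oversampled_fraud = fraud.sample(withReplacement=True, fraction=float(ratio), seed=42)
    balanced_train = non_fraud.union(oversampled_fraud)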

1 Answer


You can try the weightCol parameter of logistic regression, something like this:

    from pyspark.sql.functions import countDistinct
    from pyspark.ml.classification import LogisticRegression

    # Per-class counts, joined back onto the training data
    temp = train.groupBy("LabelCol").count()
    new_train = train.join(temp, "LabelCol", how="leftouter")
    num_labels = train.select(countDistinct("LabelCol")).first()[0]
    # Weight each row by total_rows / (num_labels * class_count)
    train1 = new_train.withColumn("weight", new_train.count() / (num_labels * new_train["count"]))
    # Logistic regression initialization with the per-row weights
    lr = LogisticRegression(labelCol="LabelCol", weightCol="weight", family="multinomial")
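The weighted DataFrame can then be used to fit the model. A minimal sketch of that last step, assuming the feature vector has already been assembled into a "features" column (e.g. with VectorAssembler) and that a held-out DataFrame named test exists; both are assumptions, not part of the original answer:

    # Fit on the weighted training data; "features" and "test" are assumed names
    model = lr.fit(train1)
    predictions = model.transform(test)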