I am trying to experiment with a credit card fraud detection dataset using Spark MLlib. The dataset has many more 0s (meaning non-fraud) than 1s (meaning fraud). To solve a class imbalance problem like this, is there an algorithm available in Spark, such as SMOTE? I am using logistic regression as the model.
I have not tried it myself, but I was searching for an answer to the same question. I found an (untested/unvalidated) implementation of SMOTE in Spark: gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b. There is also a discussion of the same problem where the suggested solution is to use class weights (stackoverflow.com/questions/33372838/…), although in that example the classes are not as unbalanced as they would be in a fraud dataset. A minimal sketch of the SMOTE idea appears after these comments.
- waltersantosf
@waltersantosf thanks a lot!!
- Ayan Biswas
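For reference, here is a minimal, driver-side sketch of the SMOTE idea (not the linked gist): for each minority-class point, it interpolates toward one of its k nearest minority neighbours to create synthetic rows. The column names "features" and "label", the function name smote_sketch, and the assumption that the minority class fits in driver memory are illustrative assumptions, not part of the original discussion.

# Minimal driver-side sketch of SMOTE (illustrative only).
# Assumes a DataFrame with a "features" vector column and a numeric "label"
# column, and that the minority class is small enough to collect to the driver.
import random
import numpy as np
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

def smote_sketch(df, minority_label=1, k=5, per_point=2, seed=42,
                 features_col="features", label_col="label"):
    rng = random.Random(seed)
    rows = df.filter(df[label_col] == minority_label).select(features_col).collect()
    pts = np.array([r[features_col].toArray() for r in rows])
    synthetic = []
    for p in pts:
        # k nearest minority neighbours of p (excluding p itself)
        dists = np.linalg.norm(pts - p, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        for _ in range(per_point):
            q = pts[rng.choice(list(neighbours))]
            # New synthetic point on the segment between p and its neighbour
            synthetic.append((Vectors.dense(p + rng.random() * (q - p)),
                              float(minority_label)))
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(synthetic, [features_col, label_col])

The returned synthetic rows would then be unioned back onto the training data (after matching column names and types) before fitting the classifier.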
1 Answer
You can try weightCol within logistic regression. Something like this:
from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression

# Per-class row counts, joined back onto the training data
temp = train.groupBy("LabelCol").count()
new_train = train.join(temp, "LabelCol", how="leftouter")
# Balanced weight per row: total_rows / (num_labels * rows_in_that_class)
num_labels = train.select(countDistinct("LabelCol")).first()[0]
train1 = new_train.withColumn("weight", new_train.count() / (num_labels * new_train["count"]))
# Logistic regression initialisation with the weight column
lr = LogisticRegression(labelCol="LabelCol", weightCol="weight", family="multinomial")
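As a follow-up, a minimal sketch of fitting and scoring with the weighted model; the "features" column and the 'test' DataFrame are assumptions about how the data has been prepared (e.g. via VectorAssembler), not part of the original answer.

# Fit the weighted model and score a held-out split.
# Assumes train1 already has a "features" vector column and that 'test'
# is a held-out DataFrame with the same schema.
model = lr.fit(train1)
predictions = model.transform(test)
predictions.select("LabelCol", "probability", "prediction").show(5)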