
I am using SparkR and trying to convert a "SparkDataFrame" to "transactions" in order to mine associations between items/products.

I have found a similar example at this link https://blog.aptitive.com/building-the-transactions-class-for-association-rule-mining-in-r-using-arules-and-apriori-c6be64268bc4 but it only applies if you are working with an R data.frame. I currently have my data in this format:

CUSTOMER_KEY_h PRODUCT_CODE

    1   SAVE
    1   CHEQ
    1   LOAN
    1   LOAN
    1   CARD
    1   SAVE
    2   CHEQ
    2   LOAN
    2   CTSAV
    2   SAVE
    2   CHEQ
    2   SAVE
    2   CARD
    2   CARD
    3   LOAN
    3   CTSAV
    4   SAVE
    5   CHEQ
    5   SAVE
    5   CARD
    5   LOAN
    5   CARD
    6   CHEQ
    6   CHEQ

and would like to end up with something like this:

CUSTOMER_KEY_h  PRODUCT_CODE
    1          {SAVE, CHEQ, LOAN, LOAN, CARD, SAVE}
    2          {CHEQ, LOAN, CTSAV, SAVE, CHEQ, SAVE, CARD, CARD}
    3          {LOAN, CTSAV}
    4          {SAVE}
    5          {CHEQ, SAVE, CARD, LOAN, CARD}
    6          {CHEQ, CHEQ}

Alternatively, if I can get the SparkR equivalent of this R script, df2 <- apply(df, 2, as.logical), that would be helpful.


1 Answer


The arules package is not compatible with SparkR. If you want to mine association rules on Spark, you should use Spark's own utilities. First use collect_set to combine each customer's records into a basket (note that collect_set deduplicates items within a basket, which is what Spark's FP-growth implementation expects; use collect_list if you need to keep duplicates):

library(magrittr)

df <- createDataFrame(data.frame(
  CUSTOMER_KEY_h = c(
    1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 6),
  PRODUCT_CODE = c(
    "SAVE","CHEQ","LOAN","LOAN","CARD","SAVE","CHEQ","LOAN","CTSAV","SAVE",
    "CHEQ","SAVE","CARD","CARD","LOAN","CTSAV","SAVE","CHEQ","SAVE","CARD","LOAN",
    "CARD","CHEQ","CHEQ")
))

baskets <- df %>% 
  groupBy("CUSTOMER_KEY_h") %>% 
  agg(alias(collect_set(column("PRODUCT_CODE")), "items"))

Fit the model (please check the spark.fpGrowth docs for the full list of available options):

fpgrowth <- spark.fpGrowth(baskets)
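The default support and confidence thresholds can be too strict for a small dataset like this one; they can be passed explicitly. The values below are illustrative, not a recommendation:

```r
# Same call as above, with explicit thresholds; assumes a running
# Spark session and the `baskets` SparkDataFrame built earlier.
fpgrowth <- spark.fpGrowth(baskets,
                           itemsCol = "items",
                           minSupport = 0.2,
                           minConfidence = 0.5)
```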

and use it to extract association rules:

arules <- spark.associationRules(fpgrowth)

arules %>% head()
        antecedent consequent confidence lift
1       CARD, LOAN       SAVE          1  1.5
2       CARD, LOAN       CHEQ          1  1.5
3 LOAN, SAVE, CHEQ       CARD          1  2.0
4       SAVE, CHEQ       LOAN          1  1.5
5       SAVE, CHEQ       CARD          1  2.0
6       CARD, SAVE       LOAN          1  1.5
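Besides the rules, the fitted model also exposes the frequent itemsets themselves, via spark.freqItemsets from the same SparkR FP-growth API:

```r
# Frequent itemsets with their frequencies, as a SparkDataFrame:
itemsets <- spark.freqItemsets(fpgrowth)
itemsets %>% head()
```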

If you use Spark < 2.3.0 you can try replacing:

alias(collect_set(column("PRODUCT_CODE")), "items")

with

expr("collect_set(PRODUCT_CODE) AS items")
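so on older versions the aggregation step becomes (same pipeline, only the aggregation expression changes):

```r
baskets <- df %>%
  groupBy("CUSTOMER_KEY_h") %>%
  agg(expr("collect_set(PRODUCT_CODE) AS items"))
```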