
I'm trying to find a fast way to do an affinity (market basket) analysis on transactional data with a few million rows.

What I've done so far:

  • Created an R Server on top of Spark & Hadoop on cloud (Azure HDInsight)
  • Loaded data on HDFS
  • Got started with RevoScaleR

However, I got stuck at the last step. As far as I understand, I can't process the data with functions that aren't provided by RevoScaleR.

Here is the code for accessing the data on HDFS:

bigDataDirRoot <- "/basket"
mySparkCluster <- RxSpark(consoleOutput = TRUE)
rxSetComputeContext(mySparkCluster)
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
inputFile <- file.path(bigDataDirRoot, "gunluk")

So my inputFile is a CSV file in an Azure Blob, already created at /basket/gunluk.

gunluk_data <- RxTextData(file = inputFile, returnDataFrame = TRUE, fileSystem = hdfsFS)

After running this, I can see the data using head(gunluk_data).

How can I use gunluk_data with arules package functions? Is this possible?

If not, is it possible to process a CSV file stored in HDFS using regular R packages (e.g. arules)?

1 Answer


In arules you can use read.transactions() to read transaction data from a file, and write.PMML() to write out the resulting rules/itemsets.
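A minimal sketch of that workflow, assuming the CSV has been copied out of HDFS to the local file system (read.transactions() reads local paths, so you would first pull the file down, e.g. with rxDataStep() or hdfs dfs -get) and that it is in "single" format with a header; the column names basket_id and item are assumptions for illustration:

```r
library(arules)

# Hypothetical local copy of /basket/gunluk; adjust path and column
# names to match the real file.
trans <- read.transactions("gunluk.csv",
                           format = "single",  # one (transaction, item) pair per row
                           sep    = ",",
                           header = TRUE,
                           cols   = c("basket_id", "item"))  # assumed names

# Mine association rules; thresholds here are placeholders to tune.
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.5))

# Export the rules as PMML.
write.PMML(rules, file = "rules.xml")
```

With a few million rows this fits comfortably in memory on a single node, so mining locally with arules after extracting the file is often simpler than forcing the analysis through RevoScaleR.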