I'm trying to find a fast way to do an affinity analysis on transactional market basket data with a few million rows.
What I've done so far:
- Created an R Server on top of Spark & Hadoop on cloud (Azure HDInsight)
- Loaded data on HDFS
- Got started with RevoScaleR
However, I'm stuck at the last step. As far as I understand, I can't process the data with any function that isn't provided by RevoScaleR itself.
Here is the code for accessing the data on HDFS:
bigDataDirRoot <- "/basket"
mySparkCluster <- RxSpark(consoleOutput = TRUE)   # Spark compute context
rxSetComputeContext(mySparkCluster)
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)  # myNameNode and myPort are defined earlier in my script
inputFile <- file.path(bigDataDirRoot, "gunluk")
So my inputFile is a CSV in an Azure Blob, already created at /basket/gunluk.
gunluk_data <- RxTextData(file = inputFile, returnDataFrame = TRUE, fileSystem = hdfsFS)
After running this, I am able to see the data using head(gunluk_data).
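The only approach I can think of is to pull everything into an ordinary data frame first, roughly along these lines (switching to a local compute context and the exact rxDataStep call are guesses on my part, and I don't know whether a few million rows are practical this way):

rxSetComputeContext("local")        # leave the Spark compute context
# Omitting outFile should make rxDataStep return an in-memory data.frame;
# I believe there is a cap on the returned size (maxRowsByCols) that may need raising for my data.
gunluk_df <- rxDataStep(inData = gunluk_data)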
How can I use gunluk_data with functions from the arules package? Is this even possible?
If not, is it possible to process a CSV file that sits in HDFS using regular R packages (e.g. arules)?
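For reference, this is roughly the arules workflow I want to end up with once the data is in a plain data frame (transaction_id and item are placeholders for my actual column names, and the support/confidence thresholds are arbitrary):

library(arules)

# Build a transactions object from (transaction id, item) pairs
trans <- as(split(as.character(gunluk_df$item), gunluk_df$transaction_id), "transactions")

# Mine association rules with apriori and look at the top ones by lift
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))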