I would like to use the pyspark.mllib.tree.RandomForest module to get a proximity matrix for my observations.
Until now, my data was small enough to be loaded directly into memory, so I used sklearn.ensemble.RandomForestClassifier to get the proximity matrix in the following way: suppose X is a matrix that contains the features and Y is a vector that contains the labels. I trained the random forest to distinguish between objects with label "0" and label "1".

With the trained random forest, I wanted a measure of proximity between every pair of observations in my dataset, computed by counting in how many decision trees the two observations end up in the same terminal node (= leaf). So, for 100 decision trees, the proximity measure between two observations ranges from 0 (they never fall into the same terminal leaf) to 100 (they fall into the same terminal leaf in every tree). The Python implementation of this:
import numpy
from sklearn import ensemble
## data
print X.shape, Y.shape # X holds 8562 observations with 4281 features each; Y holds the 8562 labels
>> (8562, 4281) (8562,)
## train the tree
n_trees = 100
rand_tree = ensemble.RandomForestClassifier(n_estimators=n_trees)
rand_tree.fit(X, Y)
## get proximity matrix
apply_mat = rand_tree.apply(X)
obs_num = len(apply_mat)
sim_mat = numpy.eye(obs_num) * len(apply_mat[0]) # maximum possible similarity = number of trees
for i in xrange(obs_num):
    for j in xrange(i, obs_num):
        vec_i = apply_mat[i]
        vec_j = apply_mat[j]
        sim_val = len(vec_i[vec_i == vec_j]) # number of trees in which i and j share a leaf
        sim_mat[i][j] = sim_val
        sim_mat[j][i] = sim_val
sim_mat_norm = sim_mat / len(apply_mat[0])
print sim_mat_norm.shape
>> (8562, 8562)
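As a side note, the same proximity counts can also be accumulated one tree at a time with numpy broadcasting, which avoids the explicit pairwise Python loop. This is only a rough sketch, but it should give the same result as the loop above:

def proximity_matrix(apply_mat):
    # apply_mat: (n_observations, n_trees) array of leaf indices, as returned by rand_tree.apply(X)
    n_obs, n_trees = apply_mat.shape
    sim_mat = numpy.zeros((n_obs, n_obs))
    for t in xrange(n_trees):
        leaves = apply_mat[:, t]
        # mark every pair of observations that shares a leaf in tree t
        sim_mat += (leaves[:, None] == leaves[None, :])
    return sim_mat / n_trees # normalized to [0, 1], like sim_mat_norm above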
Now I work with data that is too big to fit into memory, so I decided to use Spark instead. I am able to load the data and train the model, but I have not found a way to "apply" the random forest to the data in order to get the proximity matrix. Is there any way to do it? (I use the same implementation as in the Spark documentation: https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#classification):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
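To make the missing step concrete: if I could map each observation to its per-tree leaf indices (the equivalent of sklearn's apply, which I have not found in pyspark.mllib), I imagine the pairwise counting itself could be done on an RDD, roughly like the sketch below. Here leaf_vectors is a hypothetical RDD of (id, leaf-index array) pairs filled with made-up values; producing it from the trained model is exactly the part I am missing.

import numpy
# sc is the existing SparkContext from above
# hypothetical leaf indices for 3 observations and 3 trees -- NOT produced by the model
leaf_vectors = sc.parallelize([
    (0, numpy.array([2, 5, 7])),
    (1, numpy.array([2, 4, 7])),
    (2, numpy.array([3, 5, 7])),
])

def pair_proximity(pair):
    (i, leaves_i), (j, leaves_j) = pair
    # number of trees in which the two observations end up in the same leaf
    return (i, j), int((leaves_i == leaves_j).sum())

# all ordered pairs of observations (O(n^2) records, so this part also needs care at scale)
proximities = leaf_vectors.cartesian(leaf_vectors).map(pair_proximity)
print proximities.collect()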
I will also be happy to hear other ideas that can solve my problem. Thanks!