
I have 220 GB of data. I have read it into a Spark DataFrame with 2 columns: JournalID and Text. The DataFrame now has 2.7 million (27 lakh) rows.

With the NGram class, I have added two more columns, Unigram and Bigram, to the DataFrame, containing the unigrams and bigrams present in the Text column. Then I compute TF-IDF over the Unigram and Bigram columns using the TF and IDF classes of pyspark and add it as one more column in the DataFrame.
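For reference, a sketch of such a pipeline (the column names and numFeatures below are illustrative, not the exact ones from my code):

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF

# Tokenize the raw text, then build unigram and bigram columns
tokenizer = Tokenizer(inputCol="Text", outputCol="Tokens")
unigram = NGram(n=1, inputCol="Tokens", outputCol="Unigram")
bigram = NGram(n=2, inputCol="Tokens", outputCol="Bigram")

# Term frequencies for each n-gram column, then IDF to get TF-IDF vectors
tf_uni = HashingTF(inputCol="Unigram", outputCol="TF_Unigram", numFeatures=1 << 18)
tf_bi = HashingTF(inputCol="Bigram", outputCol="TF_Bigram", numFeatures=1 << 18)
idf_uni = IDF(inputCol="TF_Unigram", outputCol="TFIDF_Unigram")
idf_bi = IDF(inputCol="TF_Bigram", outputCol="TFIDF_Bigram")

pipeline = Pipeline(stages=[tokenizer, unigram, bigram, tf_uni, tf_bi, idf_uni, idf_bi])
featurized = pipeline.fit(df).transform(df)  # df has the JournalID and Text columns
```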

Now I have a JournalID and a TF-IDF vector for each row in the DataFrame. I want to apply SVM with all types of kernels, using the TF-IDF vector as the feature and JournalID as the label. Since multiclass SVM is not present in the ML package of pyspark, I will have to use the SVM implementation from sklearn.

What is the best way to proceed? Should I convert this big DataFrame into a pandas DataFrame and then apply sklearn algorithms over the columns of the pandas DataFrame, or is there a better way?

Most implementations of SVM don't support incremental learning, where you can split the data into parts and learn on each part. Maybe that's why it's not present in pyspark. And scikit-learn's SVM implementation does not support it either. – Vivek Kumar

1 Answer


To train an SVM you don't need to pass all of the data to the classifier. Hence, you can sample the data (e.g. 1M rows) with just the necessary columns (for example, you do not need the raw text) and then convert the sample to a pandas DataFrame.
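For example, a sketch of that approach, assuming the feature column is called TFIDF and the sample fraction is chosen to give roughly 1M of the 2.7M rows:

```
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.svm import SVC

# Sample ~1M rows, keeping only the label and feature columns
sample = featurized.select("JournalID", "TFIDF").sample(fraction=0.37, seed=42)
pdf = sample.toPandas()

# Convert each pyspark SparseVector to a 1-row scipy CSR matrix and stack them,
# so the feature matrix stays sparse (SVC accepts sparse input)
def to_csr(v):
    return csr_matrix((v.values, v.indices, [0, len(v.indices)]), shape=(1, v.size))

X = vstack([to_csr(v) for v in pdf["TFIDF"]])
y = pdf["JournalID"].values

clf = SVC(kernel="rbf")  # or any other kernel sklearn supports
clf.fit(X, y)
```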

If you want to train your model on the whole of the data, you can load one chunk at a time, sized to fit in your RAM, and feed each chunk to the model in turn. In other words, load a chunk for training and unload it after training, so you never have to hold the whole dataset in RAM.
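As the comment under the question notes, sklearn's kernel SVC has no partial_fit, so chunk-wise training only works if you swap in an incremental learner. The sketch below substitutes SGDClassifier with hinge loss (i.e. a linear SVM trained incrementally); the chunk size is an illustrative assumption:

```
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.linear_model import SGDClassifier

def to_csr(v):
    # one pyspark SparseVector -> 1-row scipy CSR matrix
    return csr_matrix((v.values, v.indices, [0, len(v.indices)]), shape=(1, v.size))

# partial_fit needs the full set of labels up front
classes = np.array([r.JournalID for r in featurized.select("JournalID").distinct().collect()])
clf = SGDClassifier(loss="hinge")

# toLocalIterator() streams rows to the driver, so only one chunk is held in RAM at a time
chunk, chunk_size = [], 50_000
for row in featurized.select("JournalID", "TFIDF").toLocalIterator():
    chunk.append(row)
    if len(chunk) == chunk_size:
        X = vstack([to_csr(r.TFIDF) for r in chunk])
        y = np.array([r.JournalID for r in chunk])
        clf.partial_fit(X, y, classes=classes)
        chunk = []
if chunk:
    X = vstack([to_csr(r.TFIDF) for r in chunk])
    y = np.array([r.JournalID for r in chunk])
    clf.partial_fit(X, y, classes=classes)
```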