I have 220 GB of data. I have read it into a Spark DataFrame with two columns: JournalID and Text. The DataFrame now has 2.7 million (27 lakh) rows.
Using the NGram class, I added two more columns, Unigram and Bigram, containing the unigrams and bigrams present in the Text column. Then I computed TF-IDF over the unigram and bigram columns using PySpark's TF and IDF classes and added the result as one more column of the DataFrame.
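For reference, this is roughly the pipeline I am running; a minimal sketch, assuming HashingTF for the term-frequency step and simplified column names:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF

# Tokenize the raw text, then derive the Unigram and Bigram columns.
tokenizer = Tokenizer(inputCol="Text", outputCol="Tokens")
unigram = NGram(n=1, inputCol="Tokens", outputCol="Unigram")
bigram = NGram(n=2, inputCol="Tokens", outputCol="Bigram")

# Hash each n-gram column to term frequencies, then reweight by IDF.
# (HashingTF is an assumption here; CountVectorizer would work similarly.)
tf_uni = HashingTF(inputCol="Unigram", outputCol="TF_Uni")
tf_bi = HashingTF(inputCol="Bigram", outputCol="TF_Bi")
idf_uni = IDF(inputCol="TF_Uni", outputCol="TFIDF_Uni")
idf_bi = IDF(inputCol="TF_Bi", outputCol="TFIDF_Bi")

pipeline = Pipeline(stages=[tokenizer, unigram, bigram,
                            tf_uni, tf_bi, idf_uni, idf_bi])
df = pipeline.fit(df).transform(df)
```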
So now each row of the DataFrame has a JournalID and a TF-IDF vector. I want to train SVMs with all kernel types, using the TF-IDF vector as the feature and JournalID as the label. Since a multiclass SVM is not available in PySpark's ML package, I will have to use scikit-learn's SVM implementation.
What is the best way to proceed? Should I convert this big Spark DataFrame into a pandas DataFrame and then apply scikit-learn's algorithms to its columns, or is there a better way?
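To make the question concrete, this is the conversion I have in mind; a sketch only, assuming a combined TF-IDF column named `TFIDF` that holds Spark SparseVectors (both names are from my description above, not fixed):

```python
from scipy.sparse import csr_matrix, vstack
from sklearn.svm import SVC

# toPandas() collects the entire distributed DataFrame onto the driver,
# which is the step I am unsure about at 220 GB / 2.7 million rows.
pdf = df.select("JournalID", "TFIDF").toPandas()

# Convert each Spark SparseVector into a one-row scipy CSR matrix and
# stack them, so SVC can train on sparse input without densifying.
X = vstack([
    csr_matrix((v.values, v.indices, [0, len(v.values)]), shape=(1, v.size))
    for v in pdf["TFIDF"]
])
y = pdf["JournalID"].values

clf = SVC(kernel="rbf")  # repeat with kernel="linear", "poly", "sigmoid"
clf.fit(X, y)
```

Is something along these lines workable at this scale, or is there a standard approach that avoids pulling everything onto the driver?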