I have a few questions related with the use of Apache Spark for real-time analytics using Java. When the Spark application is submitted, the data that are stored in Cassandra database are loaded and processed via a machine learning algorithm (Support Vector Machine). Throughout Spark's streaming extension when new data arrive, they are persisted in the database, the existing dataset is re-trained and the SVM algorithm is executed. The output of this process is also stored back in the database.
- Apache Spark's MLLib provides implementation of linear support vector machine. In case that I would like a non-linear SVM implementation, should I implement my own algorithm or may I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs, is there a way to do this without implementing the algorithm from scratch using RDD collections? If not, that would be a huge effort if I would like to test several algorithms.
- Is MLLib providing out of the box utilities for data scaling before executing the SVM algorithm? http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf as defined in section 2.2
- While new dataset is streamed, do I need to re-train the hole dataset? Is there any way that I could just add the new data to the already trained data?