
I am currently working on a hate speech filter using Apache Flink's FlinkML, written in Scala.

I have a huge .csv training dataset containing rows like:

id,count,hate_speech,offensive_language,neither,class,tweet

326,3,0,1,2,2,"""@complex_uk: Ashley Young has tried to deny that bird s*** landed in his mouth ---> http:**** https:****"" hahaha"

My problem is that Flink doesn't include a vectorizer to transform the tweets into a LibSVM file readable by the SVM.fit() function.

Do you have any idea how I could transform the data above, using the "class" column as the label and the "tweet" column as the feature vector, to train my SVM?
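
To make this concrete, here is a minimal sketch of how the rows could be read into (label, tweet) pairs with Flink's DataSet API (the file name and the naive comma split are simplifying placeholders); the part I am missing is the step that turns the tweet text into a feature vector:

    import org.apache.flink.api.scala._

    object PrepareData {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Read the raw CSV lines; a real CSV parser is needed in practice,
        // because tweets contain commas and escaped quotes.
        val raw: DataSet[String] = env.readTextFile("labeled_data.csv")

        // Keep only (class, tweet). Splitting with limit 7 keeps the tweet
        // column in one piece even if it contains commas, but quoted fields
        // are still handled naively -- this is only a placeholder.
        val labeledText: DataSet[(Double, String)] = raw
          .filter(line => !line.startsWith("id,"))   // drop the header row
          .map { line =>
            val cols = line.split(",", 7)            // id,count,hate_speech,offensive_language,neither,class,tweet
            (cols(5).toDouble, cols(6))
          }

        // Missing piece: turn the tweet text into a vector so these pairs can
        // become labeled vectors for FlinkML's SVM.
        labeledText.first(5).print()
      }
    }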

I really appreciate any help; I have been searching for hours.

As far as I know, libSVM doesn't define how you build your vector, so you have to come up with your own vector representation (e.g. TF/IDF or Doc2Vec). - TobiSH
@TobiSH First of all, thanks for your answer. Is there any TF/IDF or Doc2Vec implementation I could integrate into my Flink application through Maven? - IboJaan
@EmiCareOfCell44 Thank you :) I will try this one later. - IboJaan

1 Answer


I guess your problem is not (yet) a Flink problem. Flink is a stream processing engine (batch processing is also possible, but stream processing is Flink's unique selling point): you can define stateful computations on an unbounded stream, and how you do that is up to you. One of the first problems you need to solve is: how do I represent my text as a vector that can be used as input for an SVM? TF/IDF might be a good starting point, and implementations can be found all over the place; haifengl/smile and Deeplearning4j are popular examples.
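
If you want to see the principle before pulling in a library, a hand-rolled TF/IDF over an in-memory collection looks roughly like this (tokenisation, vocabulary handling and normalisation are deliberately simplistic assumptions):

    // Minimal, self-contained TF/IDF sketch; only meant to illustrate the idea.
    object TfIdfSketch {

      def tokenize(text: String): Seq[String] =
        text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

      def main(args: Array[String]): Unit = {
        val docs: Seq[Seq[String]] = Seq(
          "this is a tweet about football",
          "another tweet about football and birds",
          "completely unrelated text"
        ).map(tokenize)

        // Vocabulary: every distinct token gets a fixed index (= vector dimension).
        val vocab: Map[String, Int] = docs.flatten.distinct.zipWithIndex.toMap

        // Document frequency: in how many documents does a token appear?
        val df: Map[String, Int] =
          docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

        val numDocs = docs.size.toDouble

        // TF/IDF vector of one document as a dense Array[Double].
        def tfidf(doc: Seq[String]): Array[Double] = {
          val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
          val vec = new Array[Double](vocab.size)
          counts.foreach { case (token, tf) =>
            val idf = math.log(numDocs / df(token))
            vec(vocab(token)) = (tf / doc.size) * idf
          }
          vec
        }

        docs.map(tfidf).foreach(v => println(v.mkString(" ")))
      }
    }

Each resulting array can then be wrapped, together with the value of the "class" column as the label, into whatever vector type your SVM implementation expects (FlinkML, for instance, has a LabeledVector type for this).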

Please also keep in mind that you are dealing with very short documents (Twitter tweets, if I understood you correctly). You should consider keeping as many tokens (words) as possible. This will increase the size of your vocabulary, which will increase the dimensionality of your vectors (if you stick with some kind of bag-of-words-like model), which in turn will require more training data.
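
With a large vocabulary those vectors are mostly zeros, so a sparse representation pays off. The LibSVM text format you mentioned is already sparse (one line per example: label followed by 1-based index:value pairs, zero entries omitted), so writing the TF/IDF result out could look roughly like this (again assuming the dense arrays from the sketch above):

    // Serialise one (label, dense vector) pair as a line in LibSVM format:
    // "<label> <index>:<value> ...", indices are 1-based, zero entries omitted.
    def toLibSvmLine(label: Double, vec: Array[Double]): String = {
      val features = vec.zipWithIndex.collect {
        case (value, i) if value != 0.0 => s"${i + 1}:$value"
      }
      (label.toString +: features).mkString(" ")
    }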

After solving these ML-related problems, you can think about how to integrate it all into Flink.