How to use multiple features for text in text classification?

Question

So, I have labeled tweets as retweeted or not retweeted and I have to use logistic regression to build a model to predict whether a tweet will be retweeted or not.

The problem I am facing is that I don't know how to use multiple features with logistic regression. The features I have to use are tf-idf, LDA, whether a tweet has been retweeted, and how many times tweets from a certain user have been retweeted in the past.

How can I use 4 features in binary classification? Any help would be greatly appreciated.

Which tool are you using for this problem (scikit-learn, TensorFlow, ...)? The procedure for 2 features is the same as for 4 features; there is no difference. — Luis Leal
scikit-learn. How would we go about it? Is there a reference tutorial? — Faizan Ahmad
Can you share a small example of your dataset? That way I can help better. — Luis Leal
Sure. I have a large number of tweets labelled as retweeted or not retweeted. I have to perform binary classification on whether a tweet will get retweeted or not. For this, I have to use multiple features, i.e. the total number of tweets of a particular user, tf-idf scores and a few more. How can I incorporate all these features? Right now, I am performing my analysis on just the tf-idf scores. — Faizan Ahmad
Hi, I meant an example of your dataset, something like a small example table. For example, if your features are tf-idf, feature2 and feature3, and your target is "retweeted", you will need 2 numpy structures: one with your features (one row per observation, one column per feature) and the other a vector with the output (for example 1 = retweet, 0 = no retweet). Having done that, you instantiate a LogisticRegression object and call "fit" on it. I can't post code in a comment, so I will post this in a new answer. — Luis Leal

Luis Leal · Accepted Answer · 2016-10-22T03:22:35

Here's just an example using the classifier's default parameters. The idea is that the same procedure applies whether you have two features or more:

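import numpy as np
from sklearn.linear_model import LogisticRegression
# One row per tweet (num_rows of them), one column per feature (3 here)
dataset = np.zeros(shape=(num_rows, 3), dtype=np.float32)
# One label per tweet: 1 = retweeted, 0 = not retweeted (1-D, as fit expects)
retweeted_output = np.zeros(shape=(num_rows,), dtype=np.float32)
# Perform some actions to fill your data structures
model = LogisticRegression()
model.fit(dataset, retweeted_output)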
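
To make the "one column per feature" idea concrete for text, here is a minimal sketch of one way to put tf-idf, LDA topic proportions and a per-user count side by side in a single feature matrix. The toy tweets, the `past_retweets` counts, the choice of 2 topics and the `log1p` scaling are all assumptions made up for illustration; they are not from the question. The retweeted flag is used here as the target, not as an input column.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
tweets = [
    "breaking news about the election results",
    "my cat is sleeping on the keyboard again",
    "election results are in, read the full story",
    "coffee and rain this morning",
    "full story on the breaking election news",
    "rain again, the cat refuses to go outside",
]
# Hypothetical count of past retweets for each tweet's author.
past_retweets = np.array([120, 3, 95, 0, 200, 5])
# Target: 1 = retweeted, 0 = not retweeted.
retweeted = np.array([1, 0, 1, 0, 1, 0])

# Feature block 1: tf-idf scores, sparse, shape (n_tweets, vocabulary size).
X_tfidf = TfidfVectorizer().fit_transform(tweets)

# Feature block 2: LDA topic proportions, dense, shape (n_tweets, 2).
# LDA is fitted on raw term counts rather than on tf-idf values.
term_counts = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_lda = lda.fit_transform(term_counts)

# Feature block 3: the per-user count as a single column.
# log1p keeps large counts from dwarfing the other columns (an assumed choice).
X_user = np.log1p(past_retweets).reshape(-1, 1)

# Stack the blocks column-wise: every row is still one tweet.
X = hstack([X_tfidf, csr_matrix(X_lda), csr_matrix(X_user)]).tocsr()

model = LogisticRegression()
model.fit(X, retweeted)
predictions = model.predict(X)
```

The stacked matrix has as many columns as the tf-idf vocabulary plus 2 topic columns plus 1 count column, and the call to `fit` is the same as with any other feature matrix. On real data you would fit the vectorizers and the LDA model on the training split only and reuse them to transform the test split.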
