3
votes

I oversee a research project where we aggregate newspaper articles on political violence in Africa, and then identify and code incidents. We keep track of where and when the incident took place, the actors involved, the number of people killed, etc. You can see the dataset here:

https://docs.google.com/spreadsheets/d/1_QYl4xhMu5nZVluprOgRs6rUzgkkBemapdsg5lFzKU/pubhtml

This is a labor intensive process and I think machine learning could be helpful. I'm trying to figure out the best approach.

My question: Am I better off using a set of keywords to decide how to code each article? E.g.

if "boko haram" in article:
     code Boko Haram

or 

if [list of locations] in article:
    code location

Or can I use my existing dataset and the text from the articles and apply machine learning to do the feature extraction?

Some features are straightforward: if the article describes a violent event and Boko Haram is mentioned, we code Boko Haram. Or if a bomb is mentioned, we code bomb.
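The straightforward rules above can be written as a small keyword lookup. A minimal sketch (the keyword lists here are illustrative placeholders, not our full coding scheme):

```python
# Illustrative keyword-to-code mappings; the real lists would be much longer
ACTOR_KEYWORDS = {'boko haram': 'Boko Haram'}
WEAPON_KEYWORDS = {'bomb': 'bomb', 'explosion': 'bomb'}

def code_article(text):
    """Return a dict of codes triggered by keywords in the article text."""
    text = text.lower()
    codes = {}
    for keyword, actor in ACTOR_KEYWORDS.items():
        if keyword in text:
            codes['actor'] = actor
    for keyword, weapon in WEAPON_KEYWORDS.items():
        if keyword in text:
            codes['weapon'] = weapon
    return codes
```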

Some are more complicated. To determine if the event is "sectarian", we look for violent events where conflict between ethnic groups is referenced ('Fulani', 'Igbo', etc.).

We code location based on a list of 774 districts. The challenge here is that there are often multiple spellings for the same place. Time is also complicated because the event is usually described as "last Tuesday," or "Wednesday night."
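For the multiple-spellings problem, one option is fuzzy matching of extracted place names against the canonical district list. A sketch using only the standard library (the district names below are hypothetical examples, not our actual list of 774):

```python
import difflib

# Hypothetical sample of canonical district names
DISTRICTS = ['Maiduguri', 'Damboa', 'Gwoza', 'Konduga']

def match_district(name, cutoff=0.7):
    """Map a possibly misspelled place name to a canonical district, or None."""
    matches = difflib.get_close_matches(name, DISTRICTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff needs tuning: too low and distinct districts with similar names collide, too high and common misspellings are missed.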

I did experiment with this a while ago, using TextBlob's Naive Bayes classifier to try to figure out location, and ran into two problems. First, my program would never finish; I'm assuming performing NLP on two thousand 500-word articles requires more juice than my MacBook Air can handle. Second, I hit encoding issues with the article text; I'm hoping that switching to Python 3 will resolve them.

If I'm going to sink some time into this, I'd love some recommendations on the best path to take. If it is indeed machine learning, maybe I should be using something other than Naive Bayes? Maybe I should be running this in the cloud so I have more power? A different package than TextBlob?

Guidance is much appreciated!


1 Answer

1
votes

Since posting my initial question, I've successfully applied the Naive Bayes and Decision Tree classifiers from TextBlob, as well as Naive Bayes and a Support Vector Machine from Sklearn. I should also add that Python 3 and the correct encoding (for my dataset, 'latin1') have eliminated my earlier string encoding and decoding issues.

The key for TextBlob was to build a custom feature extractor:

import pandas as pd

def simple_define_features(tokens):
    # Load the mapping of states to Local Government Areas (LGAs)
    lga_state = pd.read_csv('lgas.csv')[['State', 'LGA']]

    # Build a dict of {state: [LGAs in that state]}
    state_lga = {}
    for state in set(lga_state['State']):
        lgas = list(lga_state.loc[lga_state['State'] == state, 'LGA'])
        # Normalize names like "Borno State" to "Borno"
        state_lga[state.replace(' State', '').strip()] = lgas

    # One count feature and one presence feature per state and per LGA
    features = {}
    for state, lgas in state_lga.items():
        features['count({})'.format(state)] = tokens.count(state)
        features['has({})'.format(state)] = (state in tokens)
        for lga in lgas:
            features['count({})'.format(lga)] = tokens.count(lga)
            features['has({})'.format(lga)] = (lga in tokens)

    return features

This function checks each article against a set of keywords, in this case, locations, and builds a feature dictionary. See the description of how a feature extractor works in the NLTK book: http://www.nltk.org/book/ch06.html
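To see the shape of the feature dictionary without the CSV dependency, here is a toy version of the same extractor (the place names are made up):

```python
# Toy stand-in for the real extractor: hard-coded places instead of lgas.csv
PLACES = ['Kano', 'Jos']

def toy_define_features(tokens):
    """Build count/presence features from a token list."""
    features = {}
    for place in PLACES:
        features['count({})'.format(place)] = tokens.count(place)
        features['has({})'.format(place)] = (place in tokens)
    return features
```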

Currently, using the function below, I've been able to reach 75 percent accuracy when guessing the state-level location. Keep in mind that my training set is rather small, only about 4,000 rows.

The function is:

from textblob.classifiers import DecisionTreeClassifier

def tb_dt_classifier(json_file, feature_function, test_text, test_label):
    # Train on labeled examples stored in a JSON file, using the custom extractor
    with open(json_file, 'r') as f:
        cl = DecisionTreeClassifier(f, format='json', feature_extractor=feature_function)
    # Classify each article in the test set
    test_text['guess'] = test_text.apply(lambda x: cl.classify(x))
    return test_text['guess']

TextBlob is very slow, however.

Sklearn has proven to be a lot faster. The difference, as far as I can tell, is that Sklearn must have everything converted into vectors up front. For the labels, I created dummy variables: pd.get_dummies() for binary variables, and astype('category').cat.codes when there are more than two categories. From there, Sklearn's count vectorizer creates the vectors for the text.
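For example, the label encoding step might look like this (the column names and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'sectarian': ['yes', 'no', 'yes'],
                   'state': ['Borno', 'Kano', 'Plateau']})

# Binary label -> a 0/1 dummy column
df['sectarian_code'] = pd.get_dummies(df['sectarian'])['yes']

# Multi-class label -> integer category codes
df['state_code'] = df['state'].astype('category').cat.codes
```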

Here is the function I've been using:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

def text_classifier_SVM(df, train_text, train_output, test_text, test_output):
    # Bag of words -> tf-idf -> linear SVM trained with SGD (hinge loss)
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
    ])
    text_clf.fit(df[train_text], df[train_output])
    predicted = text_clf.predict(df[test_text])
    # Accuracy on the held-out articles
    return np.mean(predicted == df[test_output])

I still have a lot of tweaking to do, but this has started to return some meaningful results, and seems to be a lot more efficient than trying to guess every possible pattern and build it into some complex keyword search.