I oversee a research project where we aggregate newspaper articles on political violence in Africa, and then identify and code incidents. We keep track of where and when the incident took place, the actors involved, the number of people killed, etc. You can see the dataset here:
https://docs.google.com/spreadsheets/d/1_QYl4xhMu5nZVluprOgRs6rUzgkkBemapdsg5lFzKU/pubhtml
This is a labor-intensive process, and I think machine learning could be helpful. I'm trying to figure out the best approach.
My question: Am I better off using a set of keywords to decide how to code each article? For example:
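    if "boko haram" in article.lower():
        actor = "Boko Haram"  # code the actor as Boko Haram
or
    for place in locations:  # locations: list of known place names
        if place in article:
            location = place  # code the location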
Or can I use my existing dataset and the text from the articles and apply machine learning to do the feature extraction?
Some features are straightforward: if the article describes a violent event and Boko Haram is mentioned, we code Boko Haram. Or if a bomb is mentioned, we code bomb.
Some are more complicated. To determine if the event is "sectarian", we look for violent events where conflict between ethnic groups is referenced ('Fulani', 'Igbo', etc.).
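To make the keyword approach concrete, here is a minimal sketch of a rule table that maps each code to a list of trigger terms. The `KEYWORD_RULES` contents and the `code_article` helper are hypothetical and only illustrate the idea; the real lists would come from the project's codebook, and treating any mention of an ethnic group as "sectarian" is a simplifying assumption.

```python
import re

# Hypothetical keyword lists; the real ones would come from the project's codebook.
KEYWORD_RULES = {
    "Boko Haram": ["boko haram"],
    "bomb": ["bomb", "bombing", "explosive"],
    "sectarian": ["fulani", "igbo"],  # assumption: any mention counts
}

def code_article(text):
    """Return the set of codes whose keywords appear in the article text."""
    lowered = text.lower()
    codes = set()
    for code, keywords in KEYWORD_RULES.items():
        for keyword in keywords:
            # \b keeps "bomb" from matching inside a longer word such as "bombast".
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                codes.add(code)
                break
    return codes
```

Calling `code_article(article_text)` on each article would give a first-pass set of codes to review. Exact word matching also misses inflected forms such as "bombs" or "bombed" unless they are listed, which is one of the trade-offs of a pure keyword approach.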
We code location based on a list of 774 districts. The challenge here is that there are often multiple spellings for the same place. Time is also complicated because the event is usually described relative to the publication date, e.g. "last Tuesday" or "Wednesday night."
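Both problems can be sketched with the Python standard library alone. The example below uses `difflib.get_close_matches` to map a variant spelling onto a known district, and `datetime` arithmetic to turn a weekday phrase into a date relative to the article's publication date. The `DISTRICTS` sample, the `cutoff` value, and the two helper functions are assumptions for illustration; it also assumes the publication date of each article is known.

```python
import difflib
from datetime import date, timedelta

# Hypothetical sample; the real list would hold all 774 district names.
DISTRICTS = ["Maiduguri", "Jos North", "Kaduna South", "Gwoza"]

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def match_district(name, districts=DISTRICTS, cutoff=0.8):
    """Map a possibly misspelled place name to the closest known district."""
    lookup = {d.lower(): d for d in districts}
    hits = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

def resolve_weekday(phrase, published):
    """Return the latest date before `published` on the weekday named in `phrase`."""
    lowered = phrase.lower()
    for index, day in enumerate(WEEKDAYS):
        if day in lowered:
            days_back = (published.weekday() - index) % 7
            # A weekday equal to the publication day is read as a week earlier.
            return published - timedelta(days=days_back or 7)
    return None
```

For an article published on Friday, 12 June 2015, `resolve_weekday("last Tuesday", date(2015, 6, 12))` returns 9 June 2015. The fuzzy matcher only works on a candidate place name that has already been pulled out of the text, so it would need to be paired with some way of finding those candidates.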
I experimented with this a while ago, using TextBlob's Naive Bayes classifier to try to figure out location, and ran into two problems. First, my program would never finish; I'm assuming that performing NLP on two thousand 500-word articles requires more juice than my MacBook Air can handle. Second, I had encoding issues with the article text. I'm hoping that switching to Python 3 will help resolve this.
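For reference, the core of a multinomial Naive Bayes text classifier is small enough to sketch in plain Python. The version below is not TextBlob's implementation; `TinyNaiveBayes` and `tokenize` are hypothetical names, and the design (word counts per label with add-one smoothing) is just the textbook form. Training is a single pass over the tokens, so a corpus of this size should in principle be manageable on a laptop, which suggests the slowdown may lie in how features are extracted rather than in the hardware.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, for illustration only."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of articles
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would look like `TinyNaiveBayes().fit(article_texts, coded_locations).predict(new_article)`, where the training labels come from the existing hand-coded dataset.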
If I'm going to sink some time into this, I'd love some recommendations on the best path to take. If it is indeed machine learning, maybe I should be using something other than Naive Bayes? Maybe I should be running this in the cloud so I have more power? Or using a different package from TextBlob?
Guidance is much appreciated!