3 votes

I'm using Ruby to implement Naive Bayes. I need to classify a text into one of 4 categories.

I tried to optimize it in several ways, but none of them seems to work: I removed stopwords, stemmed the words, parameterized the text, etc.

I trained with 170 text samples, but when I try to classify a new text the result is often wrong, and all 4 categories end up with very similar probabilities.

What else could I do to improve accuracy?

The code looks like this:

require 'nbayes'
require 'csv'
require 'active_support/all'
require 'lingua/stemmer'

# Drop any token that appears in stopwords.csv (one stopword per row).
def remove_stopwords(list)
  stopwords_array = []

  CSV.foreach("stopwords.csv") do |row|
    stopwords_array << row[0]
  end

  list - stopwords_array
end

# Reduce each token to its Portuguese stem.
def stemmer_array(list)
  stemmer = Lingua::Stemmer.new(:language => "pt")
  list.map { |x| stemmer.stem(x) }
end

# Normalize the text (parameterize strips accents and punctuation),
# tokenize on '-', then remove stopwords and stem.
def prepare_string(text)
  list = text.parameterize.split('-')
  list = remove_stopwords(list)
  stemmer_array(list)
end

nbayes = NBayes::Base.new

# Train on columns 4 and 5, using column 7 as the label;
# skip rows where the label is missing.
CSV.foreach("contacts.csv") do |row|
  if row[7] != "{:value=>nil, :label=>nil}"
    nbayes.train(prepare_string("#{row[4]} #{row[5]}"), row[7])
  end
end

new_text = "TEXT TO PREDICT"

result = nbayes.classify(prepare_string(new_text))

puts "Text: #{new_text}\n\n"

puts "-----------------------"
puts "Prediction: #{result.max_class}\n\n"
puts "-----------------------"
170 items just isn't enough data... – cs95
Like Coldspeed said, 170 is likely not enough. Also, you haven't told us what the categories are; depending on how easy or difficult it is to separate them, the task might be very difficult even with large datasets. And "similar probabilities" at the end is meaningless: typically Naive Bayes will return very low scores for all categories (and by the way, they aren't class probabilities; they are probabilities of observing the text given the class, which is why the scores are so low). What matters is the category with the highest value, i.e. the class that was the most likely to generate the text. – Pascal Soucy
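The point about low scores can be illustrated in plain Ruby (the class names and numbers below are made up for illustration, not taken from the question): even when every per-class likelihood is vanishingly small, the argmax is still well defined, so you compare in log space and take the largest value.

```ruby
# Hypothetical per-class log-likelihoods for one document (made-up numbers).
# exp(-84) is effectively zero, so never compare raw probabilities --
# compare the logs and pick the largest.
log_likelihoods = {
  "category_a" => -84.2,
  "category_b" => -86.9,
  "category_c" => -85.1,
  "category_d" => -87.5,
}

best = log_likelihoods.max_by { |_klass, ll| ll }.first
puts best   # the class most likely to have generated the text
```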

1 Answer

0
votes

The dataset is too small to train a text classification model. Also check the distribution of the target variable: since there are 4 classes, make sure there is no class imbalance. For example, if 100 of your 170 data points belong to a single class and the remaining 70 are split across the other 3, your model will produce exactly this kind of output (where all predictions fall into one class). Also plot a confusion matrix to see how well your model is actually performing.
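A quick sketch of both checks in Ruby (assuming the same contacts.csv layout as in the question, with the label in row[7]; the helper names here are mine, not from any library):

```ruby
require 'csv'

# Tally how many training examples each class has. Rows are shaped like
# the question's contacts.csv, where row[7] holds the label.
def class_counts(rows)
  counts = Hash.new(0)
  rows.each do |row|
    label = row[7]
    next if label.nil? || label == "{:value=>nil, :label=>nil}"
    counts[label] += 1
  end
  counts
end

# Build a confusion matrix from [actual, predicted] label pairs
# collected on a held-out test set.
def confusion_matrix(pairs)
  matrix = Hash.new { |h, k| h[k] = Hash.new(0) }
  pairs.each { |actual, predicted| matrix[actual][predicted] += 1 }
  matrix
end
```

Usage idea: `class_counts(CSV.read("contacts.csv"))` and eyeball the split; if one class dominates the counts, the model will lean heavily toward it no matter what preprocessing you do.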