I'm using Ruby to implement a Naive Bayes classifier. I need to classify a text into one of 4 categories.
I have tried to optimize it in several ways, but none of them seems to help: I removed stopwords, stemmed the words, parameterized the text, etc.
I trained it with 170 text samples, but when I try to predict a new text the result is often wrong, and all 4 categories end up with very similar probabilities.
What else could I do to improve accuracy?
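To show what I mean by the probabilities being so close, here is a tiny self-contained sketch of how I print the whole distribution that nbayes returns (the categories and training tokens are just placeholders, and I'm assuming the result object supports per-category lookup with [] in addition to max_class):

require 'nbayes'

# Minimal example: print the full category => probability distribution,
# not just the winning class.
nb = NBayes::Base.new
nb.train(%w[bom produto excelente recomendo], 'elogio')
nb.train(%w[produto ruim quebrado devolucao], 'reclamacao')

result = nb.classify(%w[produto excelente])
puts "elogio:     #{result['elogio']}"
puts "reclamacao: #{result['reclamacao']}"
puts "Prediction: #{result.max_class}"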
The full code looks like this:
require 'nbayes'
require 'csv'
require 'active_support/all'
require 'lingua/stemmer'
def remove_stopwords(list)
  # Drop every token that appears in stopwords.csv (one stopword per row)
  stopwords_array = []
  CSV.foreach("stopwords.csv") do |row|
    stopwords_array << row[0]
  end
  list - stopwords_array
end

def stemmer_array(list)
  # Reduce each token to its Portuguese stem
  stemmer = Lingua::Stemmer.new(:language => "pt")
  list.map { |x| stemmer.stem(x) }
end

def prepare_string(text)
  # parameterize lowercases the text, strips accents and joins the words
  # with "-", so splitting on "-" gives the tokens
  list = text.parameterize.split('-')
  list = remove_stopwords(list)
  stemmer_array(list)
end
nbayes = NBayes::Base.new

# Train on every labelled row: columns 4 and 5 hold the text,
# column 7 holds the category; rows without a label are skipped
CSV.foreach("contacts.csv") do |row|
  if row[7] != "{:value=>nil, :label=>nil}"
    nbayes.train(prepare_string("#{row[4]} #{row[5]}"), row[7])
  end
end
new_text = "TEXT TO PREDICT"
result = nbayes.classify(prepare_string(new_text))
puts "Text: #{new_text}\n\n"
puts "´´´´´´´´´´´´´´´´´´´´´´´"
puts "Prediction: #{result.max_class}\n\n"
puts "´´´´´´´´´´´´´´´´´´´´´´´"