
I'm using omnicat-bayes to analyse documents (text classification). With this gem I can create categories and "feed" them with documents. The categories currently hold enough documents to be "good enough" at recognizing which category a new document belongs in.

The create action in my Documents controller goes through a few steps:

  1. Creating a new Bayes instance
  2. Creating the categories that will be used
  3. Taking the pre-documents to train the categories
  4. Actually training the categories

(All of those steps happen inside the run_all function.)

The create action:

def create
  @document = Document.new(document_params)
  @document.case_id = @case.id
  if @document.save
    run_all
    # Running the classify function on reden aanmelding
    classify_one = @bayes.classify(@document.reden_aanmelding)
    document_category = classify_one.to_hash[:top_score_key]
    # Updating the document category by the top key returned by Bayes
    @document.update_attribute(:category, document_category)
    finding_required_records
    # Training Cees Buddy with the document that got saved
    @bayes.train(document_category, @document.reden_aanmelding)
    redirect_to case_path(@case)
  else
    render :new
  end
end

The run_all function, which is called once @document.save succeeds (I know this isn't really best practice), performs the four steps named above.

Once the create action has finished, the Bayes instance is gone and the AI is "stupid" again, so to speak.

My question is: where is the proper place to create the new instance and the new categories and to feed them with documents out of my database, and how can I accomplish this? Would a singleton be interesting here?


1 Answer


This is quite a tricky problem, given that you'll probably want to scale the application to deal with more than a handful of documents.

The thing is that in production mode the web server of a Rails application will usually fork into multiple processes, or even run on more than one machine. This means that documents trained in one process will be unknown in all the others, even if you use a singleton pattern.
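To make the forking problem concrete, here is a minimal sketch using only the Ruby standard library. `TrainedState` is a made-up stand-in for a singleton that keeps trained documents in memory; it is not part of omnicat-bayes. The sketch assumes a Unix-like system, since `fork` is not available on Windows.

```ruby
# Hypothetical stand-in for a singleton holding trained state in memory.
module TrainedState
  @documents = []

  def self.train(document)
    @documents << document
  end

  def self.count
    @documents.size
  end
end

TrainedState.train("trained before the fork")

reader, writer = IO.pipe

# The child starts with a copy of the parent's memory, then diverges.
pid = fork do
  reader.close
  TrainedState.train("trained only in the child")
  writer.puts TrainedState.count
  writer.close
  exit!(0) # skip at_exit hooks inherited from the parent
end

writer.close
child_count = reader.read.to_i
reader.close
Process.wait(pid)

parent_count = TrainedState.count
puts "child saw #{child_count} documents, parent still has #{parent_count}"
```

The child process sees two documents while the parent still has one, which is what would happen between two forked web-server workers that each hold their own classifier.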

So with only the omnicat-bayes gem, the best way to go about it is to create a separate micro service that runs in its own process and does nothing but process documents. The main application should then enqueue the processing as asynchronous jobs, so that it is okay if things take a bit longer when the training process is busy with other documents.
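The shape of that idea, one owner of the classifier that works through a queue of jobs, can be sketched in a single process with `Queue` and `Thread` from the standard library. This is only an illustration: in a real setup the worker would be the separate service process, the enqueueing side would probably be a background job in the Rails app, and the made-up `ToyClassifier` below would be replaced by the actual Bayes instance.

```ruby
# Hypothetical stand-in classifier: scores a text by how often its words
# appeared in each category's training documents.
class ToyClassifier
  def initialize
    @words = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  end

  def train(category, text)
    text.downcase.scan(/\w+/) { |word| @words[category][word] += 1 }
  end

  def classify(text)
    tokens = text.downcase.scan(/\w+/)
    scores = @words.map { |category, counts| [category, tokens.sum { |t| counts[t] }] }
    best = scores.max_by { |_, score| score }
    best && best.first
  end
end

jobs = Queue.new
results = Queue.new
classifier = ToyClassifier.new

worker = Thread.new do
  # Only this thread touches the classifier, so jobs run one at a time
  # and the trained state lives in exactly one place.
  while (job = jobs.pop) != :stop
    action, *args = job
    case action
    when :train then classifier.train(*args)
    when :classify then results << classifier.classify(*args)
    end
  end
end

# The "web application" side only enqueues work and moves on.
jobs << [:train, "billing", "invoice payment overdue"]
jobs << [:train, "support", "login password reset"]
jobs << [:classify, "I forgot my password"]
jobs << :stop
worker.join

category = results.pop
puts category
```

Because the jobs are processed in order by a single worker, the classify job sees both training documents and picks the category with the most matching words.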

How you communicate with this external OmniCat instance is up to you. The most convenient way might be dRuby, but I should add that I have no production experience with it. A more future-proof solution would be a simple HTTP + JSON API. In that case you could later even swap the service that does the training and categorisation for a more powerful library that's not based on Ruby.
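As a rough sketch of the HTTP + JSON variant, the following uses only the standard library (`socket`, `net/http`, `json`). Everything specific here is an assumption made for illustration: the `/train` and `/classify` paths, the JSON field names, the `post_json` helper and the `KeywordClassifier` stand-in. A real service would wrap the actual Bayes instance and run behind a proper HTTP server in its own process rather than in a thread.

```ruby
require "json"
require "net/http"
require "socket"

# Hypothetical stand-in for the classifier owned by the service.
class KeywordClassifier
  def initialize
    @keywords = {}
  end

  def train(category, text)
    (@keywords[category] ||= []).concat(text.downcase.split)
  end

  def classify(text)
    words = text.downcase.split
    best = @keywords.max_by { |_, known| (known & words).size }
    best && best.first
  end
end

classifier = KeywordClassifier.new
server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
port = server.addr[1]

# Deliberately tiny HTTP handling: one POST request per connection.
service = Thread.new do
  loop do
    client = server.accept
    path = client.gets.split[1]
    length = 0
    while (line = client.gets) && line != "\r\n"
      length = line.split(":", 2)[1].to_i if line.downcase.start_with?("content-length:")
    end
    payload = JSON.parse(client.read(length))
    status = "200 OK"
    reply =
      case path
      when "/train"
        classifier.train(payload["category"], payload["text"])
        { "status" => "trained" }
      when "/classify"
        { "category" => classifier.classify(payload["text"]) }
      else
        status = "404 Not Found"
        { "error" => "unknown path" }
      end
    body = JSON.generate(reply)
    client.write("HTTP/1.1 #{status}\r\nContent-Type: application/json\r\n" \
                 "Content-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n#{body}")
    client.close
  end
end

# Client side: what the main application (or one of its jobs) would call.
def post_json(port, path, payload)
  response = Net::HTTP.post(URI("http://127.0.0.1:#{port}#{path}"),
                            JSON.generate(payload),
                            "Content-Type" => "application/json")
  JSON.parse(response.body)
end

post_json(port, "/train", { "category" => "billing", "text" => "invoice payment overdue" })
post_json(port, "/train", { "category" => "support", "text" => "login password reset" })
answer = post_json(port, "/classify", { "text" => "please reset my password" })
puts answer["category"]

service.kill
server.close
```

Since the client only sees paths and JSON, the process behind the port could later be rewritten in another language without touching the Rails side.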