0
votes

I want to build a text classifier with sklearn and then convert it to an iOS 11 Core ML model file using the coremltools package. I've built three different classifiers (Logistic Regression, Random Forest, and Linear SVC), and all of them work fine in Python. The problem is the way coremltools converts the sklearn model to an iOS model file. According to its documentation, it only supports these models:

  • Linear and Logistic Regression
  • LinearSVC and LinearSVR
  • SVC and SVR
  • NuSVC and NuSVR
  • Gradient Boosting Classifier and Regressor
  • Decision Tree Classifier and Regressor
  • Random Forest Classifier and Regressor
  • Normalizer
  • Imputer
  • Standard Scaler
  • DictVectorizer
  • One Hot Encoder

So it doesn't let me convert the step that vectorizes my text dataset (my classifiers use sklearn's TfidfVectorizer):

import coremltools
coreml_model = coremltools.converters.sklearn.convert(pipeline, input_features='Message', output_feature_names='Label')

Traceback (most recent call last):

File "<ipython-input-3-97beddbdad10>", line 1, in <module>
    coreml_model = coremltools.converters.sklearn.convert(pipeline, input_features='Message', output_feature_names='Label')

  File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter.py", line 146, in convert
    sk_obj, input_features, output_feature_names, class_labels = None)

  File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter_internal.py", line 147, in _convert_sklearn_model
    for sk_obj_name, sk_obj in sk_obj_list]

  File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter_internal.py", line 97, in _get_converter_module
    ",".join(k.__name__ for k in _converter_module_list)))

ValueError: Transformer 'TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 2), norm=u'l2', preprocessor=None, smooth_idf=1,
        stop_words='english', strip_accents='unicode', sublinear_tf=1,
        token_pattern='\\w+', tokenizer=None, use_idf=1, vocabulary=None)' not supported; 
supported transformers are coremltools.converters.sklearn._dict_vectorizer,coremltools.converters.sklearn._one_hot_encoder,coremltools.converters.sklearn._normalizer,coremltools.converters.sklearn._standard_scaler,coremltools.converters.sklearn._imputer,coremltools.converters.sklearn._NuSVC,coremltools.converters.sklearn._NuSVR,coremltools.converters.sklearn._SVC,coremltools.converters.sklearn._SVR,coremltools.converters.sklearn._linear_regression,coremltools.converters.sklearn._LinearSVC,coremltools.converters.sklearn._LinearSVR,coremltools.converters.sklearn._logistic_regression,coremltools.converters.sklearn._random_forest_classifier,coremltools.converters.sklearn._random_forest_regressor,coremltools.converters.sklearn._decision_tree_classifier,coremltools.converters.sklearn._decision_tree_regressor,coremltools.converters.sklearn._gradient_boosting_classifier,coremltools.converters.sklearn._gradient_boosting_regressor.

Is there any way to build a sklearn text classifier without using TfidfVectorizer or CountVectorizer?

1 Answer

1
votes

Right now you can't include a tf-idf vectorizer in your pipeline if you want to convert it to the .mlmodel format. The workaround is to vectorize your data separately and then train the model (Linear SVC, Random Forest, ...) on the vectorized data. On the device, you then have to compute the tf-idf representation of the input yourself and pass it to the model. Here's a copy of the tf-idf function I wrote:

import CoreML
import Foundation

// Builds the tf-idf vector for `document`.
// words_ordered.txt: vocabulary, one word per line, in the model's feature order.
// data.txt: the training documents, one per line, used for document frequencies.
func tfidf(document: String) -> MLMultiArray {
    guard let wordsFile = Bundle.main.path(forResource: "words_ordered", ofType: "txt"),
          let dataFile = Bundle.main.path(forResource: "data", ofType: "txt") else {
        return MLMultiArray()
    }
    do {
        // Drop empty lines so a trailing newline does not add a bogus entry.
        let wordsData = try String(contentsOfFile: wordsFile, encoding: .utf8)
            .components(separatedBy: .newlines).filter { !$0.isEmpty }
        let data = try String(contentsOfFile: dataFile, encoding: .utf8)
            .components(separatedBy: .newlines).filter { !$0.isEmpty }
        let wordsInMessage = document.split(separator: " ")
        let vectorized = try MLMultiArray(shape: [NSNumber(value: wordsData.count)], dataType: .double)
        for i in 0..<wordsData.count {
            let word = wordsData[i]
            // Term frequency: exact token matches divided by the token count.
            var wordCount = 0
            for substr in wordsInMessage {
                if substr.elementsEqual(word) {
                    wordCount += 1
                }
            }
            if wordCount == 0 {
                vectorized[i] = 0
                continue
            }
            // Document frequency: training lines that contain the word.
            var docCount = 0
            for line in data {
                if line.contains(word) {
                    docCount += 1
                }
            }
            if docCount > 0 {
                let tf = Double(wordCount) / Double(wordsInMessage.count)
                let idf = log(Double(data.count) / Double(docCount))
                vectorized[i] = NSNumber(value: tf * idf)
            } else {
                // Avoid dividing by zero for a word missing from data.txt.
                vectorized[i] = 0
            }
        }
        return vectorized
    } catch {
        // Same fallback as for a missing resource: an empty array.
        return MLMultiArray()
    }
}
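
Since the model only sees whatever vectors you trained it on, it can help to have a Python version of the same formula to build the training matrix and to spot-check the on-device output. The following is only a sketch under assumptions: the function name, the toy training lines, and the vocabulary are made up for illustration, and it follows the plain tf * log(N / df) formula of the Swift function, not sklearn's TfidfVectorizer defaults (which, as far as I know, use smoothed idf and L2 normalization, so their numbers will differ).

```python
import math


def tfidf(document, vocabulary, training_docs):
    """Plain tf * log(N / df), following the Swift function's formula."""
    # Split on single spaces and drop empty pieces, like Swift's split(separator:).
    tokens = [token for token in document.split(" ") if token]
    vector = []
    for word in vocabulary:
        # Term frequency numerator: exact token matches.
        count = sum(1 for token in tokens if token == word)
        # Document frequency: training lines containing the word as a
        # substring, mirroring `line.contains(word)` in the Swift code.
        doc_count = sum(1 for line in training_docs if word in line)
        if count == 0 or doc_count == 0:
            vector.append(0.0)
            continue
        tf = float(count) / len(tokens)
        idf = math.log(float(len(training_docs)) / doc_count)
        vector.append(tf * idf)
    return vector


# Made-up training lines and vocabulary, for illustration only.
training_docs = ["free prize now", "lunch at noon", "free lunch"]
vocabulary = ["free", "lunch", "noon", "prize"]

vector = tfidf("free prize", vocabulary, training_docs)
```

Running the same message through this function and through the Swift one should give matching values per vocabulary index, provided both read the same vocabulary order and training lines.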

Edit: Wrote up a whole post on how to do this at http://gokulswamy.me/imessage-spam-detection/.
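
On the Python side, the "vectorize separately, then train" step might look roughly like the sketch below. The toy messages, the feature names "tfidf" and "category", and the helper export_to_coreml are assumptions for illustration, not taken from the post. Note that if you vectorize with TfidfVectorizer as shown, the on-device code has to reproduce exactly the same weighting; otherwise build the training matrix with the same formula you use on the device.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, made up for illustration only.
messages = [
    "win a free prize now",
    "free cash prize waiting",
    "are we still meeting for lunch",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: vectorize outside of the model that will be converted.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(messages)

# Step 2: train the classifier on the tf-idf matrix only.
classifier = LinearSVC()
classifier.fit(features, labels)

# Vocabulary in column order; the device-side code needs the same order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def export_to_coreml(model, path):
    """Hypothetical helper: convert only the classifier, not the vectorizer."""
    import coremltools  # imported lazily so the rest runs without it

    coreml_model = coremltools.converters.sklearn.convert(
        model, input_features="tfidf", output_feature_names="category")
    coreml_model.save(path)
```

Calling export_to_coreml(classifier, "classifier.mlmodel") would then write the model file, and the vocabulary list can be written out one word per line for the app to load.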