0
votes

I'm trying to train a model using H2O.ai's H2O-3 Automl Algorithm on AWS SageMaker using the console.

My model's goal is to predict if an arrest will be made based upon the year, type of crime, and location.

My data has 8 columns:

  • primary_type: enum
  • description: enum
  • location_description: enum
  • arrest: enum (true/false), this is the target column
  • domestic: enum (true/false)
  • year: number
  • latitude: number
  • longitude: number

When I use the SageMaker console on AWS and create a new training job using the H2O-3 Automl Algorithm, I specify the primary_type, description, location_description, and domestic columns as categorical.

However, in the logs of the training job I always see the following two lines:

Converting specified columns to categorical values:
[]

This leads me to believe the categorical_columns attribute in the training hyperparameter is not being taken into account.

I have tried the following hyperparameters with the same output in the logs each time:

{'classification': 'true', 'categorical_columns':'primary_type,description,location_description,domestic', 'target': 'arrest'}
{'classification': 'true', 'categorical_columns':['primary_type','description','location_description','domestic'], 'target': 'arrest'}

I thought the list of categorical columns was supposed to be a comma-delimited string, which would then be split into a list.
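To make that expectation concrete, here is a minimal sketch of the parsing I assumed would happen. The function name `split_categorical_columns` is hypothetical and is not taken from the H2O script; it only shows the kind of split I had in mind.

```python
def split_categorical_columns(raw):
    """Hypothetical helper: turn 'a,b,c' into ['a', 'b', 'c']."""
    if not raw:
        return []
    # Strip whitespace around each name and drop empty entries.
    return [name.strip() for name in raw.split(",") if name.strip()]


columns = split_categorical_columns(
    "primary_type,description,location_description,domestic"
)
print(columns)
```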

I expected the list of categorical column names to be output in the logs instead of an empty list, like so:

Converting specified columns to categorical values:
['primary_type','description','location_description','domestic']

Can anyone help me figure out how to get these categorical columns to apply to the training of my model?

Also, I think this is the code that runs when I train my model, but I have yet to confirm that: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L93-L151

2 Answers

1
votes

This seems to be a bug in the H2O package. The code at https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106 shows that it reads categorical_columns directly from the top level of the hyperparameters, not from inside the nested training field. However, when I move the categorical_columns field up a level, the algorithm doesn't recognize it. So I have no solution for this.
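To illustrate the kind of mismatch I mean, here is a minimal sketch. The dictionary layout and both function names are assumptions made for illustration, not the actual contents of the train script. A lookup at the top level finds nothing when the value is nested under training, which would explain the empty list in the logs.

```python
import json

# Assumed layout: all settings nested inside a JSON string under "training".
hyperparameters = {
    "training": json.dumps({
        "classification": "true",
        "target": "arrest",
        "categorical_columns": "primary_type,description,location_description,domestic",
    })
}


def columns_from_top_level(params):
    # Hypothetical: looks for the key directly on the outer dictionary.
    raw = params.get("categorical_columns", "")
    return [c for c in raw.split(",") if c]


def columns_from_training(params):
    # Hypothetical: decodes the nested "training" string first.
    training = json.loads(params.get("training", "{}"))
    raw = training.get("categorical_columns", "")
    return [c for c in raw.split(",") if c]


print(columns_from_top_level(hyperparameters))  # empty list: key is not at this level
print(columns_from_training(hyperparameters))
```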

0
votes

Based on the code here: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106

it seems that the parameter expects a comma-separated string, e.g. "cat,dog,bird".

I would try "primary_type,description,location_description,domestic" as the input parameter, rather than ['primary_type', 'description'... etc]
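If the column names already live in a Python list, a short sketch like the following builds the comma-separated form. The variable names are just for illustration, and whether the algorithm actually picks the value up is a separate question.

```python
categorical_columns = [
    "primary_type",
    "description",
    "location_description",
    "domestic",
]

# Join without spaces so a plain split(",") recovers the exact names.
hyperparameter_value = ",".join(categorical_columns)
print(hyperparameter_value)
```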