3 votes

I would like to build an RNN (LSTM) model with an Embedding layer for a categorical feature (Item ID). My training set looks like this:

train_x = [[[184563.1], [184324.1], [187853.1], [174963.1], [181663.1]], [[…],[…],[…],[…],[…]], …]

I want to predict the sixth item ID:

train_y = [0, 1, 2, …, 12691]

I have 12692 unique item IDs, the timestep length is 5, and this is a classification task.

This is a brief summary of what I've done so far (please correct me if I'm wrong):

  1. One-hot-encoding for the categorical feature:

train_x = [[[1 0 0 … 0 0 0], [0 1 0 … 0 0 0], [0 0 1 … 0 0 0], […], […]], [[…],[…],[…],[…],[…]], …]

  2. Build the model:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, BatchNormalization, Dense

model = Sequential()

model.add(Embedding(input_dim=12692 , output_dim=250, input_length=5))

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2)) 
model.add(BatchNormalization())

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.1)) 
model.add(BatchNormalization())

model.add(LSTM(128))
model.add(Dropout(0.2)) 
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(12692, activation='softmax'))

opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)
model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=opt,
      metrics=['accuracy'])

print(model.summary())

history = model.fit(
      train_x, train_y,
      batch_size=64,
      epochs=epochs,
      validation_data=(validation_x, validation_y))

score = model.evaluate(validation_x, validation_y, verbose=0)

I get this model summary:

[screenshot of model.summary() output]

Train on 131204 samples, validate on 107904 samples

But then this error appears:

ValueError: Error when checking input: expected embedding_input to have 2 dimensions, but got array with shape (131204, 5, 12692)

Where is my mistake, and what is the solution?


1 Answer

1 vote

The Embedding layer turns positive integers (indexes) into dense vectors of fixed size (see the Keras docs). So train_x should not be one-hot encoded: each entry should be the integer index of the item in the vocabulary, i.e. the integer corresponding to the categorical feature.
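To see this concretely, here is a tiny standalone demo (just a sketch, separate from your model) showing that the layer consumes integer indices and emits one dense vector per index:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Toy vocabulary of 10 items, each mapped to a 4-dim dense vector.
demo = Sequential([Embedding(input_dim=10, output_dim=4, input_length=5)])

batch = np.array([[1, 2, 3, 4, 5]])  # shape (1, 5): five integer item indices
print(demo.predict(batch).shape)     # (1, 5, 4): one dense vector per index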

train_x.shape should be (n_samples, 5), where each entry is the integer index of the categorical feature.

train_y.shape should be (n_samples,), where each entry is the index of the sixth item in your time series.
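One gotcha: your raw item IDs are arbitrary values like 184563.1, so you first have to map them to contiguous integer indices in the range 0..12691. A minimal sketch using np.unique (assuming every ID you care about appears in the data; the raw array below is hypothetical):

import numpy as np

# Hypothetical raw data: one sample of six raw item IDs (5 inputs + 1 target).
raw = np.array([[184563.1, 184324.1, 187853.1, 174963.1, 181663.1, 184324.1]])

# vocab holds the sorted unique IDs; indexed holds each ID's position in vocab.
vocab, indexed = np.unique(raw, return_inverse=True)
indexed = indexed.reshape(raw.shape)

train_x = indexed[:, :5]  # first five indices -> input timesteps
train_y = indexed[:, 5]   # sixth index -> class label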

Working sample

import numpy as np
import keras
from keras.layers import Embedding, LSTM, Dense

n_samples = 100

# Dummy data: each sample is five integer item indices,
# and the target is the integer index of the sixth item.
train_x = np.random.randint(0, 12692, size=(n_samples, 5))
train_y = np.random.randint(0, 12692, size=(n_samples,))


model = keras.models.Sequential()

model.add(Embedding(input_dim=12692+1, output_dim=250, input_length=5))  # input_dim must be at least max index + 1
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(32, activation='relu'))
model.add(Dense(12692, activation='softmax'))

opt = keras.optimizers.Adam(lr=0.001, decay=1e-6)
model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=opt,
      metrics=['accuracy'])

print(model.summary())

history = model.fit(
      train_x, train_y,
      batch_size=64,
      epochs=32)
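
Once trained, inference works the same way: feed five integer indices, take the argmax over the 12692 output classes, and, if you kept a vocab array like the one sketched above, map the winning index back to the raw item ID. A rough example:

# Predict the sixth item for one new sequence of five item indices.
new_seq = np.random.randint(0, 12692, size=(1, 5))
probs = model.predict(new_seq)           # shape (1, 12692)
next_idx = np.argmax(probs, axis=-1)[0]  # predicted class index
print(next_idx)                          # vocab[next_idx] would give the raw ID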