8
votes

(I am testing my abilities to write short but effective questions so let me know how I do here)

I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:

[[[time_step_trial_0, feature, feature, ...]
  [time_step_trial_0, feature, feature, ...]]                  
 [[time_step_trial_1, feature, feature, ...]
  [time_step_trial_1, feature, feature, ...]]
 [[time_step_trial_2, feature, feature, ...]
  [time_step_trial_2, feature, feature, ...]]]

The 1d portion of this 3d array holds a time step and all feature values that were observed at that time step. The 2d block contains all 1d arrays (time steps) that were observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).

For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval, recording a time step every 0.5 seconds, I form 1d arrays that pair each time step with the features recorded at that time step. Then I form a 2d array from all time steps corresponding to one Formula 1 race car's run on the track. Finally, I create a 3d array holding all F1 cars and their time-series data. I want to train and test a model that detects anomalies in new cars' runs relative to the common F1 trajectories on the course.

I am aware that the TensorFlow models support 2d arrays for training and testing. I was wondering what procedures I would have to go through in order to train and test the model on all of the independent trials (2d) contained in this 3d array. In addition, I will be adding more trials in the future, so what is the proper procedure for continually updating my model with the new data/trials to strengthen my LSTM?

Here is the model I was initially trying to replicate, for a purpose other than human activity recognition: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. A more suitable model, which I would much rather use for anomaly detection in time-series data, is this one: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model that, given a set of non-anomalous time-series training data, can detect anomalies in the test data, where parts of the data over time are defined as "out of family."

Can you show us what you have tried so far? Consider writing a minimal reproducible example that further clarifies the inputs and expected outcomes of the specific issue that you are facing at the moment. Some parts of the question are unclear or too open-ended, such as asking "what procedures one would have to go through in order to train this model". - E_net4 the curator
@E_net4 I have tried replicating Guillaume's LSTM but I get a dimensionality problem. I will post an example above of what I am looking for. - Julian Rachman
@E_net4 updated. - Julian Rachman
@JulianRachman thanks for the links. I managed to run LSTM-Human-Activity-Recognition. Since you said "I am trying to replicate", did you manage to grasp their file structure / input tensor? There are some reshapes. - Clemens Tolboom
@ClemensTolboom eh, sort of. I just wanted to know how this model deals with 3d arrays so that I may apply that to the second paper on unpredictable pattern anomaly detection. The second paper is the ultimate goal. - Julian Rachman

2 Answers

4
votes

I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).

Your data has three dimensions:

feature_size = the number of different features (torque, velocity, etc.)

number_of_time_steps = the number of time steps collected for a single car

number_of_cars = the number of cars

It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).

You can arrange these matrices so that each row is an observation and each column is a different parameter. Your network may expect the opposite layout, in which case you will have to transpose the matrices, so check how its input is formatted.

So each matrix is of size: number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
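As a minimal sketch of that layout, assuming each car's run is already available as a NumPy matrix of the same size: the sizes and the random stand-in data below are made up for illustration and are not taken from the question.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
number_of_cars = 3
number_of_time_steps = 4
feature_size = 2

# One (number_of_time_steps x feature_size) matrix per car; random stand-in data.
rng = np.random.default_rng(0)
per_car = [rng.normal(size=(number_of_time_steps, feature_size))
           for _ in range(number_of_cars)]

# Stack along a new leading axis so that data[n] is the matrix for car n.
data = np.stack(per_car, axis=0)

print(data.shape)     # (number_of_cars, number_of_time_steps, feature_size)
print(data[1].shape)  # (number_of_time_steps, feature_size)
```

Stacking only works if every car has the same number of time steps, which matches the constant sampling rate and window described in the question.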

To convert your array to this format, you can use this block of code (note that you can already access a single sample in your array with A[n], but this gives each accessed sample the shape you expect):

import numpy as np

A = [[['car1', 'timefeatures1'],['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'],['car2', 'timefeatures2']], 
     [['car3', 'timefeatures1'],['car3', 'timefeatures2']]
    ]

easy_format = np.array(A)

Now you can get an individual sample with easy_format[n], where n is the sample you want.

easy_format[1] prints

array([['car2', 'timefeatures1'],
       ['car2', 'timefeatures2']],
      dtype='|S12')

easy_format[1].shape = (2,2)

Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
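For instance, some recurrent-network code expects a time-major layout (time steps first) rather than one sample per leading index. A rough sketch of both kinds of transposition, using a made-up numeric array in place of real car data:

```python
import numpy as np

# Hypothetical sample-major array: (cars, time steps, features).
sample_major = np.arange(3 * 4 * 2, dtype=np.float32).reshape(3, 4, 2)

# Swap the first two axes to get time-major: (time steps, cars, features).
time_major = np.transpose(sample_major, (1, 0, 2))

# Transpose a single sample so rows are features and columns are time steps.
features_first = sample_major[0].T

print(time_major.shape)      # (4, 3, 2)
print(features_first.shape)  # (2, 4)
```

The values are untouched; only the axis order changes, so time_major[t, n] is the same feature vector as sample_major[n, t].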

What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence-to-sequence LSTM or RNN. Your original sequence is your time series for a given trial, and the encoder compresses it into an intermediate representation (an embedding) from which the decoder can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this LSTM on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use the same trained network to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
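The scoring step can be sketched independently of any particular network. The snippet below assumes you already have the model's reconstructions as an array; here they are faked by adding noise to the input, and `reconstruction_errors` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def reconstruction_errors(original, reconstructed):
    """Mean squared error per trial and per time step.

    Both inputs have shape (trials, time_steps, features).
    """
    squared = (original - reconstructed) ** 2
    per_step = squared.mean(axis=2)    # (trials, time_steps)
    per_trial = per_step.mean(axis=1)  # (trials,)
    return per_trial, per_step

rng = np.random.default_rng(0)
original = rng.normal(size=(5, 20, 3))

# Stand-in for model output: a close copy of the input (assumption, no real model).
reconstructed = original + rng.normal(scale=0.01, size=original.shape)
# Simulate a poor reconstruction of trial 3 between time steps 10 and 14.
reconstructed[3, 10:15] += 2.0

per_trial, per_step = reconstruction_errors(original, reconstructed)
worst_trial = int(np.argmax(per_trial))
worst_step = int(np.argmax(per_step[worst_trial]))
print("highest error: trial", worst_trial, "around time step", worst_step)
```

Keeping the per-time-step errors as well as the per-trial average lets you see which part of a run was reconstructed badly, not just that the run as a whole looks unusual.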

Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq. The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.

Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py

I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).

# Written for TensorFlow 1.x: tf.contrib does not exist in TensorFlow 2.x.
import numpy as np
import tensorflow as tf


learning_rate = 0.001
training_epochs = 30000
display_step = 100

hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
# Random binary sequences, time-major: (time_steps, samples, step_dims).
test_data = np.random.choice([0, 1], size=(time_steps, samples, step_dims))

seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])

# One [samples, step_dims] tensor per time step, which is what static_rnn expects.
# (Reshaping to a single [-1, step_dims] tensor would give the RNN only one step.)
encoder_inputs = tf.unstack(seq_input, axis=0)

# The decoder sees a zero "GO" vector first, then the input shifted by one step.
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
                  + encoder_inputs[:-1])

cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
# Project the decoder's hidden state back to step_dims values per time step.
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)

y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]

# Sum of squared reconstruction errors over all time steps.
loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            # Reconstructions ("logits") next to the sequences they should match.
            print("logits")
            print(sess.run(y_pred, feed_dict=feed))
            print("labels")
            print(sess.run(y_true, feed_dict=feed))

            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(cost_value))

print("Optimization Finished!")
1
votes

Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:

1. Feature-only anomaly: Here you consider individual feature vectors and decide whether any of them is anomalous, without considering when it was measured. In your example, the feature vector [torque, speed, acceleration, ...] is an anomaly if one or more of its values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].

2. Time-feature anomaly: Here your inputs depend on when you measure the feature. Your current features may depend on the previous features measured over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not an outlier if it appears further along in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features].
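Both input layouts can be produced from the same 3d array of trials. The following is only a sketch: `sliding_windows` is a hypothetical helper, and the window length and stride are arbitrary values picked for the example.

```python
import numpy as np

def sliding_windows(trials, window, stride):
    """Cut every trial into overlapping time windows.

    trials:  (n_trials, n_steps, n_features)
    returns: (n_starts * n_trials, window, n_features)
    """
    n_steps = trials.shape[1]
    starts = range(0, n_steps - window + 1, stride)
    # Each slice has shape (n_trials, window, n_features).
    pieces = [trials[:, s:s + window, :] for s in starts]
    return np.concatenate(pieces, axis=0)

# Made-up data: 2 trials, 10 time steps, 3 features.
trials = np.arange(2 * 10 * 3, dtype=np.float32).reshape(2, 10, 3)

# (1) Feature-only: every time step becomes an independent row.
feature_only = trials.reshape(-1, trials.shape[-1])

# (2) Time-feature: windows of 4 steps, starting every 2 steps.
time_feature = sliding_windows(trials, window=4, stride=2)

print(feature_only.shape)  # (20, 3)  -> [batch, features]
print(time_feature.shape)  # (8, 4, 3) -> [batch, time_window, features]
```

A stride smaller than the window length is what makes the windows overlap; a stride equal to the window length would give disjoint chunks instead.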

It should be very simple to start with (1): train an autoencoder and, based on the error between input and output, choose a threshold such as 2 standard deviations from the mean to determine whether a sample is an outlier or not.
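The thresholding itself needs nothing more than the errors measured on normal data. A small sketch, with made-up error values and hypothetical helper names, that only flags errors above the mean (a low reconstruction error is not a concern here):

```python
import numpy as np

def fit_threshold(train_errors, n_std=2.0):
    """Threshold = mean + n_std standard deviations of the training errors."""
    return float(np.mean(train_errors) + n_std * np.std(train_errors))

def is_outlier(errors, threshold):
    """Boolean mask of the errors that exceed the threshold."""
    return np.asarray(errors) > threshold

# Made-up reconstruction errors from normal training samples.
train_errors = np.array([0.10, 0.12, 0.09, 0.11, 0.10, 0.08])
threshold = fit_threshold(train_errors)

# Two new samples: one typical, one with a much larger error.
flags = is_outlier([0.10, 0.50], threshold)
print(threshold, flags)
```

The factor of 2 is a starting point, not a rule; a larger n_std trades missed anomalies for fewer false alarms.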

For (2), you can follow the second paper you mentioned using a seq2seq model, where your decoder error will determine which features are outliers. You can check on this for the implementation of such a model.