BERT for time series classification

Question

I’d like to train a transformer encoder (e.g. BERT) on time-series data for a task that can be modeled as classification. Let me briefly describe the data I’m using before talking about the issue I’m facing.

I’m working with 90-second windows, and I have access to 100 values for each second (i.e. 90 vectors of size 100). My goal is to predict a binary label (0 or 1) for each second (i.e. produce a final vector of 0s and 1s of length 90).

My first idea was to model this as a multi-label classification problem, where I would use BERT to produce a vector of size 90 filled with numbers between 0 and 1 and train it with nn.BCELoss against the ground-truth labels (y_true looks like [0,0,0,1,1,1,0,0,1,1,1,0...,0]). A simple analogy would be to consider each second as a word, and the 100 values I have access to as the corresponding word embedding. The goal is then to train BERT (from scratch) on these sequences of 100-dim embeddings (all sequences have the same length: 90).
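To make the shapes concrete, here is a minimal sketch of the loss side of that setup only. The batch size of 4 and the random tensors are placeholders, not real data, and the random logits stand in for whatever the encoder would output. It also shows nn.BCEWithLogitsLoss as an alternative I am assuming is acceptable here: it takes raw scores and applies the sigmoid internally, which is generally more numerically stable than a separate sigmoid followed by nn.BCELoss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len = 4, 90  # batch size is an arbitrary placeholder

# Stand-in for the model output: one raw score (logit) per second.
logits = torch.randn(batch_size, seq_len)

# Ground truth: one 0/1 label per second, as floats because BCE expects float targets.
y_true = torch.randint(0, 2, (batch_size, seq_len)).float()

# Option A: nn.BCELoss on probabilities (sigmoid applied explicitly).
probs = torch.sigmoid(logits)
loss_bce = nn.BCELoss()(probs, y_true)

# Option B: nn.BCEWithLogitsLoss on raw logits (sigmoid folded into the loss).
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, y_true)

# Hard 0/1 predictions per second, thresholding the probabilities at 0.5.
y_pred = (probs > 0.5).long()
```

Both losses compute the same quantity up to floating-point error, averaged over all batch_size * 90 per-second predictions.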

The problem: when dealing with textual inputs, we simply add the CLS and SEP tokens to the input sequences, and let the tokenizer and the model do the rest of the job. When training directly on embeddings, what should we do to account for CLS and SEP tokens?

One idea I had was to add a 100-dim embedding at position 0 standing for the CLS token, as well as a 100-dim embedding at position 90+1=91 standing for the SEP token. But I don’t know what embeddings I should use for these two tokens, and I’m not sure that’s a good solution either.

Any ideas?

(I tried asking this question on Huggingface forums but didn't get any response.)

I’m voting to close this question because it is not about programming as defined in the help center but about ML theory and/or methodology - please see the intro & NOTE in the machine-learning tag info. — desertnaut

igodfried · Accepted Answer · 2021-02-24T02:53:17

While HuggingFace is very good for NLP, I would not recommend using it for a time series problem. As for the tokens, there is no reason to use CLS or SEP, as you don't need them. The simplest way would be to feed the model data in the format (batch_size, seq_len, n_features) and have it predict (batch_size, seq_len). In this case the input would have shape (batch_size, 90, 100) and the model would return a tensor of shape (batch_size, 90). That is, unless you think there are temporal dependencies between windows, in which case you could use a rolling historical window. Secondly, I suggest you look at some papers that discuss transformers for time series.

If you are looking for time series libraries that include the transformer, check out Flow Forecast or transformer time series prediction for actual examples of using the transformer for time series data.
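The mapping from (batch_size, 90, 100) to (batch_size, 90) described above can be sketched in plain PyTorch roughly as follows. This is only an illustration under stated assumptions, not code from either library: the class name PerStepClassifier, the learned positional embedding, and the sizes d_model=64, 4 heads, and 2 layers are all arbitrary choices of mine, and the inputs are random placeholders. It assumes a PyTorch version where nn.TransformerEncoderLayer accepts batch_first=True (1.9 or later).

```python
import torch
import torch.nn as nn


class PerStepClassifier(nn.Module):
    """Transformer encoder that emits one logit per time step (hypothetical name)."""

    def __init__(self, n_features=100, seq_len=90, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Project each 100-value vector to the model width (plays the role of a word embedding).
        self.input_proj = nn.Linear(n_features, d_model)
        # Learned positional embedding, one vector per second in the window (an assumption).
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per time step; no CLS or SEP tokens are involved.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch_size, seq_len, n_features)
        h = self.input_proj(x) + self.pos_emb
        h = self.encoder(h)              # (batch_size, seq_len, d_model)
        return self.head(h).squeeze(-1)  # (batch_size, seq_len)


model = PerStepClassifier()
x = torch.randn(8, 90, 100)               # placeholder batch of 8 windows
y = torch.randint(0, 2, (8, 90)).float()  # placeholder per-second labels

logits = model(x)
loss = nn.BCEWithLogitsLoss()(logits, y)
loss.backward()
```

Because every second gets its own output position, the per-second labels can be read directly off the encoder outputs, which is why no pooled CLS representation is needed for this task.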
