I’d like to train a transformer encoder (e.g. BERT) on time-series data for a task that can be modeled as classification. Let me briefly describe the data I’m using before talking about the issue I’m facing.
I’m working with 90-second windows, and I have access to 100 values for each second (i.e. 90 vectors of size 100). My goal is to predict a binary label (0 or 1) for each second (i.e. produce a final vector of 0s and 1s of length 90).
My first idea was to model this as a multi-label classification problem, where I would use BERT to produce a vector of size 90 filled with numbers between 0 and 1, and train it with nn.BCELoss against the ground-truth labels (y_true looks like [0,0,0,1,1,1,0,0,1,1,1,0...,0]). A simple analogy would be to consider each second as a word, and the 100 values I have access to as the corresponding word embedding. The goal is then to train BERT (from scratch) on these sequences of 100-dim embeddings (all sequences have the same length: 90).
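To make the setup concrete, here is a minimal sketch of what I have in mind. It uses PyTorch's generic nn.TransformerEncoder as a stand-in for BERT, and the class name PerSecondClassifier, the hidden size (128), the number of heads and layers, and the learned positional embedding are all assumptions chosen for illustration. It also uses nn.BCEWithLogitsLoss, which applies the sigmoid internally, instead of a sigmoid followed by nn.BCELoss.

```python
import torch
import torch.nn as nn


class PerSecondClassifier(nn.Module):
    """Hypothetical encoder: one binary logit per second of the window."""

    def __init__(self, n_features=100, seq_len=90, d_model=128,
                 n_heads=4, n_layers=2):
        super().__init__()
        # Project each 100-dim "word embedding" to the model width.
        self.input_proj = nn.Linear(n_features, d_model)
        # Learned positional embedding (assumption: fixed length of 90).
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per time step.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, 90, 100)
        h = self.input_proj(x) + self.pos_emb
        h = self.encoder(h)
        return self.head(h).squeeze(-1)  # (batch, 90) raw logits


torch.manual_seed(0)
model = PerSecondClassifier()
x = torch.randn(8, 90, 100)                    # dummy batch of windows
y_true = torch.randint(0, 2, (8, 90)).float()  # dummy 0/1 labels per second

logits = model(x)
loss = nn.BCEWithLogitsLoss()(logits, y_true)
loss.backward()
```

In this sketch the 90 outputs come from the 90 per-second positions directly, so no CLS or SEP position is involved.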
The problem: when dealing with textual inputs, we simply add the CLS and SEP tokens to the input sequences, and let the tokenizer and the model do the rest of the job. When training directly on embeddings, what should we do to account for CLS and SEP tokens?
One idea I had was to add a 100-dim embedding at position 0 standing for the CLS token, as well as a 100-dim embedding at position 90+1=91 standing for the SEP token. But I don’t know what embeddings I should use for these two tokens, and I’m not sure that’s a good solution either.
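For illustration, one way the idea above could be written is to treat the two special embeddings as trainable parameters and concatenate them around each sequence. This is only a sketch of the option I'm asking about, not something I know to be the right approach; the class name SpecialTokenWrapper and the 0.02 initialization scale are assumptions.

```python
import torch
import torch.nn as nn


class SpecialTokenWrapper(nn.Module):
    """Hypothetical helper: adds learned CLS/SEP vectors around a sequence."""

    def __init__(self, n_features=100):
        super().__init__()
        # Trainable 100-dim vectors, updated by backprop like any other weight.
        self.cls = nn.Parameter(torch.randn(1, 1, n_features) * 0.02)
        self.sep = nn.Parameter(torch.randn(1, 1, n_features) * 0.02)

    def forward(self, x):
        # x: (batch, 90, 100)
        batch = x.size(0)
        cls = self.cls.expand(batch, -1, -1)
        sep = self.sep.expand(batch, -1, -1)
        # CLS lands at position 0, SEP at position 91.
        return torch.cat([cls, x, sep], dim=1)  # (batch, 92, 100)


wrap = SpecialTokenWrapper()
x = torch.randn(8, 90, 100)
out = wrap(x)
```

With Hugging Face's BertModel, a tensor like this could be passed through the inputs_embeds argument instead of input_ids, provided the config's hidden_size matches the embedding width (100 here).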
Any ideas?
(I tried asking this question on Huggingface forums but didn't get any response.)