Using Logistic Regression For Timeseries Data in Amazon SageMaker

Question

For a project I am working on, which uses annual financial report data (in multiple categories) from companies which have either been successful or gone bust/into liquidation, I previously created a (fairly well-performing) model on AWS SageMaker using a linear classification algorithm (specifically, the AWS stock algorithm for logistic regression/classification problems, the 'Linear Learner' algorithm).

This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.

query input: {data:[{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000 
}]}

inference output: "in good health" / "in bad health"

I trained this model by ignoring which year each company's values were from and piling in all of the annual financial report data (i.e. one year's financial data for one company = one input line) for training, along with the label "good" or "bad". A good company is one which has existed for a while and hasn't gone bust; a bad company is one which was found to have eventually gone bust; e.g.:

label	Gross Revenue	Balance Sheet	Creditors	Debts
good	10000	20000	0	0
bad	0	5	100	10000
bad	20000	0	4	100000000

I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.

I would like to use the same features as before as input (gross revenue, balance sheet...) but over multiple years; e.g. take the values from 2020 & 2019 and use these (along with the eventual company status of "good" or "bad") as a single input for my new model. However, I'm unsure of the following:

1.) Is this an inappropriate use of logistic regression? i.e. is there a more suitable machine learning algorithm I should consider?
2.) Is it fine, or terribly wrong, to just use the same technique as before, but combine the data for both years into one input line like:

label	Gross Revenue(2019)	Balance Sheet(2019)	Creditors(2019)	Debts(2019)	Gross Revenue(2020)	Balance Sheet(2020)	Creditors(2020)	Debts(2020)
good	10000	20000	0	0	30000	10000	40	500
bad	100	50	200	50000	100	5	100	10000
bad	5000	0	2000	800000	2000	0	4	100000000

I would personally expect that a company which has gotten worse over time (i.e. the company's finances are worse in 2020 than in 2019) should be more likely to be found to be "bad"/likely to go bust, so I would hope that, if I feed in data as in the above example (i.e. earlier years' data comes before later years' data on an input line), my training job ends up creating a model which gives greater weighting to the earlier years' data when making predictions.

Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more

UPDATE:

Using Long Short-Term Memory Recurrent Neural Networks (LSTM RNN) is one potential route I think I could try, but these seem to commonly be used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to use the data I have for all the companies, over the few dates' worth of data there are, in training.

Please notice that SO is about programming issues; questions on ML theory/methodology are off-topic here. Please notice the intro & NOTE in the machine-learning tag info. — desertnaut

Patrick Bormann · Accepted Answer · 2021-02-27T18:21:16

I once developed a so-called Genetic Time Series in R. I used a genetic algorithm which sorted out the best solutions from multivariate data, which were fitted on a VAR in differences or a VECM. Your data seems more macroeconomic or financial than user-centric, so VAR or VECM seems appropriate. (Surely it is also possible to treat the time-series data in a way that lets us use LSTM or other approaches, but those are the very common ones.) However, I do not know whether VAR in differences or VECM works with binary class labels. Perhaps if you calculate a metric outcome, which you later label-encode into a categorical feature (or label it as categorical first), then VAR or VECM may also be appropriate.

However, you may combine all yearly data points into one data point per firm to forecast its survival, but you would lose a lot of insight. If you are interested in time series ML, which works a little differently than neural networks or elastic net (which could also be used with time series), let me know and we can work something out. Or I'll paste you some sources.
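A minimal sketch of what combining the yearly data points into one data point per firm could look like, assuming the reports start out as one record per company per year. The `company` and `year` keys, the `to_wide` helper and the extra "(change)" columns are hypothetical and not part of the question's data; the change columns are just one possible way to keep some of the year-over-year information that the flat layout would otherwise leave implicit.

```python
from collections import defaultdict

FEATURES = ["Gross Revenue", "Balance Sheet", "Creditors", "Debts"]

# Hypothetical long-format records: one dict per company per year.
long_rows = [
    {"company": "A", "year": 2019, "label": "good",
     "Gross Revenue": 10000, "Balance Sheet": 20000, "Creditors": 0, "Debts": 0},
    {"company": "A", "year": 2020, "label": "good",
     "Gross Revenue": 30000, "Balance Sheet": 10000, "Creditors": 40, "Debts": 500},
    {"company": "B", "year": 2019, "label": "bad",
     "Gross Revenue": 100, "Balance Sheet": 50, "Creditors": 200, "Debts": 50000},
    {"company": "B", "year": 2020, "label": "bad",
     "Gross Revenue": 100, "Balance Sheet": 5, "Creditors": 100, "Debts": 10000},
]


def to_wide(rows, years=(2019, 2020)):
    """Return one row per company with per-year columns and change columns."""
    by_company = defaultdict(dict)
    labels = {}
    for row in rows:
        by_company[row["company"]][row["year"]] = row
        labels[row["company"]] = row["label"]

    first, last = years[0], years[-1]
    wide = []
    for company, per_year in sorted(by_company.items()):
        # Skip companies that do not have a report for every requested year.
        if any(year not in per_year for year in years):
            continue
        out = {"company": company, "label": labels[company]}
        for year in years:
            for feature in FEATURES:
                out[f"{feature}({year})"] = per_year[year][feature]
        # Change from the first to the last year, as an explicit trend feature.
        for feature in FEATURES:
            out[f"{feature}(change)"] = per_year[last][feature] - per_year[first][feature]
        wide.append(out)
    return wide


wide_rows = to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each resulting row could then be written out as one training line, with the `company` key dropped before training.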

Summary: 1.) It is possible to use LSTM or elastic net (time points may be dummies or treated as a cross-sectional panel), or you can use VAR in differences or VECM with a slightly different outcome variable.

2.) It is possible, but you will lose information over time.
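As a rough illustration of the "time points may be dummies" option in point 1, the sketch below keeps one row per company per year and adds a 0/1 column for the year. The record layout, the `add_year_dummies` helper and the choice of 2019 as the baseline year are assumptions made for this example; dropping one year as the baseline is a common way to avoid a redundant column when the model also fits an intercept.

```python
# Hypothetical long-format records: one dict per company per year.
records = [
    {"company": "A", "year": 2019, "label": "good", "Gross Revenue": 10000, "Debts": 0},
    {"company": "A", "year": 2020, "label": "good", "Gross Revenue": 30000, "Debts": 500},
    {"company": "B", "year": 2019, "label": "bad", "Gross Revenue": 100, "Debts": 50000},
    {"company": "B", "year": 2020, "label": "bad", "Gross Revenue": 100, "Debts": 10000},
]


def add_year_dummies(rows, years, baseline):
    """Replace the 'year' value with 0/1 columns for every non-baseline year."""
    panel = []
    for row in rows:
        out = {key: value for key, value in row.items() if key != "year"}
        for year in years:
            if year == baseline:
                continue  # the baseline year is represented by all dummies being 0
            out[f"year_{year}"] = 1 if row["year"] == year else 0
        panel.append(out)
    return panel


panel_rows = add_year_dummies(records, years=(2019, 2020), baseline=2019)
for row in panel_rows:
    print(row)
```

In this layout every company-year stays a separate training line, so no report is merged away, but the rows of one company are no longer linked to each other unless the company identity is modelled as well.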

All the best, Patrick
