For a project I am working on, which uses annual financial report data (across multiple categories) from companies that have either been successful or gone bust/into liquidation, I previously created a (fairly well performing) model on AWS SageMaker using a linear classification algorithm (specifically, AWS's built-in 'Linear Learner' algorithm, used here for logistic regression/binary classification).
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input: {data:[{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by ignoring which year each company's values were from and piling in all of the annual financial report data (i.e. one year's financial data for one company = one input line), along with the label "good" or "bad". A good company is one which has existed for a while and hasn't gone bust; a bad company is one which was found to have eventually gone bust; e.g.:
label | Gross Revenue | Balance Sheet | Creditors | Debts |
---|---|---|---|---|
good | 10000 | 20000 | 0 | 0 |
bad | 0 | 5 | 100 | 10000 |
bad | 20000 | 0 | 4 | 100000000 |
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before as input (gross revenue, balance sheet...) but over multiple years; e.g. take the values from 2020 and 2019 and use these (along with the eventual company status of "good" or "bad") as a single input line for my new model. However, I'm unsure of the following:
- is this an inappropriate use of logistic regression? i.e. is there a more suitable machine learning algorithm I should consider?
- is it fine, or terribly wrong to try and just use the same technique as before, but combine the data for both years into one input line like:
label | Gross Revenue(2019) | Balance Sheet(2019) | Creditors(2019) | Debts(2019) | Gross Revenue(2020) | Balance Sheet(2020) | Creditors(2020) | Debts(2020) |
---|---|---|---|---|---|---|---|---|
good | 10000 | 20000 | 0 | 0 | 30000 | 10000 | 40 | 500 |
bad | 100 | 50 | 200 | 50000 | 100 | 5 | 100 | 10000 |
bad | 5000 | 0 | 2000 | 800000 | 2000 | 0 | 4 | 100000000 |
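To make the combined layout above concrete, here is a minimal sketch of how per-year records could be merged into one row per company. The record keys (`company`, `year`, `label`), the function name `to_wide_rows` and the extra `(change)` columns are assumptions made for illustration, not part of the original setup. Note that for a purely linear model the change columns are just a linear combination of the two year columns, so they add readability rather than new information.

```python
from collections import defaultdict

FEATURES = ["Gross Revenue", "Balance Sheet", "Creditors", "Debts"]

def to_wide_rows(records, years):
    """Combine per-year records into one row per company.

    `records` is a list of dicts with the hypothetical keys "company",
    "year" and "label", plus one key per feature name above. Companies
    missing any of the requested years are skipped.
    """
    by_company = defaultdict(dict)
    labels = {}
    for rec in records:
        by_company[rec["company"]][rec["year"]] = rec
        labels[rec["company"]] = rec["label"]

    rows = []
    for company, per_year in by_company.items():
        if not all(y in per_year for y in years):
            continue
        row = {"label": labels[company]}
        # One column per (feature, year), e.g. "Debts(2019)".
        for y in years:
            for f in FEATURES:
                row[f"{f}({y})"] = per_year[y][f]
        # Assumed extra columns: change from the first to the last year.
        first, last = years[0], years[-1]
        for f in FEATURES:
            row[f"{f}(change)"] = per_year[last][f] - per_year[first][f]
        rows.append(row)
    return rows

# Values taken from the first two rows of the table above.
records = [
    {"company": "A", "year": 2019, "label": "good",
     "Gross Revenue": 10000, "Balance Sheet": 20000, "Creditors": 0, "Debts": 0},
    {"company": "A", "year": 2020, "label": "good",
     "Gross Revenue": 30000, "Balance Sheet": 10000, "Creditors": 40, "Debts": 500},
    {"company": "B", "year": 2019, "label": "bad",
     "Gross Revenue": 100, "Balance Sheet": 50, "Creditors": 200, "Debts": 50000},
    {"company": "B", "year": 2020, "label": "bad",
     "Gross Revenue": 100, "Balance Sheet": 5, "Creditors": 100, "Debts": 10000},
]

wide = to_wide_rows(records, [2019, 2020])
for row in wide:
    print(row)
```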
I would personally expect that a company which has gotten worse over time (i.e. the company's finances are worse in 2020 than in 2019) should be more likely to be found to be "bad"/likely to go bust. So I would hope that, if I feed in data like in the above example (i.e. the earlier year's data comes before the later year's data on an input line), my training job ends up creating a model which gives greater weighting to the earlier year's data when making predictions.
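One way to check whether the position of a column on the input line matters is a small experiment. The sketch below is a hand-written logistic regression trained by batch gradient descent on made-up, pre-scaled numbers; it is not the Linear Learner implementation, and the function name `train_logistic` and the toy data are assumptions. In this sketch, swapping the two columns simply swaps the learned weights, which suggests that a linear model weights each column according to the data rather than its position.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent for logistic regression, starting from zero weights."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "good"
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Made-up, already scaled values: [value_2019, value_2020]; 1 = good, 0 = bad.
X = [[1.0, 3.0], [0.1, 0.1], [0.5, 0.2], [2.0, 2.5]]
y = [1, 0, 0, 1]

# Same data with the year columns in the opposite order.
X_swapped = [[later, earlier] for earlier, later in X]

w, b = train_logistic(X, y)
w_swapped, b_swapped = train_logistic(X_swapped, y)

print("original order:", w, b)
print("swapped order: ", w_swapped, b_swapped)
```

If trend information is what matters, it probably has to come from the values themselves (or from derived features), not from where each year sits on the line.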
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more
UPDATE:
Using a Long Short-Term Memory recurrent neural network (LSTM RNN) is one potential route I think I could try, but this seems to be commonly used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to use the data I have for all the companies, over the few dates' worth of data there are, in training.
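For reference, a sequence model would typically expect each company as an ordered list of time steps rather than one flat line. The sketch below only shows that data arrangement, with zero pre-padding and a mask for companies that have fewer years; it does not build or train an LSTM. The `history` structure, the function name `to_sequences` and the padding scheme are assumptions for illustration.

```python
FEATURES = ["Gross Revenue", "Balance Sheet", "Creditors", "Debts"]

def to_sequences(history, max_steps):
    """Arrange each company's yearly data as [time step][feature].

    `history` maps a company name to a list of (year, {feature: value})
    pairs in any order. Shorter histories are padded at the front with
    zeros; the mask marks real steps with 1 and padding with 0.
    """
    sequences, masks = {}, {}
    for company, yearly in history.items():
        # Oldest first, keeping only the most recent `max_steps` years.
        ordered = sorted(yearly, key=lambda item: item[0])[-max_steps:]
        steps = [[values[f] for f in FEATURES] for _, values in ordered]
        pad = max_steps - len(steps)
        padding = [[0] * len(FEATURES) for _ in range(pad)]
        sequences[company] = padding + steps
        masks[company] = [0] * pad + [1] * len(steps)
    return sequences, masks

history = {
    "A": [
        (2020, {"Gross Revenue": 30000, "Balance Sheet": 10000, "Creditors": 40, "Debts": 500}),
        (2019, {"Gross Revenue": 10000, "Balance Sheet": 20000, "Creditors": 0, "Debts": 0}),
    ],
}

sequences, masks = to_sequences(history, max_steps=3)
print(sequences["A"])
print(masks["A"])
```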