I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems such as number of variables, maximum clause length etc. Data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward, 1000 rows, the final column is the rating, started off with some tens of input columns, with all the numbers scaled to the range 0-1, I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple hundred training epochs, also has an mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
print(df)
print()
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
self.L0 = nn.Linear(in_features, hidden_size)
self.N0 = nn.ReLU()
self.L1 = nn.Linear(hidden_size, hidden_size)
self.N1 = nn.Tanh()
self.L2 = nn.Linear(hidden_size, hidden_size)
self.N2 = nn.ReLU()
self.L3 = nn.Linear(hidden_size, 1)
def forward(self, x):
x = self.L0(x)
x = self.N0(x)
x = self.L1(x)
x = self.N1(x)
x = self.L2(x)
x = self.N2(x)
x = self.L3(x)
return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
print("training")
for epoch in range(1, epochs + 1):
# forward
output = model(X_tensor)
cost = criterion(output, y_tensor)
# backward
optimizer.zero_grad()
cost.backward()
optimizer.step()
# print progress
if epoch % (epochs // 10) == 0:
print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())