Neural network versus random forest performance discrepancy

Question

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.

The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems such as number of variables, maximum clause length etc. Data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward, 1000 rows, the final column is the rating, started off with some tens of input columns, with all the numbers scaled to the range 0-1, I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in Git history.

I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.

Simple linear regression on this data set has a mean squared error of about 0.14.

I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple hundred training epochs, also has an mean squared error of 0.14.

So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.

Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py

... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.

Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.

import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# data
df = pd.read_csv("test.csv")
print(df)
print()

# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)

# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)

# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)

# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500

# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x


model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# train
print("training")
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)

    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
print()

output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())

Random Forest can outperform neural nets! of course it can. even simple regression can outperform neural nets. if you don't have large dataset or many features than regression with random forest would certainly outperform neural nets — basilisk

basilisk basilisk · Accepted Answer · 2019-12-11T16:13:39

can you please print the shape of your input ? I would say check those things first:

that your target y have the shape (-1, 1) I don't know if pytorch throws an Error in this case. you can use y.reshape(-1, 1) if it isn't 2 dim
your learning rate is high. usually when using Adam the default value is good enough or try simply to lower your learning rate. 0.1 is a high value for a learning rate to start with
place the optimizer.zero_grad at the first line inside the for loop
normalize/standardize your data ( this is usually good for NNs )
remove outliers in your data (my opinion: I think this can't affect Random forest so much but it can affect NNs badly)
use cross validation (maybe skorch can help you here. It's a scikit learn wrapper for pytorch and easy to use if you know keras)

Notice that Random forest regressor or any other regressor can outperform neural nets in some cases. There is some fields where neural nets are the heros like Image Classification or NLP but you need to be aware that a simple regression algorithm can outperform them. Usually when your data is not big enough.

Neural network versus random forest performance discrepancy

1 Answers