
About the input (sorry for the bad formatting): the sample below is printed in pairs of rows, where the first row of each pair holds the column names and the second row holds the values. 18~20_ride is the label and is not included in the input. Below is one sample; the training set consists of 400,000 of these.

bus_route_id    station_code    latitude    longitude   6~7_ride    
0               4270000         344         33.48990    126.49373
7~8_ride    8~9_ride    9~10_ride   10~11_ride  11~12_ride  6~7_takeoff  
0.0         1.0         2.0         5.0         2.0         6.0
7~8_takeoff 8~9_takeoff 9~10_takeoff    10~11_takeoff   11~12_takeoff    
0.0         0.0         0.0             0.0             0.0 
18~20_ride  weekday dis_jejusi  dis_seoquipo            
0.0         6       2.954920    26.256744

Example weights, captured at the 4th epoch. After 20 epochs of training I got much smaller values (e.g. -7e-44 or 1e-55):

 2.3937e-11, -2.6920e-12, -1.0445e-11,  ..., -1.0754e-11, 1.1128e-11, -1.4814e-11

The model's predictions and targets (note that the prediction is identical for every sample):

#Target
[2.],
[0.],
[0.]

#Prediction
[1.4187],
[1.4187],
[1.4187]

MyDataset.py

from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch
import os

class MyDataset(Dataset):
  def __init__(self, csv_filename):
    # Load the CSV, pop the 18~20_ride column off as the label,
    # and keep the remaining columns as the input features.
    self.dataset = pd.read_csv(csv_filename, index_col=0)
    self.labels = self.dataset.pop("18~20_ride")
    self.dataset = self.dataset.values
    self.labels = np.reshape(self.labels.values,(-1,1))

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    return self.dataset[idx], self.labels[idx]

Model

import torch.nn as nn  # not shown in the post; GELU is assumed to be defined elsewhere (e.g. nn.GELU)

class Network(nn.Module):
    def __init__(self, input_num):
        super(Network, self).__init__()
        self.fc1 = nn.Sequential(
          nn.Linear(input_num, 64),
          nn.BatchNorm1d(64),
          GELU()
        )

        self.fc2 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc3 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc4 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc5 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc6 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc7 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc8 = nn.Sequential(
          nn.Linear(64, 64),
          nn.BatchNorm1d(64),
          GELU()
        )
        self.fc9 = nn.Linear(64, 1)
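
The post does not show the forward method; a minimal sketch of what it presumably looks like, assuming the blocks are simply chained in order with fc9 producing the single regression output:

    def forward(self, x):
        # Assumed forward pass: run the input through each block in turn.
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        x = self.fc6(x)
        x = self.fc7(x)
        x = self.fc8(x)
        return self.fc9(x)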

The training and validation

def train(model, device, train_loader, optimizer, loss_fn, log_interval, epoch):
  print("Training")
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.float().to(device), target.float().to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % log_interval == 0:
        print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
            epoch, (batch_idx+1) * len(data), len(train_loader.dataset),
            100. * batch_idx / len(train_loader), loss.item()))

def validate(model, device, loader, loss_fn):
  print("\nValidating")
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for batch_idx, (data, target) in enumerate(loader):
      data, target = data.float().to(device), target.float().to(device)
      output = model(data)
      test_loss += loss_fn(output, target).item()  # sum up batch loss

  test_loss /= len(loader)

  print('Validation average loss: {:.4f}\n'.format(
      test_loss))
  return test_loss

Entire process of training and validation

import torch
import torch.optim as optim  # these two imports were omitted in the post
from MyDataset import MyDataset
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import StepLR
from datetime import datetime

train_dataset_path = "/content/drive/My Drive/root/bus/dataset/train.csv"
val_dataset_path = "/content/drive/My Drive/root/bus/dataset/val.csv"
model_base_path = "/content/drive/My Drive/root/bus/models/"

model_file = "/content/drive/My Drive/root/bus/models/checkpoints/1574427776.202017.pt"

"""
Training Config
"""
epochs = 10
batch_size = 32
learning_rate = 0.5

check_interval = 4

log_interval = int(40000/batch_size)
gamma = 0.1

load_model = False
save_model = True
make_checkpoint = True
"""
End of config
"""

# Read the train and validation sets
train_set = MyDataset(train_dataset_path)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_set = MyDataset(val_dataset_path)
val_loader = DataLoader(val_set, batch_size=1)
print("Data READY")

device = torch.device("cuda")
net = Network(19).float().to(device)
if load_model:
  net.load_state_dict(torch.load(model_file))
loss_fn = torch.nn.MSELoss()
optimizer = optim.AdamW(net.parameters(), lr=learning_rate)

best_loss = float('inf')
isAbort = False
for epoch in range(1, epochs+1):
  train(net, device, train_loader, optimizer, loss_fn, log_interval, epoch)
  val_loss = validate(net, device, val_loader, loss_fn)
  if epoch%check_interval==0:
    if make_checkpoint:
      print("Saving new checkpoint")
      torch.save(net.state_dict(), model_base_path+"checkpoints/"+str(datetime.today().timestamp())+".pt")
      """
  if val_loss < best_loss and epoch%check_interval==0:
    best_loss = val_loss
    if make_checkpoint:
      print("Saving new checkpoint")
      torch.save(net.state_dict(), model_base_path+"checkpoints/"+str(datetime.today().timestamp())+".pt")
  else:
    print("Model overfit detected. Aborting training")
    isAbort = True
    break
    """
if save_model and not isAbort:
    torch.save(net.state_dict(), model_base_path+"finals/"+str(datetime.today().timestamp())+".pt")
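
A side note on the config above: StepLR and gamma are defined but no scheduler is ever created, so the learning rate stays fixed at 0.5 for the whole run. If a decay schedule was intended, a minimal sketch would be the following (using check_interval as step_size is my assumption, not something from the post):

# Hypothetical wiring of the already-imported StepLR scheduler.
scheduler = StepLR(optimizer, step_size=check_interval, gamma=gamma)

# ...then, inside the epoch loop, after validate(...):
scheduler.step()  # decays the learning rate by gamma every step_size epochs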

I tried to train a fully connected model for a regression problem on Google Colab, but it did not train well; the loss simply did not decrease. I dug in and found that the weights were really small. Any idea why this is happening and how I could avoid it? I used MSE for the loss and the AdamW optimizer. Below are the things I have tried:

  1. Tried other architectures (changed the number and sizes of layers, changed the activation function between ReLU and GELU), but the loss did not decrease.
  2. Tried changing the learning rate from 3e-1 to 1e-3; even tried 1.
  3. Tried other pre-processing for the data (used day/month/year instead of weekday).
  4. Fed the label itself as part of the input data, but the loss still did not decrease.
  5. Tried different batch sizes (4, 10, 32, 64).
  6. Removed batch normalization.
  7. Tried other optimizers such as SGD and Adam.
  8. Trained for 20 epochs, but the loss did not decrease at all.
  9. Verified that the weights do change when loss.backward() and optimizer.step() run (see the diagnostic sketch below).
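
For point 9, a minimal diagnostic sketch that can be called inside train() right after each batch is loaded (the helper name check_batch is mine, not from the original code). It flags NaN values in the batch and prints the total parameter norm so collapsing weights become visible over time:

def check_batch(model, data, target):
    # Flag NaN values in the inputs or targets before the forward pass.
    if torch.isnan(data).any() or torch.isnan(target).any():
        print("NaN found in this batch!")
    # Report the total parameter norm; steadily shrinking values point to
    # corrupted losses/gradients rather than a modelling problem.
    with torch.no_grad():
        total_norm = sum(p.norm().item() for p in model.parameters())
    print("Sum of parameter norms:", total_norm)

A single NaN in a batch makes the MSE loss NaN and corrupts every subsequent update, which is consistent with the answer below.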
This is quite unexpected. Could you please share some more details? For instance, when did you check these weights: right before starting the training, or during training? These details would help us narrow down your problem. – Shagun Sodhani
@ShagunSodhani The loss did not decrease, so I decided to stop the training and check what was going on. The weights were captured after 4 epochs of training. – Inyoung Kim 김인영
How many data samples do you have? What is the distribution of your classes? Have you trained longer than just 4 epochs? What does "other data pre-processing" include, and what are the current steps? What different architectures have you tried? Please take a look at minimal reproducible example and include all necessary information, including data samples. – dennlinger
@dennlinger I have included more information. Thank you. – Inyoung Kim 김인영

1 Answer


TL;DR: Invalid input data!! Check for NaN or NULL

Well, it has been some time since I asked the question. I tried almost everything and thought maybe I had messed up the project setup, so I deleted the project and tried again: same result. I deleted it again and migrated to TF2: the same result! That told me there was nothing wrong with the setup, so I looked elsewhere. In the end I did find the reason. I had modified the input columns myself (to remove some highly correlated features), so they were not the original data. During that modification I corrupted some float values, and the dataset ended up containing NaN values. So check whether your dataset contains invalid values.
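
For anyone landing here with the same symptom, a minimal sketch of the check (assuming the same CSV layout as MyDataset; the path is illustrative):

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv", index_col=0)

# Per-column NaN counts; any non-zero entry means corrupted rows slipped in.
print(df.isnull().sum())

# Also catch infinities and cells that are not numeric at all.
values = df.apply(pd.to_numeric, errors="coerce").values
print("All values finite:", np.isfinite(values).all())

# One possible fix: drop the offending rows before training.
df = df.dropna()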