
There is a bug in PyTorch/NumPy where, when loading batches in parallel with a DataLoader (i.e. setting num_workers > 1), the same NumPy random seed is used for each worker, so any random functions applied are identical across parallelized batches. This can be resolved by passing a seed generator to the worker_init_fn argument, like so.

However, the issue persists across multiple epochs.

Minimal example:

import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 2)

    def __len__(self):
        return 4

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=1, 
                        num_workers=2, 
                        worker_init_fn = lambda x: np.random.seed(x))

for epoch in range(3):
    print(f'\nEpoch {epoch}')
    for batch in dataloader:
        print(batch)

As you can see, while parallelized batches within an epoch now produce different results, the results are identical across epochs:

Epoch 0
tensor([[684, 559]])
tensor([[ 37, 235]])
tensor([[629, 192]])
tensor([[908,  72]])

Epoch 1
tensor([[684, 559]])
tensor([[ 37, 235]])
tensor([[629, 192]])
tensor([[908,  72]])

Epoch 2
tensor([[684, 559]])
tensor([[ 37, 235]])
tensor([[629, 192]])
tensor([[908,  72]])

How can this behaviour be fixed?


Using an empty argument, e.g. worker_init_fn = lambda _: np.random.seed(), appears to fix this - are there any issues with this workaround?
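
For reference, that workaround plugged into the minimal example above would look like this (a sketch reusing the dataset and imports defined there):

dataloader = DataLoader(dataset, batch_size=1,
                        num_workers=2,
                        worker_init_fn=lambda _: np.random.seed())

Calling np.random.seed() with no argument reseeds each worker from fresh OS entropy, so each worker should get a different seed both across workers and across epochs (since the workers are re-created every epoch).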


2 Answers

2 votes

As stated in the blog post you linked, the code you wrote will still produce the same random numbers for each worker at every epoch:

Iterating over the dataset three times produces the same random numbers at each epoch. This happens because all changes to random states are local to each worker. By default, the worker processes are killed at the end of each epoch, and all worker resources are lost. At the same time, the random state in the main process hasn’t changed, and it’s used to initialize each worker process again.

The solution is given:

Therefore you need to change the NumPy’s seed at every epoch, for example by np.random.seed(initial_seed + epoch).
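
Applied to your minimal example, that could look something like this (a rough sketch: the DataLoader is rebuilt each epoch so that each worker's NumPy seed depends on the epoch as well as its worker id; initial_seed is an arbitrary constant chosen here for illustration):

import numpy as np
from torch.utils.data import DataLoader

initial_seed = 42  # arbitrary base seed, for illustration only

for epoch in range(3):
    # dataset is the RandomDataset from the question
    dataloader = DataLoader(
        dataset,
        batch_size=1,
        num_workers=2,
        worker_init_fn=lambda worker_id, epoch=epoch: np.random.seed(
            initial_seed + 1000 * epoch + worker_id
        ),
    )
    print(f'\nEpoch {epoch}')
    for batch in dataloader:
        print(batch)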

But I personally prefer to use only Torch's random functions instead of NumPy's to avoid these issues, since PyTorch handles randomness in parallel code by default.


Additional Note

According to the blog post:

PyTorch takes care of these by setting the [...] seeds to seed + worker_id automatically.

This means that using a PyTorch random function in your Dataset class or training loop should not replicate randomness across batches or epochs. For instance, the minimal example you wrote could be fixed like this:

import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.randint(0, 1000, (2,))

    def __len__(self):
        return 4

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=1, num_workers=2)

for epoch in range(3):
    print(f'\nEpoch {epoch}')
    for batch in dataloader:
        print(batch)

Most of the time NumPy can be replaced by Torch random (or Python's random). Here is another example with a random transformation for image segmentation:

import torch
import torchvision.transforms.functional as F

class RandomHorizontalFlip:

    def __init__(self, prob=0.5):
        self.prob = prob

    def __call__(self, input, target):
        # torch.rand draws from U[0, 1), so the flip is applied with probability `prob`
        if torch.rand(1).item() < self.prob:
            return F.hflip(input), F.hflip(target)
        else:
            return input, target
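
Such a transform could then be used from a dataset's __getitem__, for example (a hypothetical sketch; SegmentationDataset and its (image, mask) pairs are placeholders, not from your code):

import torch
from torch.utils.data import Dataset

class SegmentationDataset(Dataset):
    def __init__(self, pairs, transform=None):
        self.pairs = pairs          # list of (image, mask) tensor tuples
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        image, mask = self.pairs[index]
        if self.transform is not None:
            # the same random decision is applied to both image and mask
            image, mask = self.transform(image, mask)
        return image, mask

# e.g. dataset = SegmentationDataset(pairs, transform=RandomHorizontalFlip(prob=0.5))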
1 vote

The best way I can think of is to reuse the seed that PyTorch sets for each worker to seed NumPy and Python's random module:

import random
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

def worker_init_fn(worker_id):
    torch_seed = torch.initial_seed()
    random.seed(torch_seed + worker_id)
    if torch_seed >= 2**30:  # make sure torch_seed + worker_id < 2**32
        torch_seed = torch_seed % 2**30
    np.random.seed(torch_seed + worker_id)

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 2)

    def __len__(self):
        return 4

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=1, 
                        num_workers=2, 
                        worker_init_fn = worker_init_fn)

for epoch in range(3):
    print(f'\nEpoch {epoch}')
    for batch in dataloader:
        print(batch)

Output:

Epoch 0
tensor([[593, 191]])
tensor([[207, 469]])
tensor([[976, 714]])
tensor([[ 13, 119]])

Epoch 1
tensor([[836, 664]])
tensor([[138, 836]])
tensor([[409, 313]])
tensor([[  2, 221]])

Epoch 2
tensor([[269, 888]])
tensor([[315, 619]])
tensor([[892, 774]])
tensor([[ 70, 771]])

Alternatively, you can use int(time.time()) to seed NumPy and random inside worker_init_fn, assuming each epoch takes more than 1 second to run.
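
A minimal sketch of that time-based alternative (the function name time_based_worker_init_fn is just illustrative, not part of either library):

import random
import time

import numpy as np

def time_based_worker_init_fn(worker_id):
    # Seed from the current wall-clock time so every epoch (and every run)
    # starts from a different state; mix in worker_id so workers also differ.
    seed = (int(time.time()) + worker_id) % 2**32
    np.random.seed(seed)
    random.seed(seed)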