
When I create a PyTorch DataLoader and start iterating, I get an extremely slow first epoch (10x–30x slower than all subsequent epochs). Moreover, the problem occurs only with the train dataset from the Google Landmark Recognition 2020 competition on Kaggle. I can't reproduce it on synthetic images; I also tried creating a folder with 500k images from GLR2020, and everything worked well. I found a few similar problems on the PyTorch forum, but without any solutions.

import os
import time

import cv2
import numpy as np
import pandas as pd
import albumentations as albu
from torch.utils.data import Dataset, DataLoader

samples = 50000  # number of samples, to speed up the test
bs = 64          # batch size
data_dir = '/hdd0/datasets/ggl_landmark_recognition_2020/train'  # directory with train data
all_files = pd.read_csv('/hdd0/datasets/ggl_landmark_recognition_2020/train.csv')
files = np.random.choice(all_files.id.values, samples)
# GLR2020 stores each image under nested directories named after
# the first three characters of its id
files = [os.path.join(f[0], f[1], f[2], f + '.jpg') for f in files]

# augmentations
aug = albu.Compose([albu.Resize(400, 400),
                    albu.Rotate(limit=15),
                    albu.ChannelDropout(p=0.1),
                    albu.Normalize()])

class ImgDataset(Dataset):
    def __init__(self, path, files, augmentation=None):
        self.path = path
        self.files = {k:v for k, v in enumerate(files)}
        self.augmentation = augmentation

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img_name = self.files[idx]
        img = cv2.imread(os.path.join(self.path, img_name))  # already an ndarray
        if self.augmentation is not None:
            return self.augmentation(image=img)['image']
        return img  # fall back to the raw image when no augmentation is given


dtset = ImgDataset(data_dir, files, aug)
torchloader = DataLoader(dataset=dtset, batch_size=bs, num_workers=16, shuffle=True)
for _ in range(3):
    t1 = time.time()
    for idx, val in enumerate(torchloader):
        pass
    t2 = time.time()
    print(str(t2 - t1) + ' sec')

Here are some examples of execution speed with different num_workers values in the DataLoader:

#num_workers=0
273.1584792137146 sec
83.15653467178345 sec
83.67923021316528 sec

# num_workers = 8 
165.62366938591003 sec
10.405716896057129 sec
10.495309114456177 sec

# num_workers = 16
156.60744667053223 sec
8.051618099212646 sec
7.922858238220215 sec

It looks like the problem is not with the DataLoader but with the dataset. When I delete and reinitialise the DataLoader object after the first "long" iteration, everything still works fine. When I reinitialise the dataset, the long first iteration appears again. Moreover, I tracked my CPU utilisation via htop during these epochs with num_workers set to 32: during the first epoch, utilisation is really low, with only 1-2 of 32 cores working, while during the other epochs nearly all cores are working.

Maybe you can check how long self.files = {k:v for k, v in enumerate(files)} takes? – hkchengrex

@hkchengrex I checked, of course. That line is in the __init__ method, so it takes time when the class instance is created, not during iteration. – Slavka

I have observed a similar situation with my own datasets (although not as pronounced a difference); I've chalked it up to the operating system caching data in RAM, which makes subsequent reads faster. What happens if you clear the cached RAM with sync; echo 3 > /proc/sys/vm/drop_caches (on Ubuntu) after completing the first epoch? (tecmint.com/… says that running this won't wreck any running process) – Multihunter

The fact that the CPU utilisation is low for the first epoch tells us that it is almost certainly disk IO that is the bottleneck. The question is what exactly is happening. Can you describe your hardware setup? Is your data on an HDD while your operating system is on an SSD? It's not pointing to a drive on the local network or something, is it? – Multihunter

2 Answers

Answer 1 (11 votes)

Slavka,

I did not download the whole GLR2020 dataset, but I was able to observe this effect on an image dataset I had locally (80,000 JPG images of approximately 400x400 pixels).

To find the reasons for the difference in performance, I tried the following:

  1. reducing the augmentation to just resizing
  2. timing only the ImgDataset.__getitem__() function
  3. ImgDataset.__getitem__() without augmentation
  4. just loading the raw JPG image and passing it from the dataset without even the NumPy conversion

It turns out that the difference comes from the image-loading time. Python (or the OS itself) implements some kind of caching, which can be observed when loading the same image multiple times in the following test:

import time
import cv2

filename = 'example.jpg'  # placeholder: path to any image from the dataset
for i in range(5):
    t0 = time.time()
    data = cv2.imread(filename)
    print(time.time() - t0)
    
0.03395271301269531
0.0010004043579101562
0.0010004043579101562
0.0010008811950683594
0.001001119613647461

The same is observed when just reading the file into a variable:

for i in range(5):
    t0 = time.time()
    with open(filename, mode='rb') as file:
        data = file.read()
    print(time.time() - t0)

0.036234378814697266
0.0028831958770751953
0.0020024776458740234
0.0031833648681640625
0.0028734207153320312

One way to speed up loading is to keep the data on a very fast local SSD. If size allows, try loading part of the dataset into RAM and writing a custom dataloader to feed from there...
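
For illustration, here is a minimal sketch of that idea, assuming the chosen subset of images fits in RAM; the CachedImgDataset name and the load-everything-up-front strategy are my own, not part of the original answer.

import os

import cv2
from torch.utils.data import Dataset

class CachedImgDataset(Dataset):
    """Hypothetical dataset that decodes every image into RAM once,
    so epochs after the first pay no disk-IO cost."""

    def __init__(self, path, files, augmentation=None):
        self.augmentation = augmentation
        # eager, disk-bound step, paid exactly once at construction
        self.images = [cv2.imread(os.path.join(path, f)) for f in files]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.augmentation is not None:
            return self.augmentation(image=img)['image']
        return img

Note that decoded images take far more memory than the JPG files on disk (roughly 400x400x3 bytes each here), so caching the raw JPG bytes and decoding them in __getitem__ is a middle ground if RAM is tight.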

BTW, based on my findings this effect should be reproducible with any dataset; check whether your other tests used different drives or benefited from caching.

Answer 2 (2 votes)

It appears that the OS is caching IO access to the dataset. To check whether this is really the problem, try running sync; echo 3 > /proc/sys/vm/drop_caches (on Ubuntu) after the first epoch. If the second epoch is then equally slow, it is the caching that makes the subsequent reads so much faster.
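
To make the check concrete, here is a minimal sketch of that experiment in Python, reusing the torchloader loop from the question; it assumes the script runs as root, since writing to /proc/sys/vm/drop_caches requires it.

import subprocess
import time

for epoch in range(3):
    t1 = time.time()
    for batch in torchloader:
        pass
    print(f'epoch {epoch}: {time.time() - t1:.1f} sec')
    # flush the page cache between epochs (assumes root privileges);
    # if every epoch is now as slow as the first, the OS page cache
    # was what made the later epochs fast
    subprocess.run('sync; echo 3 > /proc/sys/vm/drop_caches',
                   shell=True, check=True)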

If you are using an HDD, then you may get significant speed improvements for your first epoch by co-locating all of your small image files on disk.

You can use SquashFS (it comes pre-installed with Ubuntu) to compress your whole dataset into a single file, then mount that file as a directory and access it just as before (except that now the images are co-located on disk). The mounted directory is read-only.

e.g.

mksquashfs /path/to/data data.sqsh                       # pack the dataset into one file
mount data.sqsh /path/to/data_sqsh -t squashfs -o loop   # mount it as a read-only directory

Then you can use /path/to/data_sqsh in precisely the same way you used /path/to/data. You will have to re-mount it when you restart your computer.
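
If you want the mount to survive reboots, one option (my own suggestion, not part of the original answer) is an /etc/fstab entry along these lines, using an absolute path to the squashfs file:

# hypothetical /etc/fstab entry: mount data.sqsh automatically at boot
/path/to/data.sqsh  /path/to/data_sqsh  squashfs  loop,ro  0  0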

See: https://tldp.org/HOWTO/SquashFS-HOWTO/creatingandusing.html