When I create a PyTorch DataLoader and start iterating, I get an extremely slow first epoch (10x--30x slower than all subsequent epochs). Moreover, this problem occurs only with the train dataset from the Google Landmark Recognition 2020 competition on Kaggle; I can't reproduce it on synthetic images, and I also tried creating a folder with 500k images from GLR2020, and everything worked well. I found a few similar problems on the PyTorch forum, but without any solutions.
import argparse
import pandas as pd
import numpy as np
import os, sys
import multiprocessing, ray
import time
import cv2
import logging
import albumentations as albu
from torch.utils.data import Dataset, DataLoader
samples = 50000 # count of samples to speed up test
bs = 64 # batch size
dir = '/hdd0/datasets/ggl_landmark_recognition_2020/train' # directory with train data
all_files = pd.read_csv('/hdd0/datasets/ggl_landmark_recognition_2020/train.csv')
files = np.random.choice(all_files.id.values, samples)
# GLR2020 stores each image under nested directories named after the first three characters of its id
files = [os.path.join(f[0], f[1], f[2], f + '.jpg') for f in files]
# augmentations
aug = albu.Compose([albu.Resize(400, 400),
                    albu.Rotate(limit=15),
                    albu.ChannelDropout(p=0.1),
                    albu.Normalize()])
class ImgDataset(Dataset):
    def __init__(self, path, files, augmentation=None):
        self.path = path
        self.files = {k: v for k, v in enumerate(files)}
        self.augmentation = augmentation

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img_name = self.files[idx]
        img = np.array(cv2.imread(os.path.join(self.path, img_name)))
        if self.augmentation is not None:
            return self.augmentation(image=img)['image']
        return img
dtset = ImgDataset(dir, files, aug)
torchloader = DataLoader(dataset=dtset, batch_size=bs, num_workers=16, shuffle=True)
for _ in range(3):
    t1 = time.time()
    for idx, val in enumerate(torchloader):
        pass
    t2 = time.time()
    print(str(t2 - t1) + ' sec')
Here are some examples of execution speed with different num_workers in DataLoader (a rough sketch of how this sweep can be scripted follows the numbers):
#num_workers=0
273.1584792137146 sec
83.15653467178345 sec
83.67923021316528 sec
# num_workers = 8
165.62366938591003 sec
10.405716896057129 sec
10.495309114456177 sec
# num_workers = 16
156.60744667053223 sec
8.051618099212646 sec
7.922858238220215 sec
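The sweep itself can be scripted roughly like this (a minimal sketch only; in practice each num_workers value has to be measured from a fresh run, since the slow epoch only shows up on the first pass over the data):

# Sketch: time three epochs for a given worker count (run once per setting).
def time_epochs(workers, epochs=3):
    loader = DataLoader(dataset=dtset, batch_size=bs, num_workers=workers, shuffle=True)
    print(f'# num_workers = {workers}')
    for _ in range(epochs):
        t1 = time.time()
        for batch in loader:
            pass
        print(str(time.time() - t1) + ' sec')

time_epochs(16)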
It looks like the problem is not with the DataLoader, but with the dataset. When I delete and reinitialise the DataLoader object after the first "long" iteration, everything still works fine. When I reinitialise the dataset, the long first iteration appears again.
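Roughly, that check looked like the following sketch (variable names as in the script above; in the second part the dataset is rebuilt the same way as in the setup, including re-sampling the file list):

# Sketch: recreate only the DataLoader after the first long epoch.
torchloader = DataLoader(dataset=dtset, batch_size=bs, num_workers=16, shuffle=True)
t1 = time.time()
for batch in torchloader:
    pass
print('fresh DataLoader, same dataset:', time.time() - t1, 'sec')  # stays fast

# Sketch: rebuild the dataset itself (re-sampling the files as in the setup above).
files = np.random.choice(all_files.id.values, samples)
files = [os.path.join(f[0], f[1], f[2], f + '.jpg') for f in files]
dtset = ImgDataset(dir, files, aug)
torchloader = DataLoader(dataset=dtset, batch_size=bs, num_workers=16, shuffle=True)
t1 = time.time()
for batch in torchloader:
    pass
print('fresh dataset:', time.time() - t1, 'sec')  # long first epoch again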
Moreover, I tracked my CPU utilisation via htop during these epochs with num_workers set to 32: during the first epoch, utilisation is really low and only 1-2 of the 32 cores are working; during the other epochs, roughly all cores are working.
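For reference, a rough way to log the same thing programmatically instead of watching htop (this assumes psutil is installed; it is not part of the script above):

import threading
import psutil  # assumption: extra dependency, only used here for logging utilisation

def log_cpu(stop_event, interval=5):
    # Periodically print how many cores are busy while an epoch runs.
    while not stop_event.is_set():
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        busy = sum(1 for c in per_core if c > 50)
        print(f'busy cores (>50%): {busy}/{len(per_core)}')

stop = threading.Event()
threading.Thread(target=log_cpu, args=(stop,), daemon=True).start()
for batch in torchloader:  # one epoch with the loader defined above
    pass
stop.set()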
Comments:
How long does self.files = {k:v for k, v in enumerate(files)} take? – hkchengrex
Could you try running sync; echo 3 > /proc/sys/vm/drop_caches (on Ubuntu) after completing the first epoch? (tecmint.com/… says that running this won't wreck any running process) – Multihunter
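To reproduce Multihunter's check, the page cache can be dropped from inside the timing loop roughly like this (a sketch only; Linux-specific and needs root):

import subprocess

for epoch in range(3):
    # Drop the OS page cache so every epoch reads the images from disk again.
    subprocess.run('sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null',
                   shell=True, check=True)
    t1 = time.time()
    for batch in torchloader:
        pass
    print(f'epoch {epoch} with cold cache: {time.time() - t1:.1f} sec')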