
I'm working on a project that requires training a PyTorch NN on a very large dataset of images. Some of these images are completely irrelevant to the problem, but they are not labelled as such. However, there are metrics I can use to work out whether an image is irrelevant (e.g. summing all the pixel values would give me a good sense of which images are relevant and which are not). What I would ideally like is a DataLoader that takes a Dataset class and creates batches containing only the relevant images. The Dataset would just know the list of images and their labels, and the DataLoader would decide whether each image it is about to batch is relevant or not, and would then build batches from relevant images only.

To apply this to an example, let's say I have a dataset of black and white images. The white images are irrelevant, but they are not labelled as such. I want to be able to load batches from a file location and have those batches contain only the black images. I could filter at some point by summing all the pixels and checking whether the sum equals 0.

What I am wondering is whether a custom Dataset, DataLoader, or Sampler could solve this for me. I have already written a custom Dataset that stores the directory of all the saved images and a list of the images in that directory, and returns an image with its label from its __getitem__ method. Is there something more I should add there to filter out certain images? Or should that filter be applied in a custom DataLoader or Sampler instead?
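
For context, this is roughly what my current Dataset looks like (the class name and the way labels are stored are simplified placeholders):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class SavedImageDataset(Dataset):
    def __init__(self, image_dir, labels, transform=None):
        # Directory of saved images plus a list of every file in it,
        # relevant or not; `labels` maps filename -> label.
        self.image_dir = image_dir
        self.image_names = sorted(os.listdir(image_dir))
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, idx):
        name = self.image_names[idx]
        image = Image.open(os.path.join(self.image_dir, name))
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[name]
```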

Thank you!

The DataLoader or Sampler just samples an index from your dataset. I would suggest changing the __getitem__ method of your custom Dataset class to add this functionality. But you have to make sure you return a valid item each time, i.e. if the index sent by the DataLoader points to an invalid image, you have to return another, valid image. – PranayModukuru
@PranayModukuru Thanks. So you're suggesting that the __getitem__ method should not return the item at an index if it is not valid, and should instead return a valid one? – Tate Andrew Keller
Yes, although as suggested in the answers it is better to do this once before training rather than at every step, to reduce compute. – PranayModukuru

2 Answers


I'm assuming that your image dataset belongs to two classes (0 or 1) but is unlabelled. As @PranayModukuru mentioned, you can determine the similarity by using some measure (e.g. aggregating all the pixel intensity values of an image, as you mentioned) in the __getitem__ function of your custom Dataset class.

However, determining the similarity in the __getitem__ function while training your model will make the training process very slow. So I would recommend approximating the similarity before training starts (not in the __getitem__ function). Moreover, if your image dataset is comprised of complex images (not just black and white images), it's better to use a pretrained deep learning model (e.g. a ResNet or an autoencoder) for dimensionality reduction, followed by a clustering approach (e.g. agglomerative clustering) to label your images.
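
A rough sketch of that second idea might look like this (the directory path, the choice of ResNet-18 and the number of clusters are assumptions on my side, adapt them to your data):

```python
import os
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import AgglomerativeClustering

image_dir = "data/images"  # placeholder path

# Pretrained ResNet with the classification head removed, used purely
# as a feature extractor for dimensionality reduction.
model = models.resnet18(pretrained=True)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = []
image_names = sorted(os.listdir(image_dir))
with torch.no_grad():
    for name in image_names:
        img = Image.open(os.path.join(image_dir, name)).convert("RGB")
        features.append(model(preprocess(img).unsqueeze(0)).squeeze(0).numpy())
features = np.stack(features)

# Two clusters: relevant vs. irrelevant. Which cluster is which has to be
# checked by inspecting a few images from each.
cluster_ids = AgglomerativeClustering(n_clusters=2).fit_predict(features)
```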

With the second approach you only need to label your images exactly once, and if you apply augmentation while training you don't need to re-determine the similarity (label) in the __getitem__ function. With the first approach, on the other hand, you would determine the similarity (label) every time (after applying transformations to the images) in the __getitem__ function, which is redundant, unnecessary and time consuming.

Hope this will help.


It sounds like your goal is to totally remove the irrelevant images from training.

The best way to deal with this would be to figure out the filenames of all the relevant images up front and save them to a CSV or similar. Then pass only the good filenames to your Dataset.

The reason is that you will run through your dataset multiple times during training. This means you would be loading, analysing and discarding the irrelevant images over and over again, which is a waste of compute.

It's better to do this sort of preprocessing/filtering once upfront.
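
A rough sketch of that workflow, using the black/white example from your question (the paths, CSV filename and relevance test are placeholders to swap for your own):

```python
import csv
import os
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

image_dir = "data/images"  # placeholder path

# One-off preprocessing pass: keep only the "black" images from the toy
# example, i.e. images whose pixel sum is 0. Swap in your own relevance metric.
relevant = []
for name in sorted(os.listdir(image_dir)):
    pixels = np.array(Image.open(os.path.join(image_dir, name)))
    if pixels.sum() == 0:
        relevant.append(name)

with open("relevant_images.csv", "w", newline="") as f:
    csv.writer(f).writerows([[name] for name in relevant])

# The Dataset then only ever sees the filtered filenames, so a plain
# DataLoader on top of it yields batches of relevant images only.
class FilteredImageDataset(Dataset):
    def __init__(self, image_dir, csv_path, transform=None):
        with open(csv_path) as f:
            self.image_names = [row[0] for row in csv.reader(f)]
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, idx):
        image = Image.open(os.path.join(self.image_dir, self.image_names[idx]))
        if self.transform is not None:
            image = self.transform(image)
        return image  # add your label lookup here as needed
```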