I'm working on a project that requires training a PyTorch neural network on a very large dataset of images. Some of these images are completely irrelevant to the problem, but they are not labelled as such. However, there are metrics I can compute to decide whether an image is irrelevant (e.g. summing all the pixel values gives a good sense of which images are relevant and which are not). Ideally, I would like a DataLoader that takes a Dataset and creates batches containing only the relevant images. The Dataset would just know the list of images and their labels, and the DataLoader would check whether each image it is about to put in a batch is relevant, skipping the ones that are not.
To make this concrete, let's say I have a dataset of black and white images. The white images are irrelevant, but they are not labelled as such. I want to load batches from a file location and have those batches contain only the black images. I could filter at some point by summing all the pixels of an image and checking whether the sum equals 0.
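To show the kind of filtering I have in mind, here is a rough sketch that drops irrelevant images inside a custom collate function. It is only an illustration under assumptions: `ToyImages`, `is_relevant` and `collate_relevant` are hypothetical names, the images are small in-memory tensors rather than files on disk, and "relevant" means the pixel sum is exactly 0.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate


class ToyImages(Dataset):
    """Hypothetical stand-in for the real dataset: black (all 0) and white (all 1) images."""

    def __init__(self):
        black = torch.zeros(1, 4, 4)
        white = torch.ones(1, 4, 4)
        # (image, label) pairs; the labels say nothing about relevance.
        self.samples = [(black, 0), (white, 0), (black, 0),
                        (white, 0), (black, 0), (black, 0)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def is_relevant(image):
    # Assumed rule from the example above: an all-black image sums to 0.
    return image.sum().item() == 0


def collate_relevant(batch):
    # Keep only the relevant (image, label) pairs, then collate as usual.
    kept = [(img, label) for img, label in batch if is_relevant(img)]
    if not kept:
        return None  # every sample was irrelevant; the training loop must skip this batch
    return default_collate(kept)


loader = DataLoader(ToyImages(), batch_size=3, collate_fn=collate_relevant)
batches = [b for b in loader if b is not None]
```

The drawback I can see with this is that batches end up smaller than `batch_size` whenever some samples are dropped, and a batch can come back empty.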
What I am wondering is whether a custom Dataset, DataLoader, or Sampler could solve this task for me. I have already written a custom Dataset that stores the directory of the saved images and a list of all the images in that directory, and that returns an image with its label from its __getitem__ method. Is there something more I should add there to filter out certain images? Or should the filter be applied in a custom DataLoader or Sampler?
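For comparison, the Sampler route might look something like the sketch below: scan the dataset once, record the indices of the relevant images, and hand those indices to `SubsetRandomSampler` so the DataLoader never draws the others. Again, this is a sketch under assumptions, not something I know to be the recommended approach: `ToyImages` and `relevant_indices` are hypothetical names, the data is in memory, and the one-time scan would mean reading every image once on a real dataset.

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler


class ToyImages(Dataset):
    """Hypothetical stand-in for the real dataset: black (all 0) and white (all 1) images."""

    def __init__(self):
        black = torch.zeros(1, 4, 4)
        white = torch.ones(1, 4, 4)
        self.samples = [(black, 0), (white, 0), (black, 0),
                        (white, 0), (black, 0), (black, 0)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def relevant_indices(dataset, is_relevant):
    # One pass over the dataset, keeping the index of every relevant image.
    return [i for i in range(len(dataset)) if is_relevant(dataset[i][0])]


dataset = ToyImages()
# Assumed relevance rule: pixel sum equal to 0 (all-black image).
keep = relevant_indices(dataset, lambda img: img.sum().item() == 0)

# The sampler only ever yields indices from `keep`, in random order.
loader = DataLoader(dataset, batch_size=2, sampler=SubsetRandomSampler(keep))
batches = list(loader)
```

With this version every batch is drawn only from the relevant images, so batches stay full except possibly the last one.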
Thank you!