3
votes

I am working on a project where I deal with:

  • 70,000 JPG images totalling 1 GB
  • Each file is ~15 KB.
  • Each image is 424x424.

My current solution for working with these files is to take each image, crop it to 150x150, and save it into a NumPy memmap array. I end up with one large memmap file with dimensions 70,000 x 150 x 150 x 3 (colour images).
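Roughly, that preprocessing step looks like this (the paths and the exact crop offsets are simplified for illustration):

    import glob
    import numpy as np
    from PIL import Image

    files = sorted(glob.glob("images/*.jpg"))   # illustrative location
    n = len(files)

    # One big memory-mapped array: n x 150 x 150 x 3, uint8
    out = np.memmap("crops.dat", dtype=np.uint8, mode="w+",
                    shape=(n, 150, 150, 3))

    for i, path in enumerate(files):
        img = np.asarray(Image.open(path))      # 424 x 424 x 3
        top = (424 - 150) // 2                  # centre crop, simplified
        out[i] = img[top:top + 150, top:top + 150, :]

    out.flush()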

My next step is to loop through this memmap array and randomly sample image patches. However, my code is running very slowly at the moment and, most annoyingly, it only uses about 10% of the CPU with an HD read speed of 1-5 MB/sec. This is probably even slower than skipping the pre-computed memmap array entirely and reading the JPGs every time.
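The sampling loop is essentially the following (the patch size and number of samples are made up for illustration):

    import numpy as np

    data = np.memmap("crops.dat", dtype=np.uint8, mode="r",
                     shape=(70000, 150, 150, 3))

    patch = 32                              # illustrative patch size
    rng = np.random.RandomState(0)

    patches = []
    for _ in range(100000):                 # illustrative sample count
        i = rng.randint(0, data.shape[0])   # random image -> scattered reads
        y = rng.randint(0, 150 - patch)
        x = rng.randint(0, 150 - patch)
        patches.append(np.array(data[i, y:y + patch, x:x + patch, :]))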

What can I do to make better use of my system resources here?

System Information

  • Mac OS X
  • MacBook Pro with HDD

Thanks!

If you need just a random subset of your images, why don't you randomly select the original files and process just those? Additionally, you can start a background thread that resizes the original images to 150x150. – Andrei Boyanov
What OS do you use? I guess your OS issues random I/O on the memmap backing file. – nodakai
@nodakai Edited with the extra info. – mchangun

1 Answer

0
votes

Firstly, @AndreiBoyanov's comment really sounds reasonable to me.

Here's another approach.

>>> 7e4 * 150**2 * 3 / 1024.**3
4.400499165058136

The memmap backing file will grow to about 4.4 GB. If your OS X machine has more RAM than that, you can create the backing file on a RAM disk of, say, 5 GB.
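For example (the mount point and RAM disk size below are just an assumption; on OS X a RAM disk can be created with hdiutil/diskutil):

    # Create a ~5 GB RAM disk first (the ram:// size is in 512-byte sectors), e.g.
    #   diskutil erasevolume HFS+ "RAMDisk" $(hdiutil attach -nomount ram://10485760)
    # then point the memmap backing file at it:
    import numpy as np

    data = np.memmap("/Volumes/RAMDisk/crops.dat", dtype=np.uint8, mode="w+",
                     shape=(70000, 150, 150, 3))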

It almost amounts to throwing away memmap, but it will be a quick fix for your problem.