In this answer, I will first say something about how you can reduce the number of features in a single image to a more reasonable amount than 10k. Then I will explain how you can get around storing all extracted features. Oh, and to answer your question: I think you made a mistake when converting bytes to gigabytes: 1 gigabyte is 1024^3 bytes, so I calculate ~2.4 GB.
1. Reducing the number of features
You probably won't need 10k SIFT features per image for reliable matching. Many of the detected features will correspond to very small gradients and will most likely be noise. The documentation of the vl_sift function gives you two ways to control the number of keypoints:
- using a peak threshold
PeakThresh. This is a threshold applied to the difference-of-Gaussians (DoG) image, so only points with a strong peak in the DoG, i.e. large and distinct gradients, are kept as features.
- using the edge threshold
EdgeThresh. SIFT also responds strongly along edges, but points lying on an edge are poorly localized, and we only want distinct keypoints, not whole edges, so edge-like responses are removed. This is controlled with the EdgeThresh parameter, where a higher (more permissive) value keeps more features.
This helps you reduce the number of features to a reasonable amount, i.e. an amount you can handle (how much that is depends on your infrastructure, your needs, and your patience). This works for small-scale retrieval applications. For large-scale image retrieval, however, you might have thousands or millions of images, and even with only 10 descriptors per image you would still struggle to handle all of them.
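To give a feel for what the two thresholds do, here is a toy NumPy sketch. The detector responses are made-up random numbers, not real SIFT output; the edge test is the standard one from Lowe's SIFT formulation, with a curvature ratio r rejected when (r+1)^2/r exceeds (t+1)^2/t for threshold t:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical detector measurements (illustrative only, not real SIFT output):
dog_peak = np.abs(rng.normal(0.0, 0.02, size=n))  # |DoG| response per keypoint
curv_ratio = rng.uniform(1.0, 30.0, size=n)       # ratio of principal curvatures r

peak_thresh = 0.03  # like vl_sift's 'PeakThresh': discard weak, noisy responses
edge_thresh = 10.0  # like vl_sift's 'EdgeThresh': discard edge-like keypoints

# Peak test: keep only strong DoG responses.
peak_ok = dog_peak > peak_thresh
# Edge test: keep a point if (r+1)^2 / r < (t+1)^2 / t.
edge_ok = (curv_ratio + 1) ** 2 / curv_ratio < (edge_thresh + 1) ** 2 / edge_thresh

kept = peak_ok & edge_ok
print(f"{kept.sum()} of {n} keypoints survive both thresholds")
```

Tightening PeakThresh (raising it) or EdgeThresh (lowering it) shrinks the surviving set further, which is exactly the knob you want for storage.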
2. Clustering into visual words
To solve this problem, you run k-means clustering on the extracted descriptors, which gives you k cluster centers, the so-called "visual words" (following the terminology of text retrieval). Next, you build an inverted index, which works exactly like a book index: you record which word is mentioned on which page of the book, or here: which visual word appears in which image.
For a new image, all you have to do is find the nearest visual word for each descriptor and use the inverted index to see which database images contain the same visual words as your query image.
That way, all you have to store are the centroids of your visual words and the inverted index itself. The number of visual words can be anything, though I'd say something in the range of 100 to 10k is often reasonable.
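The whole pipeline (pool descriptors, cluster into visual words, build the inverted index, query it) can be sketched in a few lines of NumPy. Everything here is toy data: 8-dimensional descriptors instead of SIFT's 128, tiny image counts, and a plain Lloyd's k-means written out by hand for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database: 20 images, each with 50 descriptors of dimension 8
# (real SIFT descriptors are 128-dimensional; small numbers keep this readable).
n_images, n_desc, dim, k = 20, 50, 8, 5
database = [rng.normal(size=(n_desc, dim)) for _ in range(n_images)]

# --- 1. k-means on all descriptors pooled together -> k visual words ---
all_desc = np.vstack(database)
centroids = all_desc[rng.choice(len(all_desc), size=k, replace=False)]
for _ in range(20):  # a few Lloyd iterations are enough for a sketch
    # assign each descriptor to its nearest centroid
    dists = np.linalg.norm(all_desc[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move each centroid to the mean of its assigned descriptors
    for j in range(k):
        if (labels == j).any():
            centroids[j] = all_desc[labels == j].mean(axis=0)

def quantize(descriptors):
    """Map descriptors to the set of nearest visual-word indices."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return np.unique(d.argmin(axis=1))

# --- 2. inverted index: visual word -> ids of images containing it ---
inverted_index = {w: set() for w in range(k)}
for img_id, desc in enumerate(database):
    for w in quantize(desc):
        inverted_index[w].add(img_id)

# --- 3. query: quantize the query descriptors, look up candidate images ---
query = database[3] + rng.normal(scale=0.01, size=(n_desc, dim))  # noisy copy of image 3
candidates = set()
for w in quantize(query):
    candidates |= inverted_index[w]
print("candidate images sharing visual words with the query:", sorted(candidates))
```

Note that only `centroids` and `inverted_index` need to be kept around at query time; the raw descriptors of the database images can be thrown away. With a small vocabulary like k=5 the index is not very discriminative (most images share most words), which is why real systems use vocabularies in the hundreds to thousands.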