In this answer, I will first say something about how you can reduce the number of features in a single image to a more reasonable amount than 10k. Then I will explain how you can get around storing all extracted features. Oh, and to answer your question: I think you made a mistake when converting bytes to gigabytes: 1 gigabyte is 1024^3 bytes, so I calculate ~2.4 GB.
1. Reducing the number of features
You probably won't need 10k SIFT features per image for reliable matching. Many of the detected features will correspond to very small gradients and will most likely be noise. The documentation of the vl_sift function gives you two ways to control the number of keypoints:
- using a peak threshold
PeakThresh. This is a threshold applied to the difference-of-Gaussians (DoG) image, so only points with a strong peak in the DoG, i.e. large and distinct gradients, are kept as features.
- using the edge threshold
EdgeThresh. SIFT also responds strongly along edges, but points lying on an edge are poorly localized, and we only want distinct keypoints, not whole edges, so edge-like responses are removed. This is controlled with the EdgeThresh parameter, where a higher (more permissive) value keeps more features.
This helps you reduce the number of features to a reasonable amount, i.e. an amount you can handle (how much that is depends on your infrastructure, your needs, and your patience). This works for small-scale retrieval applications. For large-scale image retrieval, however, you might have thousands or millions of images, and even with only 10 descriptors per image you would still struggle to handle all of them.
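To give a feel for what the two thresholds do, here is a toy NumPy sketch. The detector responses are made-up random numbers, not real SIFT output; the edge test is the standard one from Lowe's SIFT formulation, with a curvature ratio r rejected when (r+1)^2/r exceeds (t+1)^2/t for threshold t:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical detector measurements (illustrative only, not real SIFT output):
dog_peak = np.abs(rng.normal(0.0, 0.02, size=n))  # |DoG| response per keypoint
curv_ratio = rng.uniform(1.0, 30.0, size=n)       # ratio of principal curvatures r

peak_thresh = 0.03  # like vl_sift's 'PeakThresh': discard weak, noisy responses
edge_thresh = 10.0  # like vl_sift's 'EdgeThresh': discard edge-like keypoints

# Peak test: keep only strong DoG responses.
peak_ok = dog_peak > peak_thresh
# Edge test: keep a point if (r+1)^2 / r < (t+1)^2 / t.
edge_ok = (curv_ratio + 1) ** 2 / curv_ratio < (edge_thresh + 1) ** 2 / edge_thresh

kept = peak_ok & edge_ok
print(f"{kept.sum()} of {n} keypoints survive both thresholds")
```

Tightening PeakThresh (raising it) or EdgeThresh (lowering it) shrinks the surviving set further, which is exactly the knob you want for storage.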
2. Clustering into visual words
To solve this problem, you run k-means clustering on the extracted descriptors, which gives you k cluster centers, the so-called "visual words" (following the terminology of text retrieval). Next, you build an inverted index, which works exactly like a book index: you record which word is mentioned on which page of the book, or here: which visual word appears in which image.
For a new image, all you have to do is find the nearest visual word for each descriptor and use the inverted index to see which database images contain the same visual words as your query image.
That way, all you have to store are the centroids of your visual words and the inverted index itself. The number of visual words can be anything, though I'd say something in the range of 100 to 10k is often reasonable.
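The whole pipeline (pool descriptors, cluster into visual words, build the inverted index, query it) can be sketched in a few lines of NumPy. Everything here is toy data: 8-dimensional descriptors instead of SIFT's 128, tiny image counts, and a plain Lloyd's k-means written out by hand for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database: 20 images, each with 50 descriptors of dimension 8
# (real SIFT descriptors are 128-dimensional; small numbers keep this readable).
n_images, n_desc, dim, k = 20, 50, 8, 5
database = [rng.normal(size=(n_desc, dim)) for _ in range(n_images)]

# --- 1. k-means on all descriptors pooled together -> k visual words ---
all_desc = np.vstack(database)
centroids = all_desc[rng.choice(len(all_desc), size=k, replace=False)]
for _ in range(20):  # a few Lloyd iterations are enough for a sketch
    # assign each descriptor to its nearest centroid
    dists = np.linalg.norm(all_desc[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move each centroid to the mean of its assigned descriptors
    for j in range(k):
        if (labels == j).any():
            centroids[j] = all_desc[labels == j].mean(axis=0)

def quantize(descriptors):
    """Map descriptors to the set of nearest visual-word indices."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return np.unique(d.argmin(axis=1))

# --- 2. inverted index: visual word -> ids of images containing it ---
inverted_index = {w: set() for w in range(k)}
for img_id, desc in enumerate(database):
    for w in quantize(desc):
        inverted_index[w].add(img_id)

# --- 3. query: quantize the query descriptors, look up candidate images ---
query = database[3] + rng.normal(scale=0.01, size=(n_desc, dim))  # noisy copy of image 3
candidates = set()
for w in quantize(query):
    candidates |= inverted_index[w]
print("candidate images sharing visual words with the query:", sorted(candidates))
```

Note that only `centroids` and `inverted_index` need to be kept around at query time; the raw descriptors of the database images can be thrown away. With a small vocabulary like k=5 the index is not very discriminative (most images share most words), which is why real systems use vocabularies in the hundreds to thousands.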