2
votes

Assumptions:

  • The usernames of registered users are stored in a set
  • I want to use a Bloom filter to make lookups faster.
  • The Bloom filter as a certain probability of false-positives (0.1%)

When a new user wants to register, in most cases, my UI tells them "this name is not in use, you're good to go".

But what does the backend need to do if a positive match is found?

The result might be a false-positive. Would finding out the true answer not add to time-complexity and thus make Bloom filters inefficient in many cases?

Telling a user "Name aready in use, choose a different oe" might not be so bad, but what about other use cases where you cannot be wrong.

2

2 Answers

2
votes

The general model for using Bloom filters goes like this:

  1. Query the filter to see whether the answer might be yes.
  2. If the Bloom filter says no, the answer is definitely no.
  3. If the Bloom filter says yes, the answer might be yes, so query a more accurate data structure to get a final determination.

Bloom filters really shine when step (3) is of the form “query some server somewhere to search a gigantic database to see if you have the item in question.” In that case, reducing the number of times the server needs to get pinged in order to make a determination can lead to huge performance gains on the client and reduce the load on the servers.

On the other hand, if you’re storing a small data set locally on a machine, then a Bloom filter isn’t likely to do all that much because querying that data set directly is probably going to be fast enough for all your needs.

2
votes

Bloomfilter is used in cases only when you want to detect whether an incoming word already exists in our database.

To be specific bloomfilter outputs either

  • Yes- that means a word may or may not exist in our database
  • No- that means it is 100 percent sure that the word doesn't exist

Moreover, a false positive should not harm in your case but still, if you are really worried, the errorrate of bloomfilters is given by the formula:

errorrate = [1-(1-(1/m))^(k*i)]^k

where i is number of current insertions, i. e. if you are inserting 20th word in the bloom filter then i=20 k is number of hash functions m is size of bit array

So you can go ahead and minimize it to suite your need.

Moreover the optimal number of hash functions used is given by

k=(m/n)log 2

where n is total number of words to be inserted. So you can choose the number of hash functions accordingly.