A Bloom filter with >500M items, even allowing a comparatively high false positive rate such as 1 in 100, still takes hundreds of MB, which would not be much more accessible than the actual dump files.
The compressed archive here is over 8 GB. An uncompressed 2 GB Bloom filter with 24 hash functions and half a billion entries has a false positive rate of less than 1 in 14 million.
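That figure checks out against the standard Bloom filter approximation (1 - e^(-kn/m))^k, assuming "2 GB" means 2 GiB (with 2 * 10^9 bytes the rate comes out somewhat worse, around 1 in 4.6 million, but the same ballpark). A quick sanity check:

```python
import math

# Parameters from the comment above: a 2 GiB filter (assumed binary GB),
# half a billion entries, 24 hash functions.
m = 2 * 2**30 * 8        # filter size in bits
n = 500_000_000          # number of inserted items
k = 24                   # number of hash functions

# Standard false positive approximation: (1 - e^(-kn/m))^k
fpr = (1 - math.exp(-k * n / m)) ** k
print(f"false positive rate: about 1 in {1 / fpr:,.0f}")
```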
75% space savings, no decompression needed for use, and a 1 in 14 million false positive rate: nothing to sneeze at.
Counting Bloom filters are only marginally more difficult to implement. To increment a key, find the minimum value stored across all of the key's slots, then increment only the slots holding that minimum. To read a key, return the minimum of the values stored in its slots.
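A minimal sketch of that increment/read scheme, with illustrative sizes and a double-hashing index derivation that are my choices, not anything from the comment:

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter with the minimum-increment update
    described above (often called a conservative update)."""

    def __init__(self, num_slots=1024, num_hashes=4):
        self.slots = [0] * num_slots
        self.num_hashes = num_hashes

    def _indexes(self, key):
        # Derive the key's slot indexes from one digest via double hashing.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step
        return [(h1 + i * h2) % len(self.slots)
                for i in range(self.num_hashes)]

    def increment(self, key):
        idxs = self._indexes(key)
        lo = min(self.slots[i] for i in idxs)
        # Only bump the slots currently sitting at the minimum.
        for i in idxs:
            if self.slots[i] == lo:
                self.slots[i] += 1

    def count(self, key):
        # The minimum over the key's slots; may overcount, never undercount.
        return min(self.slots[i] for i in self._indexes(key))
```

Updating only the minimum slots is what keeps collisions from inflating every counter a key touches.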
For these purposes, however, you probably want to just store separate Bloom filters for counts above different thresholds, since the common use case is an accept/reject decision based on a single threshold.
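The per-threshold idea can be sketched as an offline build step: given known counts, populate one plain Bloom filter per threshold and query whichever one matches your cutoff. The counts, thresholds, and filter parameters below are all made up for illustration:

```python
import hashlib

def bloom_indexes(key, num_bits, num_hashes):
    # Same double-hashing derivation, for a plain (non-counting) filter.
    digest = hashlib.sha256(key.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def build_filter(keys, num_bits=4096, num_hashes=4):
    bits = bytearray(num_bits)  # one byte per bit, for simplicity
    for key in keys:
        for i in bloom_indexes(key, num_bits, num_hashes):
            bits[i] = 1
    return bits

def maybe_present(bits, key, num_hashes=4):
    return all(bits[i] for i in bloom_indexes(key, len(bits), num_hashes))

# Hypothetical password counts; one filter per count threshold.
counts = {"123456": 23_000_000, "hunter2": 150, "correct horse": 3}
thresholds = [10, 1000]
filters = {t: build_filter(k for k, c in counts.items() if c >= t)
           for t in thresholds}
```

An accept/reject check is then a single membership query against the filter for your chosen threshold, with false positives only ever rejecting a password that was actually rare.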