While not the most cache friendly operation, binary searching a sorted data set is a pretty good way to search a data set for a specific value because for N items, you have an average case and worst case of O(log N) (Wikipedia: Binary Search Algorithm).
Hashing is also a decent way to do data searching, because it has a best case for search of O(1) with the average case being able to be near that when tuned well, but in practice due to collisions it can get get up to O(n) (Wikipedia: Hash Table).
Note that if you know the keys in advance, you can ALWAYS get O(1) lookup by using the information from the last post: Minimal Perfect Hashing.
One way to deal with hash collisions is to store all the collisions for a specific hash value in a list or array.
For instance, if you had a 4 bit hash, you could have 16 arrays, where the th array stored all the items that hashed to 0, the th array stored all items that hashed to 1 and so on, going up to the th array which stored all items that hashed to 15.
What would happen if we stored the arrays in sorted order and then did a binary search within the buckets? What would that mean to search times?
Interestingly, assuming a good hash function, the answer is that every bit of hash you use subtracts one from the number of tests you’d have to do with binary searching (See footnote 1).
For instance, let’s say you had 1024 items sorted in an array, you would have to do up to 10 tests to search for a value in that array using a binary search since log2(1024)=10 (See footnote 2).
If we use a 1 bit hash, assuming a good hash function, that would split the array into two arrays each having 512 items in it. Those each can take up to 9 tests to search using a binary search since log2(512)=9. Doing the hash to choose which of the two lists to search cuts our search from up to 10 tests, down to 9 tests.
If instead we used a 2 bit hash, we would divide the 1024 item list into four lists each having 256 items in them. Each of these lists can be searched with up to 8 tests since log2(256) = 8. So using a 2 bit hash, we can cut our search down to 8 tests max.
Let’s say that we used an 8 bit hash. That would cut our list up into 256 lists, each of which only has 4 items in it. Each list can be searched with up to 2 tests since log2(4) = 2. Using an 8 bit hash cuts our max number of searches from 10 down to 2!
Let’s say we used a 16 bit hash, what would happen then?
Well, that would mean we had 65536 hash buckets, but only 1024 items to store in the buckets. If we had a best case collision free hash, that means only 1024 buckets had an item in them, and the rest were empty.
You’d have to hash the value you were searching for to get the hash bucket index, and if there was an item there, you’d have to test that item. If there was no item there, you could return that the value wasn’t found.
The hash isn’t free though, so this is basically O(1) or doing 1 test.
So, while each bit of hash subtracts one test from the binary search, it of course can’t go negative, or become free, so it basically has a minimum of 1.
- Technically it only approximately subtracts one from the number of tests you’d have to do, even if the hash function divides the list as evenly as possible, due to footnote 2 and not being able to split an odd number of items exactly in half.
- technically 1024 items in an array could take up to 11 tests! You can see why with 4 items with indices 0,1,2,3. First you’d test index 2. If the value we were looking for was greater, we’d test 3 and then be done with either a found or not found result. That’s just 2 searches total. But, if the value we were looking for was less than index 2, we’d have to test index 1, and then optionally test index 0 depending on how the search value compared to index 1. With only 3 items, indicies 0,1,2, we test index 1, then either index 0 or 2 and be done, and so only have to test twice. A binary search takes log2(N+1) tests, where N is the number of items you are looking through.
The last post i wrote on minimal perfect hashing was insanely popular (by my standards) on both reddit and hacker news. The previous highest number of visitors I had ever had to my blog in one day was about 230, but on the 16th I got 4187 visitors, and on the 17th I got 7277 visitors. That means there were over 10,000 visitors over those two days.
That is nuts!!
As a data point for those who might find it interesting, the bulk of the traffic from the first day was reddit, and the bulk of the traffic from the second day was hacker news.
I also found it interesting to see what all those people looked at after the minimal perfect hashing algorithm.
I’ve learned as the years have gone on so some of my older stuff probably contains more errors than my newer stuff (which is also not error free I’m sure).
Anyways, thanks for reading, and I hope you guys find my writings interesting and useful. Please feel free to comment and correct any misinformation, or let us know about better alternatives (: