Collisions
Definition
By now you've noticed that if two separate entities are hashed to the same integer value, they will want to, obviously, occupy the same index. When this happens, it's a collision.
Goal
How do we handle collisions, while keeping optimal hashing conditions?
Compression Factor
- : Compression factor
- : Occupied cells
- : Total cells
If the compression factor goes up, the number of collisions (obviously) goes up as well.
Types of resolutions:
When we need to insert an element in the table and reach a collision, we can't just omit the element, we have to find a systematic way to continue with the insertion at all costs.
Chaining
Let us treat our cells as linked lists, this would make our table, essentially an array of linked lists. Thus, when a new element causes a collision, no problem, we still associate it to that cell, but append it to that cell's linked list.
Probing
Instead of simply starting linked lists and adding weight to our structure (essentially storing blocks of memory everywhere), a more elegant solution is to look for the next available cell.
But, how do we select the next available cell?
The general form of a hashing function using probes would be:
- : data to be inserted
- : Insertion attempt index
- : Probing function
- : Current index (that caused collision, which will be incremented)
- : Table current size
There are two basic techniques:
- Linear probing
- Quadratic probing
Linear probing consists of, once a collision is found, advanced by constant increments, cell by cell until a free cell is found.
For example, if our data is colliding at index 3, we advance to 4, then 5, and so forth.
Quadratic probing is the next step, instead of finding constant increments, we find increments in a quadratic function of the step.
For example if our data collides at index 0, we go to 1, then 4, then 9, and so forth.
Double hashing
The problem with probing functions is we usually experience heavy clustering of data inserted in the table because the offset is so heavily dependant on the step.
In this case we introduce a second hashing function:
- : Any first hashing function
- : Any second hashing function
This reduces predictable clustering, as we no longer depend on a step so closely, but we have a totally separate hashing function to calculate the offset.
Rehashing
No matter the resolution you choose for your collisions in the hash table, your table will undoubtedly start filling up.
We want to be proactive and not reactive when it comes to space management in hashtables.
We don't want to wait for the table to fill up to then resize, which could prove to be computationally costly at a critical juncture unnecessarily.
So we rehash the table to accomplish two things:
- Resize the table.
- Map the existing values to new cells in the new table (with the new size).
Resizing
We like to keep sizes in hash tables to prime numbers, because this drastically reduces the amount of possible collisions.
So, when resizing, we find the next prime after doubling the current size.
A size of 7, then goes to 17, then 37, then 79, and so forth
Mapping
This is basically just reinserting each element again by going through the whole process all over again.