Writing a Hash Table in Go
Inspired by a recent post from Ben Hoyt, a refresher of Computer Science fundamentals, and my journey learning the Go programming language, I’ve implemented a hash table in Go.
Hash Tables are a great data structure; they are to Balanced Trees what linear-time sorting is to comparison sorts. By not relying on the comparison model, they let you go below the log n lower bound for searching that Balanced Trees provide. Hash Tables give you constant-time look-ups and amortized constant-time insertion and deletion. This means that adding and deleting items will mostly take O(1), but some operations might take longer (e.g. O(n) for rehashing). This isn’t a big problem though: as we will see later in this post, if we choose our strategies well, rehashing won’t happen very often and most operations will run in constant time.
This implementation is for learning purposes only and shouldn’t be used for anything serious. I had quite some fun writing the code and relearning Hash Table concepts. This post walks through some of the things I learned while implementing it, together with the code for the important parts.
Abstract Data Structure
Insert (key, value)
Delete (key)
Search (key) -> value
Any hash table has to comply with some version of the abstract data structure above. Other common data structures, such as lists and trees, also implement it. But a Hash Table shines by allowing all of these operations to run in constant time (with high probability). This is possible because searching in a Hash Table doesn’t rely on comparisons, unlike Trees and Lists. By treating our keys as raw values (i.e. numbers), we can do arithmetic operations other than comparing two values. However, this brings us to the first constraint of making a Hash Table:
Constraint 1: All keys in a hash table must be hashable. To put it simply: we must be able to transform a key into a number.
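As a rough sketch, the abstract operations could be expressed as a Go interface like the hypothetical one below (the implementation later in this post exposes them as Set, Get and Delete on a concrete Table type, and PreHashable is the key interface introduced further down):

// Hypothetical interface for the abstract data type above; not used by
// the actual implementation in this post.
type Dictionary interface {
	Insert(key PreHashable, value interface{})
	Delete(key PreHashable) error
	Search(key PreHashable) (interface{}, bool)
}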
When a Hash Table receives a key and a value, it generates an index based on the key and inserts the value at that index. Determining the index from the key is the responsibility of a hash function, and the engineer implementing the Hash Table decides which hash function to use. Since we’re mapping a universe of keys to a finite number of indexes (our internal array size), we must keep in mind that multiple keys might map to the same internal index. We call this a collision. To keep collisions to a minimum, one has to be careful when choosing a hash function: too many collisions can become the bottleneck of a Hash Table implementation, and they are usually the fault of a poor hash function. The “Simple Uniform Hashing” assumption defines an exemplary model for hash functions: each key in the key space is equally likely to hash to any slot of the table, independently of where other keys are hashed. Unfortunately, this is not achievable in practice, but it gives us a guideline for a good function.
Constraint 2: The constant runtime promise of our Hash Map heavily depends on the hash function we choose.
Due to limited array space, after many additions to the Hash Table collisions will become more frequent. In order to keep the constant-time operations, we have to put guards in place so that our array doesn’t get too full. The generic solution is to extend the array and rehash all entries. To make a conscious decision about when the array should be resized, we calculate the load factor of our Hash Table: n/m, where n is the number of items and m is the size of the internal array. For example, 8 items stored in an internal array of 16 slots gives a load factor of 0.5.
Constraint 3: The constant runtime promise of our Hash Map heavily depends on keeping the load factor < 1 (i.e. n < m).
The Hash Function
The division method approach: h(k) = k mod m. Given an integer key k, take its remainder modulo m, our internal array size. It can get messy quickly: if our keys follow any pattern (e.g. only even numbers), we will have a lot of unused slots and a lot of collisions.
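As a tiny sketch (not the function used in the rest of this post), the division method for non-negative integer keys is essentially a one-liner:

// Division method: map a non-negative integer key to a bucket index.
// Illustration only; the post uses FNV below.
func divisionHash(key, m int) int {
	return key % m
}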
I’m going to choose the same hash function that Ben chose in his post. It is easy to implement and lets me use strings more easily, since it works on a byte array rather than on a number.
Fowler-Noll-Vo (FNV)
// Standard 64-bit FNV-1a constants.
const (
	fnvOffSetBasis = uint64(14695981039346656037)
	fnvPrime       = uint64(1099511628211)
)

func hashValue(v PreHashable, limit int) int {
	hash := fnvOffSetBasis
	for _, b := range v.HashBytes() {
		hash = hash ^ uint64(b) // FNV-1a: XOR the byte in first...
		hash = hash * fnvPrime  // ...then multiply by the prime
	}
	return int(hash % uint64(limit))
}
This function takes any key that can be converted into a byte array and returns a big integer (unsigned 64-bit). We take that result modulo our internal array size to get a number that the internal array can index.
PreHashable
Before hashing the value, we have to pre-hash it, which means making sure our hash function can digest the key. This process usually means transforming any value into a number, but since our hash function takes a []byte, this is the interface we need:
type IntKey int
type StringKey string

// PreHashable is any key that can be turned into bytes for hashing
// and compared against another key for equality.
type PreHashable interface {
	HashBytes() []byte
	Equal(PreHashable) bool
}

func (i IntKey) HashBytes() []byte {
	buf := make([]byte, binary.MaxVarintLen64) // needs "encoding/binary"
	n := binary.PutVarint(buf, int64(i))
	return buf[:n]
}

func (i IntKey) Equal(other PreHashable) bool {
	v, ok := other.(IntKey)
	return ok && i == v
}

func (str StringKey) HashBytes() []byte {
	return []byte(str)
}

func (str StringKey) Equal(other PreHashable) bool {
	v, ok := other.(StringKey)
	return ok && str == v
}
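Just to see how the pieces fit together, here is roughly how the key types and hashValue interact (a small sketch; 16 is just an arbitrary bucket count):

// Somewhere inside a function in the same package:
idx1 := hashValue(IntKey(42), 16)          // some index in [0, 16)
idx2 := hashValue(StringKey("hello"), 16)  // likewise
fmt.Println(idx1, idx2) // needs "fmt"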
The HashTable Data Structure
Our Table structure contains the internal access array, called buckets. Each bucket holds a list of entries, the data structure that stores the original key and value. The table also holds a length, which counts the number of pairs added and is exposed via a method called Len(). The initial buckets array size can be parameterized or left to a default value; in this example, initialM is set to 16.
// Default number of buckets when no size is given.
const initialM = 16

type Table struct {
	length  int
	buckets [][]entry
}

type entry struct {
	key   PreHashable
	value interface{}
}

func NewSized(initial int) *Table {
	return &Table{
		buckets: make([][]entry, initial),
	}
}

func New() *Table {
	return NewSized(initialM)
}
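As a small usage sketch (Len() is the method mentioned above, which isn’t shown in the snippets here):

// Somewhere inside a function in the same package:
fmt.Println(New().Len())        // 0 entries, initialM (16) buckets by default
fmt.Println(NewSized(64).Len()) // 0 entries, 64 buckets up front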
Handling Collisions
There are two ways of handling collisions in hash tables: Chaining and Open Addressing. I implemented chaining in this example. In the chaining method, each slot in our internal array holds a list. In principle this can be a linked list, but any list-like structure (such as a Go slice) will do. Since we strive to keep collisions to a minimum, and as long as we keep the promise of a good hash function, the number of items in these lists should stay close to 1. This means that even when a chain has multiple items, a simple linear search to find the correct entry won’t hurt our overall running time.
In order to both store multiple items and compare against the original key (to find the real item), the entry struct contains both the interface{} value and the original PreHashable key. On look-up, the stored key is what we use to locate the correct entry.
func (ht *Table) Get(key PreHashable) (interface{}, bool) {
	hash := hashValue(key, len(ht.buckets))
	// Walk the chain in this bucket and compare against the stored key.
	for _, v := range ht.buckets[hash] {
		if v.key.Equal(key) {
			return v.value, true
		}
	}
	return nil, false
}
Table expanding
As said in the beginning, in order to keep the average constant-time look-ups, we have to make sure that our load factor doesn’t get too close to 1. A load factor of 1 means that we have as many keys in our Hash Table as we have slots, which is evidently bad, because it means that every additional element will cause a collision. To avoid that, we apply the table-expanding technique: whenever our table is half full (reaches a load factor of 0.5), we double its size.
Doubling the table means creating a new internal array and copying all the values over. This process is O(n), which doesn’t comply with our promise of constant-time operations. However, if we double the table every time the load factor reaches 0.5, this rehashing only happens about log n times for n items added to our Hash Table. If we started at 2 and added 64 items, our table would grow to 128 slots, performing 6 table-expanding operations (2 → 4 → 8 → 16 → 32 → 64 → 128).
This is how I implemented table expanding and the Set method:
// Expand the table once more than half of the buckets are in use.
const loadFactorThreshold = 0.5

func (ht *Table) loadFactor() float32 {
	return float32(ht.length) / float32(len(ht.buckets))
}

func (ht *Table) Set(key PreHashable, value interface{}) {
	hash := hashValue(key, len(ht.buckets))
	// Check if the key was already added; if so, just overwrite the value.
	for i, e := range ht.buckets[hash] {
		if e.key.Equal(key) {
			ht.buckets[hash][i].value = value
			return
		}
	}
	ht.buckets[hash] = append(ht.buckets[hash], entry{key, value})
	ht.length += 1
	if ht.loadFactor() > loadFactorThreshold {
		ht.expandTable()
	}
}
func (ht *Table) expandTable() error {
	// Allocate a new array twice the size and rehash every entry into it.
	newTable := make([][]entry, len(ht.buckets)*2)
	for _, bucket := range ht.buckets {
		for _, e := range bucket {
			newHash := hashValue(e.key, len(newTable))
			newTable[newHash] = append(newTable[newHash], entry{e.key, e.value})
		}
	}
	ht.buckets = newTable
	return nil
}
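Putting it together, usage of Set and Get looks roughly like this (a small sketch, assuming the key types defined earlier are in the same package):

// Somewhere inside a function in the same package:
ht := New()
ht.Set(StringKey("answer"), 42)
ht.Set(IntKey(7), "seven")

if v, ok := ht.Get(StringKey("answer")); ok {
	fmt.Println(v) // prints 42
}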
Deletion
Deleting from the Hash Table is a fairly trivial process. I chose not to implement down-scaling, but the same principles of table doubling can be applied to shrinking the table. What needs to be considered is that one shouldn’t use the same threshold for doubling and for shrinking: if we shrank the array as soon as it became half empty, sequences of additions and deletions close to the threshold would keep triggering O(n) rehashes, which is exactly what we’re trying to avoid. For that reason, wait until the load factor drops to about 1/3 of the array before shrinking. This leaves some buffer before having to rehash again.
func (ht *Table) Delete(key PreHashable) error {
	hash := hashValue(key, len(ht.buckets))
	for i, v := range ht.buckets[hash] {
		if v.key.Equal(key) {
			// Remove the entry by swapping it with the last one in the
			// chain and shrinking the slice by one.
			current := ht.buckets[hash]
			current[i] = current[len(current)-1]
			current = current[:len(current)-1]
			ht.length -= 1
			ht.buckets[hash] = current
			return nil
		}
	}
	return fmt.Errorf("key not found") // needs "fmt"
}
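Coming back to the shrinking idea above, for illustration only (the post’s implementation does not shrink), a check along these lines could be called at the end of Delete; the name maybeShrink, the ~1/3 threshold and the initialM floor are my own choices:

func (ht *Table) maybeShrink() {
	// Hypothetical: shrink only when the table is less than ~1/3 full,
	// and never below the default size.
	if len(ht.buckets) <= initialM || ht.loadFactor() >= 1.0/3 {
		return
	}
	newTable := make([][]entry, len(ht.buckets)/2)
	for _, bucket := range ht.buckets {
		for _, e := range bucket {
			newHash := hashValue(e.key, len(newTable))
			newTable[newHash] = append(newTable[newHash], e)
		}
	}
	ht.buckets = newTable
}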
Further points
We haven’t touched on API design in this post; I mostly kept the implementation as simple as possible and close to the abstract data structure. There isn’t much error handling being done, and in case of memory problems the program might simply panic.
As per Cormen and many other sources, the best approach to the hash function is Universal Hashing, which picks a hash function at random from a family of functions and avoids possible exploits from malicious users. If a user knows the implemented hash function, the input data can be manipulated so that all values end up in the same bucket, defeating our constant-time promise. Throughout this exercise I did no validation of the actual distribution of the keys I was testing with; however, that is beyond the goals of this post.
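To make the idea concrete, here is a sketch of the classic universal family for integer keys, h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key and a and b are chosen at random when the table is created. This is not part of the implementation above, and all names here are my own:

// Sketch of a universal hash family for integer keys; needs "math/rand".
const universalP = 4294967291 // largest prime below 2^32

type universalHash struct {
	a, b uint64
}

func newUniversalHash() universalHash {
	return universalHash{
		a: 1 + rand.Uint64()%(universalP-1), // a in [1, p-1]
		b: rand.Uint64() % universalP,       // b in [0, p-1]
	}
}

func (u universalHash) index(key uint64, m int) int {
	k := key % universalP // keep the multiplication below within uint64
	return int(((u.a*k + u.b) % universalP) % uint64(m))
}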
There is a way to avoid paying the O(n) rehashing time in a single operation: distribute the work in smaller chunks, sneaked into other operations or run as a background task. This doesn’t change the total runtime of our algorithm, it just spreads the time a little better and avoids spikes. I haven’t implemented this approach, but might give it a try at a later point in time. Two examples of such behavior, where the rehashing is spread across other operations, are Golang’s map implementation and Redis’s dict.
If you read this far, I hope you found this post useful and that it encouraged you to experiment with writing your own Hash Table.
Acknowledgements
Big thanks to Ben Hoyt for reviewing the code, for the insightful discussion on how to implement this and for catching all the problems with the original implementation.
If you hated this post and can't keep it to yourself, consider sending me an e-mail at fred.rbittencourt@gmail.com or complaining to me on X (Twitter) at @derfrb. I'm also occasionally responsive to positive comments.