Writing a Hash Table in Go

Inspired by the recent post from Ben Hoyt, a recent refresher of Computer Science fundamentals and my journey on learning the Go programming language, I’ve implemented a hash table in Go.

Hash Table is a great data structure, they are to Balanced Trees what linear time sorting is to comparison sorts. By not relying on the comparison model, they allow you go below the log n lower bound for searching provided by Balanced Trees. Hash Tables allow you to do constant time look-ups and amortized constant time addition and deletion. This means that adding and deleting items will take mostly O(1), but some operations might take a little longer (i.e. O(n) for rehashing). This isn’t a big problem though, as we will see later in this post, if we choose our strategies well, we can assume that this won’t happen very often and for most operations we will run in constant time.

This implementation is for learning purposes only, and shouldn’t be used for anything serious. I had quite some fun writing the code and relearning Hash Table concepts. This post walks through some of the learnings I had while implementing this, together with the code for the important parts.

Link to the GitHub Repository

Abstract Data Structure

Insert (key, value)
Delete (key)
Search (key) -> value

Any hash table has to comply with some version of the abstract data structure above. Other common data structures also implemented this Abstract Data Structure like lists and trees. But where a Hash Map shines is by allowing that all these operations can have constant time complexity (with high probability). This is possible because searching on a Hash Table doesn’t rely on comparisons, like Trees and Lists. By treating our keys as raw values (i.e. numbers), we can do arithmetic operations other than comparing two values. However, this brings us to the first constraint of making a Hash Table:

Constraint 1: All keys in a hash table must be hashable. To put it simply: we must be able to transform a key into a number.

When a Hash Table receives a key and a value, it generates an index based on the key received and inserts the value in this index. Determining the index based on another value is the responsibility of a hash function. The engineer implementing the Hash Table will decide which hash function to implement. Since we’re mapping a universe set of keys to a finite number of indexes (our internal array size) we must keep in mind that multiple keys might map to the same internal index. We call this a collision. To keep collisions to a minimum, one has to be careful when choosing a hash function. Too many collisions could be the bottleneck of a Hash Table implementation, and it is the fault of a poor hash function choice. The “Simple Uniform Hashing” assumption defines an exemplary model for hash functions. The assumption is: each key in the key space is equally likely to be hashed to any slot of the table, independent of where other keys are hashed. Unfortunately, this is not achievable, but it gives us a guideline for a good function.

Constraint 2: The constant runtime promise of our Hash Map heavily depends on the hash function we choose.

Due to limited array space, after many additions to the Hash Table, collisions will become more frequent. In order to keep the constant time operations, we have put guards to avoid that our array gets too full. The generic solution to that is to extend the array and rehash all entries. In order to make a conscious decision of when should the array be resized we calculate the load factor of our Hash Map: n/m where n is the number of items and m is the size of the internal array.

Constraint 3: The constant runtime promise of our Hash Map heavily depends on keeping the load factor < 1 (i.e. n < m)

The Hash Function

The division method approach:

h(k) = k mod m

Given a key integer, get the modulo with m our internal array size. It can get messy quickly. If our keys have any patterns (e.g. only even numbers) we will have a lot of unused spaces and a lot of collisions.

I’m going to choose the same Hash Function that Ben chose in his post. It is easy to implement and allows me to more easily use Strings, since it relies on a byte array rather than a number.

Fowler-Noll-Vo (FNV)

Wikipedia Link

func hashValue(v PreHashable, limit int) int {
	hash := fnvOffSetBasis
	for _, b := range v.HashBytes() {
		hash = hash ^ uint64(b)
		hash = hash * fnvPrime
	}
	return int(hash % uint64(limit))
}

This function takes any key that can be converted into a byte array and returns a big integer (unsigned 64 bit) as a response. We module the result with our maximum array size to get a number that the internal array can index.

PreHashable

Before hashing the value, we have to pre-hash it. Which means making sure our hash function can hash the key value. This process usually means transforming any value into a number, but since our hash function takes a []byte, this is the interface we need:

type IntKey int
type StringKey string

type PreHashable interface {
	HashBytes() []byte
	Equal(PreHashable) bool
}

func (i IntKey) HashBytes() []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutVarint(buf, int64(i))
	return buf[:n]
}

func (i IntKey) Equal(other PreHashable) bool {
	v, ok := other.(IntKey)
	return ok && i == v
}

func (str StringKey) HashBytes() []byte {
	return []byte(str)
}

func (str StringKey) Equal(other PreHashable) bool {
	v, ok := other.(StringKey)
	return ok && str == v
}

The HashTable Data Structure

Our Table structure contains the internal access array, called buckets. Each bucket has a list of entries, which is the Data Structure that holds the original key and value. It also holds a length, that counts the number of tuples added, and it’s exported via a method called Len() . The initial bucket array size can be parameterized or left to a default value. In this example, initialM is set to 16.

type Table struct {
	length  int
	buckets [][]entry
}

type entry struct {
	key   PreHashable
	value interface{}
}

func NewSized(initial int) *Table {
	return &Table{
		buckets: make([][]entry, initial),
	}
}

func New() *Table {
	return NewSized(initialM)
}

Handling Collisions

There are two ways of handling collisions in hash tables: Chaining and Open Addressing. I implemented chaining in this example. In the chaining method, each value in our Hash Map is a list. In principle this can be a linked list, but any array would do. Since we strive to keep the collisions to a minimum, and if we can keep the promise of a good hash function, the number of items in these lists should be close to 1. This means that if we have multiple items in a chain, a simple linear search to find the correct would hurt our overall running time.

In order to both store multiple items and also compare the original key (to find the real item), the HashTableEntry struct contains both the interface{} value and the original PreHashable key. On look up that’s what has to be used to locate the correct entry.

func (ht *Table) Get(key PreHashable) (interface{}, bool) {
	hash := hashValue(key, len(ht.buckets))
	for _, v := range ht.buckets[hash] {
		if v.key == key {
			return v.value, true
		}
	}
	return nil, false
}

Table expanding

As said in the beginning, in order to keep the average constant time look-ups, we have to make sure that our load factor isn’t too close to 1. A load factor of 1 means that we have the same number of keys in our Hash Map as we have spaces. This is evidently bad, because it means that every additional element will cause a collision. To avoid that, we applied the table expanding technique, which, whenever our table is half full (reaches a load factor of 0.5) we double its size.

Doubling the table means creating a new internal array and copying all the values over. This process is bound to a O(n) curve, which doesn’t comply to our promise of constant time operations. However, if we double the table every time we reach 0.5 of our load factor, this process will only execute log n time for 2n items added to our Hash Table. If we started at 2 and added 64 items our table will grow to 128 spaces, performing log2 128 = 7 table expanding operations.

This is how I implemented table expanding and the Set method:

func (ht *Table) loadFactor() float32 {
	return float32(ht.length) / float32(len(ht.buckets))
}

func (ht *Table) Set(key PreHashable, value interface{}) {
	hash := hashValue(key, len(ht.buckets))
	// check if key is already added, if yes, just overwrite
	for i, e := range ht.buckets[hash] {
		if e.key == key {
			ht.buckets[hash][i].value = value
			return
		}
	}

	ht.buckets[hash] = append(ht.buckets[hash], entry{key, value})
	ht.length += 1
	if ht.loadFactor() > loadFactorThreshold {
		ht.expandTable()
	}
}

func (ht *Table) expandTable() error {
	newTable := make([][]entry, len(ht.buckets)*2)
	for _, bucket := range ht.buckets {
		for _, e := range bucket {
			newHash := hashValue(e.key, len(newTable))
			newTable[newHash] = append(newTable[newHash], entry{e.key, e.value})
		}
	}
	ht.buckets = newTable
	return nil
}

Deletion

Deleting for the Hash Map is a fairly trivial process. I choose not to implement down-scaling of the Hash Map, but the same principles of table doubling can be applied to shrinking the table. What needs to be considered is that one shouldn’t have the same metric for doubling as to shrinking. If we shrink the array by being half empty, operations that add and delete close to the threshold would always result in a O(n) operation, which is the whole point we’re trying to avoid. For that reason, choose a load factor of ~1/3 of the array before shrinking. This will leave some buffer before having to rehash again.

func (ht *Table) Delete(key PreHashable) error {
	hash := hashValue(key, len(ht.buckets))
	for i, v := range ht.buckets[hash] {
		if v.key == key {
			current := ht.buckets[hash]
			current[i] = current[len(current)-1]
			current = current[:len(current)-1]
			ht.length -= 1
			ht.buckets[hash] = current
			return nil
		}
	}

	return fmt.Errorf("Key error")
}

Further points

We haven’t touched on API design on this post, but mostly I kept the implementation as simple as possible and close to the Abstract Data Structure. There isn’t much error handling being done, and in case of memory problems the program might simply panic.

As per Cormen and many other sources, the best approach to the hash function is to use Universal Hashing, which combines different hash functions and avoids possible exploits from malicious users. If a user is aware of the implemented hash function, the input data can be manipulated in a way that all values will end up in the same bucket. Defeating our constant time promise. Throughout this process, I made no validations to the actual distribution of the keys I was testing with. However, this exercise is beyond the goals of this post.

There is a way to avoid the O(n) rehashing time in a single operation by distributing the process in smaller chunks. These chunks can be sneaked in other operations or as a background task. Assuming this wouldn’t change the real runtime of our algorithm, just distribute the time a little better and avoid spikes. I haven’t implemented this approach, but might give it a try in a later point in time. Two examples of such behavior, where a distributed rehashing spans across other operations: in the Golang’s map implementation and Redis dict.

If you read this far, I hope you found this post useful and encouraged you to experiment writing your own Hash Table.

Acknowledgements

Big thanks to Ben Hoyt for reviewing the code, for the insightful discussion on how to implement this and for catching all the problems with the original implementation.

#Go #Hashtable #Scratch

⇦ Back Home | ⇧ Top |

If you hated this post, and can't keep it to yourself, consider sending me an e-mail at fred.rbittencourt@gmail.com. I'm more responsive to positive comments though.