containers/hash.hpp
A Chained Hash Table Container

Introduction

This component is an STL-like template class which implements a hash table data structure. It has a similar usage and interface to the STL map.

The hash uses separate chaining of elements - in other words, there is a bin per hash value and all keys with that hash value are stored in the same bin. There is no collision resolution to this design, so insertions can never fail. However, if bins become too full, efficiency will drop. For this reason the hash implements rehashing.

This document assumes you know basically what a hash table is and only deals with how to use the hash class.

Instantiation

The hash template takes four types:

namespace stlplus
{
  template<typename K, typename T, class H, class E = equal_to<K> > class hash
}

The first type K is the key type which is used to access elements of the hash table. A hash is in effect an associative store in which the key type is used like the index of an array and the data type is the contents of the element.

The second type T is the data type stored in the hash and accessed via the key.

The third type is the hash function object. A function object is used in the same way as in the map to define the hashing operation of the hash table. An example of a hash function object for a string key is:

struct hash_string
{
  unsigned operator()(const std::string& s) const
  {
    ...
  }
};

The convention is to define a struct or class which contains a function call operator (the operator()) to perform the hashing function. For the hash class, the hashing function takes a key type K and returns an unsigned.

The fourth type is the equality function object. It is similar to the hash function object except that it takes two keys and returns a bool. The default value is the equal_to function object defined in the STL which simply applies the == operator to the two keys. You can therefore define key equality by an appropriate overload of the == operator for the key type. However, if that creates problems you can create a new function object for equality:

struct equal_string
{
  bool operator()(const std::string& l, const std::string& r) const
  {
    ...
  }
};

Thus, to create a hash which takes a string as a key and an int as the data, you would declare a variable like this:

stlplus::hash<std::string,int,hash_string> table;

This creates a variable called table which is a hash from string to int, using your hash_string function object for hashing the key, but using the default equal_to function object for comparisons.

It is clearer using typedefs to define the template types:

typedef stlplus::hash<std::string,int,hash_string> string_int_hash;
string_int_hash table;

The constructor for hash takes an optional argument which defines the number of bins to create in the hash table. If not specified, the hash chooses the number of bins. However, to override it, just give an argument to the constructor:

string_int_hash table(50);

This creates a table with 50 bins.

Note that the hash function can return any unsigned value, it does not have to be in the range 0..bins-1, since the hash class normalises the hash value to this range. This means the same hash function can be used with hashes of different sizes.

Rehashing

Rehashing is where the internal table in the hash is resized to better suit the number of data items. In other words, the number of bins is changed and that means the hash values of all the keys must be recalculated (thus the term rehash). It is a relatively expensive operation but increases the efficiency of the hash. It is pretty well essential if the initial data size of the hash is unknown.

The hash class supports two rehashing modes: auto-rehash and manual-rehash.

Auto-rehash

Auto-rehash mode means that the hash itself decides when to rehash itself. This rehashing is only performed on insertions which increase the loading (ratio of data elements in the table to the number of bins) over 100%.

Auto-rehash is enabled automatically if a hash is constructed with an unspecified number (or zero) bins. This tells the hash to choose the initial size and enable auto-rehashing. It can also be enabled explicitly using the auto_rehash() function.

To leave the choice of initial size to the hash and enable auto-rehash simply don't give an argument to the constructor:

string_int_hash table;

To choose the initial size yourself, but enable auto-rehashing:

string_int_hash table(50);
table.auto_rehash();

The main disadvantage of auto-rehashing is that performing a rehash invalidates all iterators. Since auto-rehash can happen at any time (well, on any insert), this can be unpredictable.

Manual-rehashing

Manual rehashing leaves rehashing decisions to the user, who may decide not to do any at all. This is appropriate for problems where the data size is known in advance and therefore the number of bins can be calculated before the hash is constructed. However, you may want to do rehashing but to make the rehashing decision yourself.

Constructing a hash with a specified number of bins automatically switches the hash to manual-rehashing mode.

string_int_hash table(50);

If you want to do your own rehashing, then a useful function is the loading() function which gives the current hash loading (size divided by number of bins) as a float. Thus you can choose to rehash at a specific point in your program by calling the rehash function explicitly when the loading gets too high:

if (table.loading() > 0.9)
  table.rehash();

The rehash() method takes an optional integer argument. If it is missing or zero, this means implement the built-in calculation for the number of bins. This simply doubles the number of bins. If it is called with a non-zero argument, then the hash will be rehashed to that number of bins.

The main disadvantage of manual hashing is where the data size is unknown and can grow large compared with the number of bins. Basically a loading of greater than 0.5 leads to efficiency creeping down and loadings of greater than 1.0 can really hit performance.

Whole Hash Operations

hash(const hash&);
hash& operator=(const hash&);
void clear(void);
bool empty(void) const;
unsigned size(void) const;

The hash class supports copying via the copy constructor and the assignment operator.

The clear() function deletes all elements in the hash, leaving it empty, but still with the same number of bins as when it was created.

The size() function gives the number of elements in the hash. This is very efficient since the hash keeps track of its size, so it is not necessary for it to count the number of elements. The empty() function tests for the special case of an empty hash, i.e. a hash with a size() == 0.

Adding, Removing, Changing and Accessing Hash Elements

The hash class provides a set of operations which allow the contents of the hash to be created, modified and deleted. For the most common operations on a hash, there's typically at least two ways of doing it.

bool present(const K& key) const;

To test for the existence of a key in the hash, the present() function can be used. This takes a key as an argument and returns true if that key exists in the hash. It is also possible to test for the presence of a key using find() - see later.

size_type count(const K& key) const;

Counts the number of instances of the key in the hash. This can only return 0 or 1 since the hash only allows one instance of a key. This method is included for compatibility with std::map which has a count method.

iterator insert(const K& key, const T& data);

This insert method adds a new key/data pair to the hash. If the key is already present in the hash, then the new value replaces the old value. The insert method returns an iterator to the new element so that it can be inspected or its data field modified.

std::pair<iterator, bool> insert(const value_type& value);

This insert method adds a new value (a key and data stored in a std::pair) to the hash. If the key is already present in the hash, then the new value replaces the old value. This insert method returns a pair containing a bool indicating success/failure and an iterator to the new element in the case of success so that it can be inspected or its data field modified. This insert method has been included for compatibility with std::map.

iterator insert(const K& key);

This insert method adds a new key/data pair to the hash, but this time with a default-contructed data field. If the key is already present in the hash, then the new value replaces the old value. The insert method returns an iterator to the new element so that it can be inspected or its data field modified.

unsigned erase(const K& key);

The erase() function removes the key and its associated data from the hash. The return value is the number of elements erased. If the key was removed, the function returns 1. However, if the key wasn't present and couldn't therefore be removed, it returns 0. This return value is rarely useful but can also be interpreted as a boolean success/fail flag.

iterator erase(iterator);

The alternative erase() method removes the value pointed to by the iterator. The return value is the iterator to the next element. This method is consistent with the std::map::erase method.

Note: std::map also has an erase for a range denoted by two iterators. This doesn't make sense for hashes because they are unsorted and therefore there is no meaningful interpretation of a range of keys.So there is no equivalent erase method in the hash.

const T& operator[] (const K& key) const;
T& operator[] (const K& key);

The index operators take the key as an argument and return the data type. Thus for our example, they take a string as an argument and return an int.

The first form is for use on const hashes and is a read-only indexing operator. If you try to access a key that is not present, the out_of_range exception is thrown. If the key is present, then the data associated with it is returned as a const so that it cannot be modified.

The second form is for use on non-const hashes and can be used to read from or write to the hash. The behaviour is much like the [] operator for map. If the key is not present, then a new element is created with that key and a data field initialised with its default constructor. The data field is then returned. If the key is present, its data field is simply returned. Since the data is returned by reference, it can then be used as the target of an assignment to set the data field to a different value. For example:

table["one"] = 1;
table["two"] = table["one"] + 1;

It is important to remember that, as with the map, the index operator can change the contents of the hash. If you don't want this, either guard against it using present() or avoid the issue completely by using find().

const T& at(const K& key) const;
T* at_pointer(const K& key) const;

The hash::at() method has the same functionality as the hash::operator[] method in read-only mode, but does not have the catch that in read-write mode, [] creates a new value if the key is not found. The at() method throws an exception when the key is not found. The at_pointer() method returns a pointer to the value, rather than a reference and returns null if the key is not found.

const_iterator find(const K& key) const;
iterator find(const K& key);

The find() functions are the most flexible way of accessing elements in a hash, although the use of iterators can make them a bit clumsy. The first form works on const hashes and returns a const_iterator, which means you can look at the element but not change it. The second form works on non-const hashes, giving a non-const iterator, thus allowing the data field (not the key) to be changed.

If the key is not present in the hash, the iterator will be equal to end(). It is illegal to try to dereference an iterator equal to end, so make this check first. Dereferencing an iterator accesses the value type stored in the hash, which is in fact the same type used in the STL map - a std::pair containing a const key type and a data type. Dereferencing of iterators will be dealt with in the section on iterators.

Iterators

Iterators are used to access key/data pairs in the hash. Thus an iterator is used as the result of the find() function so that the elements can then be accessed as a pair via the iterator. Iterators can also be used to access all elements in the hash in turn.

An iterator acts like a pointer to the hash's value_type which is a pair:

typedef std::pair<const K, T>    value_type;

The key is accessed through the pair's 'first' element and the data through the second. For example:

string_int_hash table(50);
...
string_int_hash::iterator six = table.find("six");
if (six == table.end())
  std::cout << "key \"six\" not found in table" << std::endl;
else
  std::cout << "key \"six\" refers to " << six->first << " = " << six->second << std::endl;

Note how the iterator is dereferenced and then the pair's fields are accessed in the final line.

It is legal to assign to the data field via an iterator (i.e. six->second), but not the key. If the key were to be changed, that would change the hashing value and so the hash would be broken. Therefore it is disallowed by making that field const.

An iterator traversal of the hash is controlled by two functions with the same naming conventions as in STL classes:

const_iterator begin(void) const;
iterator begin(void);
const_iterator end(void) const;
iterator end(void);

These functions provide both const and non-const functions of the begin() and end() functions that can be used to visit all elements in a hash. The const form (using a const_iterator) should be used on a const hash, whilst the non-const form (using an iterator) should be used on a non-const hash. You need to match the type of iterator to the const-ness of the hash.

In use the iterators are much the same as the map iterators:


for (string_int_hash::iterator i = table.begin(); i != table.end(); i++)
{
  std::cout << "Key: " << i->first << ", data = " << i->second << std::endl;
}

The main difference between the hash and the map is that the hash will visit the elements in no particular order since there is no concept of a sort order for a hash, whereas the map is sorted. You should therefore not rely on any ordering when using iterators in this way.

The hash's iterators are unidirectional - that is, you can only traverse the elements in a forward direction from begin() to end(). There is only a "++" operator and not a "--" operator for an iterator and there are no reverse iterators. It seems to me that since there is no such thing as a sort ordering on a hash, there's no need for any other iteration order.

STLplus iterators are slightly different to STL iterators in that they know which container they belong to. When an iterator is first declared, but not assigned to, it is known as a null iterator. Then, when it is assigned to so that it points to a value in the hash, it is known as a valid iterator. If the iterator is made to point off the end of the hash - for example by incrementing to the value returned by the end() method above, then it becomes an end iterator.

This concept is similar to STL iterators like the list iterator, where an iterator that walks off the end of the list becomes an end iterator. However, the STL has no concept of a null iterator and therefore no direct equivalent of a valid iterator.

The iterator has a set of functions for testing these conditions:

bool iterator::null(void) const;
bool iterator::end(void) const;
bool iterator::valid(void) const;

Note that these functions are members of the iterator class, not the hash class.

Exactly one of these conditions will always be true, since the iterator must be one of null, end or valid.

Only a valid iterator can be dereferenced to access the element pair pointed to by the iterator. Any attempt to dereference an end or null iterator will result in an exception being thrown. See the section below on Exceptions to see which.

Diagnostic Print Routines

A hash table is very sensitive to the hashing function and the number of bins in the hash. To help diagnose performance problems with the hash class, there is a diagnostic print routine with three forms: a member function and two friend functions:

void debug_report(std::ostream&) const;

This method prints out a table showing the number of elements in each bin and some summary statistics. The table looks like this:

------------------------------------------------------------------------
| size:     7
| bins:     20
| loading:  0.35 auto-rehash
| occupied: 7 (35.0%), min = 0, max = 1
|-----------------------------------------------------------------------
|  bin         0     1     2     3     4     5     6     7     8     9
|        ---------------------------------------------------------------
|      0 |     1     .     1     1     .     1     .     .     .     .
|     10 |     .     .     .     1     .     .     .     1     1     .
------------------------------------------------------------------------

The size field is the number of elements in the hash. The bins is the number of bins. Loading represents the ratio of elements to bins and indicates whether the hash is configured for auto or manual rehashing. The occupied field gives stats on the number of bins occupied as a number and as a percentage, followed by the minimum and maximum occupancy of any bin in the hash. Finally, the table reports the number if elements in each bin so that the distribution can be seen.

Exceptions

There are 4 exceptions that can be thrown by hash:

stlplus::null_dereference - logic_error
This is used to indicate that an iterator which is null has been dereferenced. You should always be sure that iterators are non-null before using the * or -> operators.
stlplus::end_dereference - logic_error
This is used to indicate that an iterator that is pointing to an end element has been dereferenced. In line with STL conventions, the end iterator points to the element after the last element in a container, so dereferencing such an iterator is illegal (the memory pointed to will be unconstructed).
stlplus::wrong_object - logic_error
This is used to indicate that an iterator which was created by one container has been used in a different container. For example, if you have two hash objects in your program, create an iterator from one hash and then use it in another, then this will throw the wrong_object exception.
std::out_of_range - logic_error
This is used to indicate that the const version of the index operator[] has been used to try to access a value using a key that was not found in the hash.

The first three exceptions are common to all STLplus iterators and are explained in the STLplus exceptions policy.