The STL+ C++ Library Collection
This component is an STL-like template class which implements a hash table data structure. It has a similar usage and interface to the STL map.
The hash uses separate chaining of elements - in other words, there is a bin per hash value and all keys with that hash value are stored in the same bin. There is no collision resolution to this design, so insertions can never fail. However, if bins become too full, efficiency will drop. For this reason the hash implements rehashing.
This document assumes you know basically what a hash table is and only deals with how to use the hash class.
The hash template takes four types:
template<typename K, typename T, class H, class E = equal_to<K> > class hash
The first type K is the key type which is used to access elements of the hash table. A hash is in effect an associative store in which the key type is used like the index of an array and the data type is the contents of the element.
The second type T is the data type stored in the hash and accessed via the key.
The third type is the hash function object. A function object is used in the same way as in the map to define the hashing operation of the hash table. An example of a hash function object for a string key is:
struct hash_string
{
unsigned operator()(const string& s) const
{
...
}
};
The convention is to define a struct or class which contains a function call operator (the operator()) to perform the hashing function. For the hash class, the hashing function takes a key type K and returns an unsigned.
The fourth type is the equality function object. It is similar to the hash function object except that it takes two keys and returns a bool. The default value is the equal_to function object defined in the STL which simply applies the == operator to the two keys. You can therefore define key equality by an appropriate overload of the == operator for the key type. However, if that creates problems you can create a new function object for equality:
struct equal_string
{
bool operator()(const string& l, const string& r) const
{
...
}
};
Thus, to create a hash which takes a string as a key and an int as the data, you would declare a variable like this:
hash<string,int,hash_string> table;
This creates a variable called table which is a hash from string to int, using your hash_string function object for hashing the key, but using the default equal_to function object for comparisons.
The constructor for hash takes an optional argument which defines the number of bins to create in the hash table. If not specified, the hash chooses the number of bins. However, to override it, just give an argument to the constructor:
hash<string,int,hash_string> table(50);
This creates a table with 50 bins.
Note that the hash function can return any unsigned value, it does not have to be in the range 0..bins-1, since the hash class normalises the hash value to this range. This means the same hash function can be used with hashes of different sizes.
Rehashing is where the internal table in the hash is resized to better suit the number of data items. In other words, the number of bins is changed and that means the hash values of all the keys must be recalculated (thus the term rehash). It is a relatively expensive operation but increases the efficiency of the hash. It is pretty well essential if the initial data size of the hash is unknown.
The hash class supports two rehashing modes: auto-rehash and manual-rehash.
Auto-rehash mode means that the hash itself decides when to rehash itself. This rehashing is only performed on insertions which increase the loading (ratio of data elements in the table to the number of bins) over 100%.
Auto-rehash is enabled automatically if a hash is constructed with an unspecified number (or zero) bins. This tells the hash to choose the initial size and enable auto-rehashing. It can also be enabled explicitly using the auto_rehash() function.
To leave the choice of initial size to the hash and enable auto-rehash simply don't give an argument to the constructor:
hash<string,int,hash_string> table;
To choose the initial size yourself, but enable auto-rehashing:
hash<string,int,hash_string> table(50); table.auto_rehash();
The main disadvantage of auto-rehashing is that performing a rehash invalidates all iterators. Since auto-rehash can happen at any time (well, on any insert), this can be unpredictable.
Manual rehashing leaves rehashing decisions to the user, who may decide not to do any at all. This is appropriate for problems where the data size is known in advance and therefore the number of bins can be calculated before the hash is constructed. However, you may want to do rehashing but to make the rehashing decision yourself.
Constructing a hash with a specified number of bins automatically switches the hash to manual-rehashing mode.
hash<string,int,hash_string> table(50);
If you want to do your own rehashing, then a useful function is the loading() function which gives the current hash loading (size divided by number of bins) as a float. Thus you can choose to rehash at a specific point in your program by calling the rehash function explicitly when the loading gets too high:
if (table.loading() > 0.9) table.rehash();
The rehash() function takes an optional integer argument. If it is missing or zero, this means implement the built-in calculation for the number of bins. This simply doubles the number of bins. If it is called with a non-zero argument, then the hash will be rehashed to that number of bins.
The main disadvantage of manual hashing is where the data size is unknown and can grow large compared with the number of bins. Basically a loading of greater than 0.5 leads to efficiency creeping down and loadings of greater than 1.0 can really hit performance.
hash(const hash&); hash& operator=(const hash&); void erase(void); bool empty(void) const; unsigned size(void) const;
The hash class supports copying via the copy constructor and the assignment operator.
The erase() function deletes all elements in the hash, leaving it empty, but still with the same number of bins as when it was created.
The size() function gives the number of elements in the hash. This is very efficient since the hash keeps track of its size, so it is not necessary for it to count the number of elements. The empty() function tests for the special case of an empty hash, i.e. a hash with a size() == 0.
The hash class provides a set of operations which allow the contents of the hash to be created, modified and deleted. For the most common operations on a hash, there's typically at least two ways of doing it.
bool present(const K& key) const;
To test for the existence of a key in the hash, the present() function can be used. This takes a key as an argument and returns true if that key exists in the hash. It is also possible to test for the presence of a key using find() - see later.
iterator insert(const K& key, const T& data);
The insert() function adds a new key/data pair to the hash. If the key is already present in the hash, then the new value replaces the old value. It is also possible to insert a new element with the [] operator - see later. The insert() function returns an iterator to the new element so that it can be inspected or its data field modified. See later for details of iterator usage.
bool erase(const K& key);
The erase() function removes the key and its associated data from the hash. If the key was removed, the function returns true. However, if the key wasn't present and couldn't therefore be removed, it returns false. This return value is rarely useful but can have its uses.
const_iterator find(const K& key) const; iterator find(const K& key);
The find() functions are the most flexible way of accessing elements in a hash, although the use of iterators can make them a bit clumsy. The first form works on const hashes and returns a const_iterator, which means you can look at the element but not change it. The second form works on non-const hashes, giving a non-const iterator, thus allowing the data field (not the key) to be changed.
If the key is not present in the hash, the iterator will be equal to end(). It is illegal to try to dereference an iterator equal to end, so make this check first. Dereferencing an iterator accesses the value type stored in the hash, which is in fact the same type used in the STL map - a std::pair containing a const key type and a data type. Dereferencing of iterators will be dealt with in the section on iterators.
const T& operator[] (const K& key) const; T& operator[] (const K& key);
The index operators take the key as an argument and return the data type. Thus for our example, they take a string as an argument and return an int.
The first form is for use on const hashes and is a read-only indexing operator. If you try to access a key that is not present, an assertion failure will happen. If the key is present, then the data associated with it is returned as a const so that it cannot be modified.
The second form is for use on non-const hashes and can be used to read from or write to the hash. The behaviour is much like the [] operator for map. If the key is not present, then a new element is created with that key and a data field initialised with its default constructor. The data field is then returned. If the key is present, its data field is simply returned. Since the data is returned by reference, it can then be used as the target of an assignment to set the data field to a different value. For example:
table["one"] = 1; table["two"] = table["one"] + 1;
It is important to remember that, as with the map, the index operator can change the contents of the hash. If you don't want this, either guard against it using present() or avoid the issue completely by using find().
Iterators are used to access key/data pairs in the hash. Thus an iterator is used as the result of the find() function so that the elements can then be accessed as a pair via the iterator. Iterators can also be used to access all elements in the hash in turn.
An iterator acts like a pointer to the hash's value_type which is a pair:
typedef std::pair<const K, T> value_type;
The key is accessed through the pair's 'first' element and the data through the second. For example:
hash<string,int,hash_string>::iterator six = table.find("six");
if (six == table.end())
std::cout << "key \"six\" not found in table" << std::endl;
else
std::cout << "key \"six\" refers to " << six->first << " = " << six->second << std::endl;
Note how the iterator is dereferenced and then the pair's fields are accessed in the final line.
It is legal to assign to the data field via an iterator (i.e. six->second), but not the key. If the key were to be changed, that would change the hashing value and so the hash would be broken. Therefore it is disallowed by making that field const.
An iterator traversal of the hash is controlled by two functions with the same naming conventions as in STL classes:
const_iterator begin(void) const; iterator begin(void); const_iterator end(void) const; iterator end(void);
These functions provide both const and non-const functions of the begin() and end() functions that can be used to visit all elements in a hash. The const form (using a const_iterator) should be used on a const hash, whilst the non-const form (using an iterator) should be used on a non-const hash. You need to match the type of iterator to the const-ness of the hash.
In use the iterators are much the same as the map iterators:
for (hash<string,int,hash_string>::iterator i = table.begin(); i != table.end(); i++)
{
std::cout << "Key: " << i->first << ", data = " << i->second << std::endl;
}
The main difference between the hash and the map is that the hash will visit the elements in no particular order since there is no concept of a sort order for a hash, whereas the map is sorted. You should therefore not rely on any ordering when using iterators in this way.
The hash's iterators are unidirectional - that is, you can only traverse the elements in a forward direction from begin() to end(). There is only a "++" operator and not a "--" operator for an iterator and there are no reverse iterators. It seems to me that since there is no such thing as a sort ordering on a hash, there's no need for any other iteration order.
STLplus iterators are slightly different to STL iterators in that they know which container they belong to. When an iterator is first declared, but not assigned to, it is known as a null iterator. Then, when it is assigned to so that it points to a node in the tree, it is known as a valid iterator. If the iterator is made to point off the end of the hash - for example by incrementing to the value returned by the end() method above, then it becomes an end iterator.
This concept is similar to STL iterators like the list iterator, where an iterator that walks off the end of the list becomes an end iterator. However, the STL has no concept of a null iterator and therefore no direct equivalent of a valid iterator.
The iterator has a set of functions for testing these conditions:
bool iterator::null(void) const; bool iterator::end(void) const; bool iterator::valid(void) const;
Note that these functions are members of the iterator class, not the hash class.
Exactly one of these conditions will always be true, since the iterator must be one of null, end or valid.
Only a valid iterator can be dereferenced to access the element pair pointed to by the iterator. Any attempt to dereference an end or null iterator will result in an exception being thrown. See the section below on Exceptions to see which.
A hash table is very sensitive to the hashing function and the number of bins in the hash. To help diagnose performance problems with the hash class, there is a diagnostic print routine with three forms: a member function and two friend functions:
void debug_report(otext&, unsigned indent = 0) const; template<typename K, typename T, class H, class E> otext& print(otext& str, const hash& table, unsigned indent = 0); template<typename K, typename T, class H, class E> otext& operator << (otext& str, const hash& table);
These functions print out a table showing the number of elements in each bin and some summary statistics. The table looks like this:
------------------------------------------------------------------------ | size: 7 | bins: 20 | loading: 0.35 auto-rehash | occupied: 7 (35.0%), min = 0, max = 1 |----------------------------------------------------------------------- | bin 0 1 2 3 4 5 6 7 8 9 | --------------------------------------------------------------- | 0 | 1 . 1 1 . 1 . . . . | 10 | . . . 1 . . . 1 1 . ------------------------------------------------------------------------
There are three exceptions that can be thrown by hash, all indicating a misuse of the iterators:
* or ->
operators.These exceptions are common to all STLplus iterators and are explained in the STLplus exceptions policy.