persistent - A Data Persistence Subsystem

Introduction

Persistence is the ability to dump a data structure to disk and then restore it again later, either in the same run of the program, a later run of the program, or even in a different program. It is an easy way to save a program's state or to communicate information in a structural form between programs.

In fact, persistence is not limited to disk dumps, since the same idea can be used to transfer information from one program to another down a pipe or even an Internet connection. In effect you can use persistence to communicate a data structure of any complexity between two programs even if they are running on different computers under different operating systems.

At a more basic level, persistence does away with the need to design file formats. Instead, just design a data structure to carry the information required between the programs and then make that data structure persistent. The file format is designed for you by the persistence subsystem.

The persistent format is a binary format and so is extremely efficient both in data size and in CPU time required. By contrast, text formats tend to be dominated by the processing required to convert integer values between their machine form (2's-complement binary) and the text form (sign-magnitude decimal). This problem doesn't occur with the persistence format, which dumps and restores in the native binary form.

The purpose of the data persistence subsystem is to provide a toolkit which makes it easy if not trivial to make a data structure persistent. However, it is not totally automatic - C++ is too flexible a language to be able to take any data structure and just dump it. This is why the approach has been to provide a toolkit out of which persistence routines can be written.

The toolkit provides a set of functions for dumping and restoring a wide range of types. All the basic C types are made persistent, as are C++ types like string and complex. However, the real power of the persistence functions is that template functions are provided for making all of the STL and STLplus container classes persistent.

The idea is that a container is made persistent by dumping its contents using a dump routine for the contained data type. For example, a vector of strings is dumped by dumping vector-specific information and then repeatedly calling the dump routine for string. The restore function restores the vector and then repeatedly calls the restore function for string to restore the vector's contents.

The same concept is applied to all the container classes. Therefore, to make a container persistent, all you have to do is supply dump and restore functions for the contained type. If the contained type is a basic C or C++ type, then these functions are already provided and the data structure is already persistent.
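The delegation from a container to its contained type can be sketched with nothing more than the standard library. This is not the STLplus implementation: the names sketch_dump, sketch_restore, sketch_dump_vector and sketch_restore_vector are hypothetical, plain std::ostream and std::istream stand in for the context objects, and the byte layout (a native size_t count followed by the elements) is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical element-level routines for string: length, then the characters.
void sketch_dump(std::ostream& out, const std::string& s)
{
  std::size_t size = s.size();
  out.write(reinterpret_cast<const char*>(&size), sizeof size);
  out.write(s.data(), static_cast<std::streamsize>(size));
}

void sketch_restore(std::istream& in, std::string& s)
{
  std::size_t size = 0;
  in.read(reinterpret_cast<char*>(&size), sizeof size);
  s.resize(size);
  if (size > 0) in.read(&s[0], static_cast<std::streamsize>(size));
}

// Container-level routines: vector-specific information (the element count),
// then one call to the element's own routine per element.
template<typename T>
void sketch_dump_vector(std::ostream& out, const std::vector<T>& v)
{
  std::size_t size = v.size();
  out.write(reinterpret_cast<const char*>(&size), sizeof size);
  for (std::size_t i = 0; i < size; ++i)
    sketch_dump(out, v[i]);
}

template<typename T>
void sketch_restore_vector(std::istream& in, std::vector<T>& v)
{
  std::size_t size = 0;
  in.read(reinterpret_cast<char*>(&size), sizeof size);
  v.clear();
  v.resize(size);
  for (std::size_t i = 0; i < size; ++i)
    sketch_restore(in, v[i]);
}

// Dumps a vector of strings into an in-memory buffer and restores it again.
bool vector_round_trip()
{
  std::vector<std::string> original;
  original.push_back("red");
  original.push_back("");
  original.push_back("green");

  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch_dump_vector(buffer, original);
  assert(buffer.good());

  std::vector<std::string> copy;
  sketch_restore_vector(buffer, copy);
  return copy == original;
}
```

The vector routines know nothing about strings. Supplying sketch_dump and sketch_restore for another element type would make a vector of that type work in the same way.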

Infrastructure

The dump operation is controlled by an object of type dump_context which is defined in persistent.hpp. This is initialised with an output device (any derivative of otext) and is then passed down through the hierarchy of dump routines. At the end of the dump, the output device can be checked to see if an output error occurred. Here's a typical example of how to dump a data structure:

oftext output(filename);
dump_context dumper(output);
dump(dumper, data);

In this example, you can see how an output file is created (oftext is an output file device - see fileio). Then the dump_context object is initialised with this output device. Then the dump function for the data structure is called. The output device should be closed, but this will be done by its destructor.

Similarly, a restore operation is controlled by a restore_context object. This is initialised with an input device. Here's an example of how to restore a data structure:

iftext input(filename);
restore_context restorer(input);
restore(restorer, data);

The TextIO device must be in binary mode for persistence to work correctly. The context object automatically places the TextIO device into binary mode when it is passed to the context's constructor, so you don't have to worry about that issue. Be careful however to ensure that binary mode is used if you transmit dumped data over networks - some programs (such as FTP) may try to convert data that looks like line endings to the 'correct' form for the operating system - corrupting the persistent data irretrievably.

Persistence of Basic Types

To start with, I'll demonstrate how to dump and restore a simple data type containing only simple C types. The following class will be used for the demonstration:

class point
{
private:
  int m_x;
  int m_y;
  int m_z;
public:
...
};

The required parameter profile of the dump/restore functions is:

void dump(dump_context&, const type&);
void restore(restore_context&, type&);

These functions should be declared as stand-alone functions and not methods. In this case this will be done by making them friends of the class, meaning they are not methods but can access the data members even though the members are declared as private.

So, here is the point class with the persistence functions' declarations added:

class point
{
private:
  int m_x;
  int m_y;
  int m_z;
public:
...
  friend void dump(dump_context& context, const point& pt);
  friend void restore(restore_context& context, point& pt);
};

The dump and restore functions are written using the existing dump and restore functions for int, the type used for the three dimensions of a point:

void dump(dump_context& context, const point& pt)
{
  dump(context,pt.m_x);
  dump(context,pt.m_y);
  dump(context,pt.m_z);
}

void restore(restore_context& context, point& pt)
{
  restore(context,pt.m_x);
  restore(context,pt.m_y);
  restore(context,pt.m_z);
}

Note that neither the dump nor the restore function does any file I/O itself. It is all delegated to the pre-written functions provided in persistent.hpp for type int.
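To see the whole round trip in one place, here is a minimal stand-alone model of the same pattern. It is only a sketch: sketch_point, sketch_dump and sketch_restore are hypothetical names, standard streams stand in for dump_context and restore_context, and the int routines simply copy the native bytes, which is an assumption about the format and not a description of what persistent.hpp actually writes.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical primitive routines: the only place where any I/O happens.
void sketch_dump(std::ostream& out, const int& data)
{
  out.write(reinterpret_cast<const char*>(&data), sizeof data);
}

void sketch_restore(std::istream& in, int& data)
{
  in.read(reinterpret_cast<char*>(&data), sizeof data);
}

class sketch_point
{
private:
  int m_x;
  int m_y;
  int m_z;
public:
  explicit sketch_point(int x = 0, int y = 0, int z = 0) : m_x(x), m_y(y), m_z(z) {}
  bool operator==(const sketch_point& other) const
  {
    return m_x == other.m_x && m_y == other.m_y && m_z == other.m_z;
  }
  friend void sketch_dump(std::ostream& out, const sketch_point& pt);
  friend void sketch_restore(std::istream& in, sketch_point& pt);
};

// The class-level routines only delegate to the int routines, field by field.
void sketch_dump(std::ostream& out, const sketch_point& pt)
{
  sketch_dump(out, pt.m_x);
  sketch_dump(out, pt.m_y);
  sketch_dump(out, pt.m_z);
}

void sketch_restore(std::istream& in, sketch_point& pt)
{
  sketch_restore(in, pt.m_x);
  sketch_restore(in, pt.m_y);
  sketch_restore(in, pt.m_z);
}

// Dumps a point into an in-memory buffer and restores it into a second object.
bool point_round_trip()
{
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch_point original(1, -2, 3);
  sketch_dump(buffer, original);
  assert(buffer.good());

  sketch_point copy;
  sketch_restore(buffer, copy);
  return copy == original;
}
```

The fields must be restored in exactly the order in which they were dumped, since nothing in the stream identifies which field is which.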

Persistence of Enumeration Types

Enumeration types are essentially small integers. However, each enumeration is considered to be a different type by the compiler, so they cannot be treated as simple integer types - you get a compilation error. The solution that I supply is a pair of template functions that adapt themselves to the type of the enum being made persistent. The functions are:

template<typename T>
void dump_enum(dump_context& str, const T& data) throw(persistent_dump_failed);
template<typename T>
void restore_enum(restore_context& str, T& data) throw(persistent_restore_failed);

Consider the following example. The enum defines a traffic light sequence:

enum traffic_lights {red, red_amber, green, amber};

This can be used with dump_enum and restore_enum directly, but it is better style to write dump and restore functions that call the template functions, thus hiding the use of the template:

void dump(dump_context& context, const traffic_lights& lights)
{
  dump_enum(context, lights);
}

void restore(restore_context& context, traffic_lights& lights)
{
  restore_enum(context, lights);
}
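One plausible way for such a template pair to work is to convert the enumeration to an int on the way out and back again on the way in. The sketch below assumes exactly that. It is not the STLplus source: sketch_dump_enum and sketch_restore_enum are hypothetical names, and standard streams stand in for the context objects.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>

enum traffic_lights {red, red_amber, green, amber};

// Hypothetical: dump any enumeration by converting it to an int first.
template<typename T>
void sketch_dump_enum(std::ostream& out, const T& data)
{
  int value = static_cast<int>(data);
  out.write(reinterpret_cast<const char*>(&value), sizeof value);
}

// Hypothetical: restore the int and convert it back to the enumeration type.
template<typename T>
void sketch_restore_enum(std::istream& in, T& data)
{
  int value = 0;
  in.read(reinterpret_cast<char*>(&value), sizeof value);
  data = static_cast<T>(value);
}

// Dumps one value into an in-memory buffer and returns what is restored.
traffic_lights enum_round_trip(traffic_lights lights)
{
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  sketch_dump_enum(buffer, lights);
  assert(buffer.good());

  traffic_lights restored = red;
  sketch_restore_enum(buffer, restored);
  return restored;
}
```

Because the template parameter is deduced from the argument, the same pair serves every enumeration in a program without further code.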

Persistence of Multi-Level Types

A real data structure of course has many layers. The persistence functions are designed to be used in a layered way. The dump/restore functions written above can be used stand-alone to dump a single point, but they can also be used to dump a point stored as part of a different data structure. In this way, dump and restore routines can be built up a layer at a time.

The example will represent an edge as two points:

class edge
{
private:
  point m_begin;
  point m_end;
public:
...
  friend void dump(dump_context& context, const edge& e);
  friend void restore(restore_context& context, edge& e);
};

Once again the dump/restore functions can be written in terms of the dump/restore functions for the data members:

void dump(dump_context& context, const edge& e)
{
  dump(context,e.m_begin);
  dump(context,e.m_end);
}

void restore(restore_context& context, edge& e)
{
  restore(context,e.m_begin);
  restore(context,e.m_end);
}

In this case, to dump an edge means dumping two points which uses the dump function for the point class written in the last section. This layering can be continued ad infinitum.

Persistence of Templates

The template classes provided by the STL and the template classes provided by STLplus have been made persistent using template dump/restore functions. Because of problems with overloading of template functions in Visual C++, the functions are actually called dump_class and restore_class where class is the name of the template class. For example, the persistence functions for the STL map are called dump_map and restore_map.

The persistence functions for templates are themselves templates, so are automatically adapted to the type that the container holds. For example, dump_vector which is the dump routine for the STL vector, will adapt to the type being held in the vector. If the vector contains int, then the dump_vector function will dump ints by calling the dump function defined for int. If the vector contains edges (defined in the last section) then the dump_vector function will dump edges. The template function requires that there is a function called dump for the element type of the vector. If there isn't one already, you need to write one.

To demonstrate, a vector of edges will be used. In this case we need a dump function for a single edge. This has already been written in the last section. Therefore, the dump function for a vector of edges is very simple to write, as is the restore function:

void dump(dump_context& context, const vector<edge>& e)
{
  dump_vector(context, e);
}

void restore(restore_context& context, vector<edge>& e)
{
  restore_vector(context, e);
}

Persistence of Iterators

I have not been able to implement a general solution to the problem of persistent iterators for STL templates. However, I have added persistence to the iterators for the STLplus template classes ntree and digraph.

The ntree class has three types of iterator - a simple iterator and the traversal iterators prefix_iterator and postfix_iterator. All of these have been made persistent by the addition of template dump and restore functions:

// simple iterators
template<typename T, typename TRef, typename TPtr>
void dump_ntree_iterator(dump_context&, const ntree_iterator<T,TRef,TPtr>&)
  throw(persistent_dump_failed);

template<typename T, typename TRef, typename TPtr>
void restore_ntree_iterator(restore_context&, ntree_iterator<T,TRef,TPtr>&)
  throw(persistent_restore_failed);

// prefix iterators
template<typename T, typename TRef, typename TPtr>
void dump_ntree_prefix_iterator(dump_context&, const ntree_prefix_iterator<T,TRef,TPtr>&)
  throw(persistent_dump_failed);

template<typename T, typename TRef, typename TPtr>
void restore_ntree_prefix_iterator(restore_context&, ntree_prefix_iterator<T,TRef,TPtr>&)
  throw(persistent_restore_failed);

// postfix iterators
template<typename T, typename TRef, typename TPtr>
void dump_ntree_postfix_iterator(dump_context&, const ntree_postfix_iterator<T,TRef,TPtr>&)
  throw(persistent_dump_failed);

template<typename T, typename TRef, typename TPtr>
void restore_ntree_postfix_iterator(restore_context&, ntree_postfix_iterator<T,TRef,TPtr>&)
  throw(persistent_restore_failed);

As with other template classes, the convention is to write dump/restore functions for a specific template instantiation in terms of these template functions. For example, given a tree of strings, the following functions would be used to make the iterator persistent:

void dump(dump_context& context, const ntree<string>::iterator& i)
{
  dump_ntree_iterator(context, i);
}

void restore(restore_context& context, ntree<string>::iterator& i)
{
  restore_ntree_iterator(context, i);
}

There is a restriction: the ntree must be dumped before any iterators are dumped - if not, an exception will be thrown.

Similarly, digraph node and arc iterators are made persistent by the following template functions:

// node iterators
template<typename NT, typename AT, typename NRef, typename NPtr>
void dump_digraph_iterator(dump_context& str, const digraph_iterator<NT,AT,NRef,NPtr>& data)
  throw(persistent_dump_failed);

template<typename NT, typename AT, typename NRef, typename NPtr>
void restore_digraph_iterator(restore_context& str, digraph_iterator<NT,AT,NRef,NPtr>& data)
  throw(persistent_restore_failed);

// arc iterators
template<typename NT, typename AT, typename NRef, typename NPtr>
void dump_digraph_arc_iterator(dump_context& str, const digraph_arc_iterator<NT,AT,NRef,NPtr>& data)
  throw(persistent_dump_failed);

template<typename NT, typename AT, typename NRef, typename NPtr>
void restore_digraph_arc_iterator(restore_context& str, digraph_arc_iterator<NT,AT,NRef,NPtr>& data)
  throw(persistent_restore_failed);

The same restriction applies: the digraph must be dumped before any iterators are dumped - if not, an exception will be thrown.

Persistence of Simple Pointers

Pointers are a special problem for persistent data types because there may be more than one pointer to the same object in a data structure. If this was dumped in a naive way, there would be two identical copies of the object in the dump, rather than one object and two pointers to it. It would also be impossible to dump a structure with back pointers because the dump mechanism would get stuck in an infinite recursion. The key to dumping such a structure is to determine the primary structure and dump that. Then, in a second pass, dump secondary links such as back pointers and cross links.

The dump function for a pointer will dump the contents of the pointer on the first visit to the object along with a unique magic key that identifies the object pointed to. The second time the object is visited (for example due to a back pointer), only the magic key is dumped. On restoration, on restoring the object itself, the object is added to a map along with its magic key. When the magic key is found again in the input stream, it is converted by the map into a pointer to the restored object.

The importance of dumping the primary structure first should be clear from this - dumping the primary links first causes the data structures to be dumped in this pass since each object will be visited for the first time. When back pointers or cross pointers are dumped, all the objects they are pointing to have already been dumped so only magic keys get dumped.

There is a template function pair dump_pointer/restore_pointer which implements this algorithm. It assumes that a pointer points to a single object (for example, an int* points to an int). Pointers to arrays of objects cannot be supported in this way and will need to be hand-implemented (you need to know the size of the array as well to be able to dump it). You should be using vectors anyway!

The one exception is char* which is treated as a null terminated array and not a pointer to a single char. The char* persistence functions are not templates and so have the simple names dump/restore. Multiple pointers to the same char array will be dumped once and the same magic key method used as for pointers to other types.

Thus, dump_pointer will dump a magic key to the file and then, if this is the first visit, it will call dump on the object being pointed to. You need to provide that dump function if it doesn't already exist. Similarly, the restore function restores the magic key and checks to see if it is a new key. If it is, the function restores the contents of the pointer. If it is an already-restored key, then it is simply mapped onto its target object.
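The magic key algorithm can be modelled in a few lines. The sketch below is an illustration under stated assumptions, not the STLplus code: the context structs, sketch_dump_pointer and sketch_restore_pointer are hypothetical, keys are simply allocated in order of first visit, and key 0 is reserved here for a null pointer.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>

// Hypothetical stand-ins for the contexts: a stream plus the key map.
struct sketch_dump_context
{
  std::ostream& out;
  std::map<const void*, unsigned> keys;    // object address -> magic key
  explicit sketch_dump_context(std::ostream& o) : out(o) {}
};

struct sketch_restore_context
{
  std::istream& in;
  std::map<unsigned, void*> objects;       // magic key -> restored object
  explicit sketch_restore_context(std::istream& i) : in(i) {}
};

void sketch_dump(sketch_dump_context& c, const int& data)
{
  c.out.write(reinterpret_cast<const char*>(&data), sizeof data);
}

void sketch_restore(sketch_restore_context& c, int& data)
{
  c.in.read(reinterpret_cast<char*>(&data), sizeof data);
}

// Always dump the key; dump the contents only on the first visit.
template<typename T>
void sketch_dump_pointer(sketch_dump_context& c, const T* ptr)
{
  unsigned key = 0;                        // assumption: 0 means a null pointer
  bool first_visit = false;
  if (ptr)
  {
    std::map<const void*, unsigned>::iterator found = c.keys.find(ptr);
    if (found == c.keys.end())
    {
      key = static_cast<unsigned>(c.keys.size()) + 1;
      c.keys[ptr] = key;
      first_visit = true;
    }
    else
      key = found->second;
  }
  c.out.write(reinterpret_cast<const char*>(&key), sizeof key);
  if (first_visit) sketch_dump(c, *ptr);
}

// Always restore the key; create and restore the object only for a new key.
template<typename T>
void sketch_restore_pointer(sketch_restore_context& c, T*& ptr)
{
  unsigned key = 0;
  c.in.read(reinterpret_cast<char*>(&key), sizeof key);
  if (key == 0) {ptr = 0; return;}
  std::map<unsigned, void*>::iterator found = c.objects.find(key);
  if (found != c.objects.end()) {ptr = static_cast<T*>(found->second); return;}
  ptr = new T();
  c.objects[key] = ptr;                    // record before restoring the contents
  sketch_restore(c, *ptr);
}

// Two dumps of one pointer must restore as two pointers to one object.
bool shared_pointer_round_trip()
{
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  int* shared = new int(42);
  sketch_dump_context dumper(buffer);
  sketch_dump_pointer(dumper, shared);     // first visit: key and contents
  sketch_dump_pointer(dumper, shared);     // second visit: key only
  assert(buffer.good());
  delete shared;

  sketch_restore_context restorer(buffer);
  int* first = 0;
  int* second = 0;
  sketch_restore_pointer(restorer, first);
  sketch_restore_pointer(restorer, second);
  bool ok = first != 0 && first == second && *first == 42;
  delete first;
  return ok;
}
```

Recording the new object in the map before its contents are restored is what would let a back pointer inside those contents resolve to the object that is still being restored.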

Persistence of Smart Pointers

The STLplus smart pointer classes are a special case of template container classes in that they contain pointers to objects, whereas most template containers contain objects themselves. Therefore, persistence of smart pointers is implemented by calling the persistence functions for pointers.

There are two interpretations of pointers though: a simple pointer to an object of a known type, and a polymorphic pointer which has the type of a pointer to a superclass but which can in fact point to any subclass of the pointer type. These two interpretations are handled by two variants of the smart pointer classes. The smart_ptr variant is intended for use with simple pointers and so uses the persistence functions for simple pointers (see Persistence of Simple Pointers). The smart_ptr_clone variant, which is designed for pointing to polymorphic types, uses the persistence functions for polymorphic pointers (see Persistence of Polymorphic Classes using Interfaces).

Persistence of Polymorphic Classes using Interfaces

In C++ you can define a superclass and then derive subclasses from it (some people prefer the terminology base class for superclass and derived class for subclass). This set of classes based on a common superclass is referred to as a set of Polymorphic classes.

Polymorphic classes are manipulated through a pointer to the superclass. The pointer can then point to any object of any subclass of the common superclass. Subclass-specific operations are provided through the use of virtual functions.

If this is still making no sense, you need to read a book on C++ since the purpose of this document is to explain the STLplus, not to teach C++ basics. Otherwise, the rest of this section is on how to make Polymorphic classes persistent.

Polymorphic classes represent a problem for persistence. So far all the persistence functions have used knowledge of the exact type of the object at compile time to select the correct overloaded dump or restore function. However, with polymorphism, only the superclass is known from the type of the pointer. The actual subclass being pointed to is unknown at compile time and must be determined at run time. This means that run-time type information must be used to determine the type. This is usually achieved by defining virtual methods.

This is the solution used to make polymorphic types persistent - although there is an alternative implementation that uses callback functions instead which is described in the next section.

The set of virtual functions used to make a class persistent is defined by an interface called persistent. To make a polymorphic class persistent, the first stage is to derive the base class of your family of polymorphic classes from this interface.

class base : public persistent

The persistent interface defines two abstract methods that you must provide for all subclasses to be made persistent:

class persistent : public clonable
{
public:
  virtual void dump(dump_context&) const throw(persistent_dump_failed) = 0;
  virtual void restore(restore_context&)  throw(persistent_restore_failed) = 0;
};

However, you can see that this in turn inherits the clonable interface which allows copying of polymorphic types:

class clonable
{
public:
  virtual clonable* clone(void) const = 0;
};

This method is also required by the smart_ptr_clone container which is also used to store polymorphic classes, so once you've made a class persistent, you've automatically made it suitable for use in this smart pointer.

In order to demonstrate the way polymorphic classes are made persistent, consider the following noddy example:

class base
{
  int m_value;
public:
  base(int value = 0) : m_value(value) {}
  virtual ~base(void) {}

  virtual int value (void) const {return m_value;}
  virtual void set(int value = 0) {m_value = value;}
};

class derived : public base
{
  string m_image;
public:
  derived(int value = 0) : base(value), m_image(to_string(value)) {}
  derived(string value) : base(to_int(value)), m_image(value) {}
  virtual ~derived(void) {}

  virtual void set(int value = 0) {m_image = to_string(value); base::set(value);}
};

In order to make these two classes persistent, the base class must inherit from the persistent interface and then both classes must have the three abstract methods clone, dump and restore added.

Here are these classes with the additions:

class base : public persistent
{
  int m_value;
public:
  base(int value = 0) : m_value(value) {}
  virtual ~base(void) {}

  virtual int value (void) const {return m_value;}
  virtual void set(int value = 0) {m_value = value;}

  clonable* clone(void) const
    {
      return new base(*this);
    }
  void dump(dump_context& context) const throw(persistent_dump_failed)
    {
      ::dump(context,m_value);
    }
  void restore(restore_context& context) throw(persistent_restore_failed)
    {
      ::restore(context,m_value);
    }
};

class derived : public base
{
  string m_image;
public:
  derived(int value = 0) : base(value), m_image(to_string(value)) {}
  derived(string value) : base(to_int(value)), m_image(value) {}
  virtual ~derived(void) {}

  virtual void set(int value = 0) {m_image = to_string(value); base::set(value);}

  clonable* clone(void) const
    {
      return new derived(*this);
    }
  void dump(dump_context& context) const throw(persistent_dump_failed)
    {
      base::dump(context);
      ::dump(context,m_image);
    }
  void restore(restore_context& context) throw(persistent_restore_failed)
    {
      base::restore(context);
      ::restore(context,m_image);
    }
};

Note the use of a common trick here. The subclass derived dumps its superclass by simply calling the superclass's dump method (in this case, base::dump). This is in keeping with the general C++ convention that subclasses should not use knowledge of the internals of the superclass. This convention is easy to follow: call the dump/restore method of the immediate superclass of the subclass first, then dump/restore the subclass-specific data.

The solution for persistence of Polymorphic classes requires that every derivative class be registered with the dump_context or restore_context before the dump or restore operation commences. Furthermore, where there are many polymorphic types being handled, the order of registration must be the same for the restore operation as it was for the dump operation.

Consider first the dump operation. The dump_context class provides the following method for registration:

  unsigned short dump_context::register_interface(const std::type_info& info);

This is called once for each polymorphic type to be dumped. So, for the example above it is called twice:

    dump_context context(output);

    context.register_interface(typeid(base));
    context.register_interface(typeid(derived));

The typeid operator is built into C++ and returns a std::type_info object describing a type or expression, from which the type name can be obtained as a char*. This is mapped internally onto a magic key which is an integer value unique to that subclass. The return value of the register_interface method is the magic key for that type and is used in the dump to differentiate between the different classes. There's no real reason for capturing this key except maybe for debugging the data stream. Keys are allocated in the order of registration of class types. This is why class types must be registered in the same order for both the dump and restore operations.

For the restore operation it is necessary to register a sample object of the class. This is because the restore operation creates objects of the class by cloning the sample. The sample is stored in a smart_ptr_clone:

typedef smart_ptr_clone<persistent> persistent_ptr;

The restore_context class provides the following registration function:

  unsigned short restore_context::register_interface(const persistent_ptr&);

The objects are registered in the same order as the types were registered into the dump context, because it is this ordering that provides the mapping from the unique key used in the dump to the correct sample object used in the restore. During the dump, the class base was registered first, then class derived. The sample objects are therefore registered in the same order for the restore:

    restore_context context(input);

    context.register_interface(base());
    context.register_interface(derived());
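Why the order matters can be seen in a small model of the two registries. Everything in it is hypothetical (sketch_base, sketch_derived and the two registry classes are illustrative stand-ins, and the sketch uses std::type_index and std::unique_ptr from C++11 for brevity). It assumes only what is stated above: keys are allocated in order of registration, and the restore side creates objects by cloning the sample registered under the same key.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <vector>

class sketch_base
{
public:
  virtual ~sketch_base() {}
  virtual sketch_base* clone() const {return new sketch_base(*this);}
  virtual std::string name() const {return "base";}
};

class sketch_derived : public sketch_base
{
public:
  virtual sketch_base* clone() const {return new sketch_derived(*this);}
  virtual std::string name() const {return "derived";}
};

// Dump side: maps a run-time type onto a key allocated in registration order.
class sketch_dump_registry
{
  std::map<std::type_index, unsigned short> m_keys;
public:
  unsigned short register_interface(const std::type_info& info)
  {
    unsigned short key = static_cast<unsigned short>(m_keys.size() + 1);
    m_keys[std::type_index(info)] = key;
    return key;
  }
  // Returns 0 for a type that was never registered.
  unsigned short key_of(const sketch_base& object) const
  {
    std::map<std::type_index, unsigned short>::const_iterator found =
      m_keys.find(std::type_index(typeid(object)));
    if (found == m_keys.end()) return 0;
    return found->second;
  }
};

// Restore side: maps a key back onto a sample object that is cloned on demand.
class sketch_restore_registry
{
  std::vector<std::unique_ptr<sketch_base> > m_samples;
public:
  unsigned short register_interface(const sketch_base& sample)
  {
    m_samples.push_back(std::unique_ptr<sketch_base>(sample.clone()));
    return static_cast<unsigned short>(m_samples.size());
  }
  std::unique_ptr<sketch_base> create(unsigned short key) const
  {
    if (key == 0 || key > m_samples.size()) return std::unique_ptr<sketch_base>();
    return std::unique_ptr<sketch_base>(m_samples[key - 1]->clone());
  }
};

// Reports which class the restore side creates for a dumped sketch_derived.
std::string restored_name(bool same_order)
{
  sketch_dump_registry dumper;
  dumper.register_interface(typeid(sketch_base));
  dumper.register_interface(typeid(sketch_derived));

  sketch_restore_registry restorer;
  if (same_order)
  {
    restorer.register_interface(sketch_base());
    restorer.register_interface(sketch_derived());
  }
  else
  {
    restorer.register_interface(sketch_derived());
    restorer.register_interface(sketch_base());
  }

  sketch_derived object;
  unsigned short key = dumper.key_of(object);   // the key that would go in the stream
  assert(key != 0);
  std::unique_ptr<sketch_base> copy = restorer.create(key);
  return copy ? copy->name() : "unregistered";
}
```

With matching orders the key for sketch_derived selects the sketch_derived sample. With the orders swapped the same key silently selects the sketch_base sample, so the restore would build an object of the wrong class without reporting an error.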

An alternative way of registering these interfaces is to wrap their registration up in an installer function. This installer can then be used to install all classes in a single step.

In fact, two installer functions are required - one for dumping and one for restoring. It is up to you to check that these installer functions register their classes in the same order. The type profiles for these installer functions are:

void (*dump_context::installer)(dump_context&);
void (*restore_context::installer)(restore_context&);

In other words, the installer type for a dump_context is a pointer to a function that takes a dump_context& and returns void. Similarly the installer type for a restore_context is a pointer to a function that takes a restore_context& and returns void. For the above example they might look like this:

void make_base_persistent(dump_context& context)
{
  context.register_interface(typeid(base));
  context.register_interface(typeid(derived));
}

void make_base_persistent(restore_context& context)
{
  context.register_interface(base());
  context.register_interface(derived());
}

The functions can be called whatever you like, but I prefer to give them the same name and use overload resolution to pick the right one according to the type profile. In use, after creating a dump or restore context, call the method register_all with the above installer as the argument. For example, using the earlier example again, rewritten to use an installer:

    dump_context context(output);
    context.register_all(make_base_persistent);

Now that the classes are registered, the actual dump and restore of a superclass pointer is handled by the following functions:

template<typename T>void dump_interface(dump_context& str, const T*& data);
template<typename T>void restore_interface(restore_context& str, T*& data);

For example, given the above example using classes base and derived, specific dump and restore functions can be written that simply call the above template functions:

void dump(dump_context& context, const base*& ptr)
{
  dump_interface(context,ptr);
}

void restore(restore_context& context, base*& ptr)
{
  restore_interface(context,ptr);
}

Note: since polymorphic types are handled in C++ via pointers, the same behaviour is implemented for multiple pointers to the same object as was implemented for simple pointers. When two pointers to the same object are dumped, they will be restored as pointers to the same object.

Alternatively, a smart_ptr_clone can be used. This class is specifically designed to point to a polymorphic type which uses the clonable interface. Furthermore, the persistence functions for smart_ptr_clone call the persistence functions for polymorphic types using the clonable interface. For example, say you have the following type declarations:

typedef smart_ptr_clone<base> base_ptr;
typedef vector<base_ptr> base_vector;

These types can be made persistent in the usual way, by creating layers of functions called dump and restore building up from the low-level contained type to the composite type by calling the template functions for vector and smart_ptr_clone.

We already have persistence of base* handled by the interfaces registered above. To support smart_ptr_clone<base>, which contains a base*, is simply a case of writing a function that calls the template dump/restore for the smart pointer class:

void dump(dump_context& context, const base_ptr& ptr)
{
  dump_smart_ptr_clone(context,ptr);
}

void restore(restore_context& context, base_ptr& ptr)
{
  restore_smart_ptr_clone(context,ptr);
}

The final stage is to make a vector of these persistent:

void dump(dump_context& context, const base_vector& vec)
{
  dump_vector(context,vec);
}

void restore(restore_context& context, base_vector& vec)
{
  restore_vector(context,vec);
}

Persistence of Polymorphic Classes using Callbacks

The previous section described how polymorphic types could be made persistent in an object-oriented way through inheritance and virtual methods. However, it is not always possible to use this approach. For example, you might want to make a class persistent that you cannot change. Therefore an alternative solution is needed that uses a non-intrusive approach to persistence. In order to achieve this non-intrusive approach, I have provided the option to use dump and restore callbacks to perform the persistence functionality and not virtuals. The callbacks are associated with the subclass, which can be determined at run time. The callbacks are stored in the dump_context object during the dump and in the restore_context object during a restore.

However, this is still not a complete solution. During restore, it is necessary to create an object of the right subclass before its restore callback can be called. There is no concept of a virtual constructor in C++, nor is there a means of creating an object of any type from, say, the name of the type. The solution uses create callbacks rather than sample objects. A create callback is a function that, when called, creates an object and returns a pointer to it. In order to make the method as general as possible, the create callback returns this pointer as a void*.

Thus, the non-intrusive solution to persistence of polymorphic types requires no changes to existing classes - no extra virtual functions for example. However, the cost of this solution is that it does require three callback functions to be written for each subclass to be made persistent.
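The role of the create callback can be illustrated in isolation. This is a sketch under assumptions, not the library's mechanism: sketch_base, sketch_derived, the callback typedef and the vector used as a key table are all hypothetical. One detail is worth noting: each callback here converts to the base class pointer before converting to void*, so that the caller's cast back to the base class is well defined.

```cpp
#include <cassert>
#include <vector>

class sketch_base
{
public:
  virtual ~sketch_base() {}
  virtual int id() const {return 1;}
};

class sketch_derived : public sketch_base
{
public:
  virtual int id() const {return 2;}
};

// Hypothetical callback type: creates an object and returns it as a void*.
typedef void* (*sketch_create_callback)(void);

void* create_sketch_base(void)
{
  return static_cast<sketch_base*>(new sketch_base);
}

void* create_sketch_derived(void)
{
  // Convert to the base pointer first, then to void*.
  return static_cast<sketch_base*>(new sketch_derived);
}

// Looks up the create callback for a key, builds the object and reports
// which class was actually created. Returns 0 for an unknown key.
int created_id(unsigned short key)
{
  std::vector<sketch_create_callback> creators;
  creators.push_back(create_sketch_base);      // key 1
  creators.push_back(create_sketch_derived);   // key 2
  if (key == 0 || key > creators.size()) return 0;

  sketch_base* object = static_cast<sketch_base*>(creators[key - 1]());
  assert(object != 0);
  int result = object->id();
  delete object;
  return result;
}
```

The table plays the part of a virtual constructor: the key read from the stream selects a function, and that function is the only code that needs to know the concrete class.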

In order to demonstrate the way polymorphic classes are made persistent, consider the following noddy example:

class base
{
  int m_value;
public:
  base(int value = 0) : m_value(value) {}
  virtual ~base(void) {}

  virtual int value (void) const {return m_value;}
  virtual void set(int value = 0) {m_value = value;}
};

class derived : public base
{
  string m_image;
public:
  derived(int value = 0) : base(value), m_image(to_string(value)) {}
  derived(string value) : base(to_int(value)), m_image(value) {}
  virtual ~derived(void) {}

  virtual void set(int value = 0) {m_image = to_string(value); base::set(value);}
};

In order to make these two classes persistent, each one must have three callbacks added. These callbacks can be completely separate from the classes if it is not possible to change the class definitions, but typically it is easier to add the functions as friends of the class so that they have direct access to the data fields. The three functions are the create, dump and restore callbacks. The convention is to call them create_class, dump_class and restore_class, where class is the name of the class that they act on.

The parameter profiles of the three callbacks are:

void dump_class(dump_context& context, const void* data);
void* create_class(void);
void restore_class(restore_context& context, void*& data);

For this example, these functions are added to the classes as friends:

class base
{
  ...
  friend void dump_base(dump_context& context, const void* data)
    {
      dump(context,((const base*)data)->m_value);
    }
  friend void* create_base(void)
    {
      return new base;
    }
  friend void restore_base(restore_context& context, void*& data)
    {
      restore(context,((base*)data)->m_value);
    }
};

class derived : public base
{
  ...
  friend void dump_derived(dump_context& context, const void* data)
    {
      dump_base(context,data);
      const derived* derived_data = (const derived*)data;
      dump(context,derived_data->m_image);
    }
  friend void* create_derived(void)
    {
      return new derived;
    }
  friend void restore_derived(restore_context& context, void*& data)
    {
      restore_base(context,data);
      derived* derived_data = (derived*)data;
      restore(context,derived_data->m_image);
    }
};

Note the use of a common trick here. The subclass derived dumps its superclass by simply calling the superclass's callback (in this case, dump_base). This is in keeping with the general C++ convention that subclasses should not use knowledge of the internals of the superclass. This convention is easy to follow: call the dump/restore callback of the immediate superclass of the subclass first, then dump/restore the subclass-specific data.

The solution for persistence of polymorphic classes requires that every polymorphic class be registered with the dump_context or restore_context before the dump or restore operation commences. Furthermore, where several polymorphic types are being handled, the order of registration must be the same for the restore operation as it was for the dump operation.

Consider first the dump operation. The dump_context class provides the following method for registration:

  unsigned short dump_context::register_type(const std::type_info& info, dump_callback);

This is called once for each polymorphic type to be dumped. So, for the example above it is called twice:

    dump_context context(output);

    context.register_type(typeid(base),dump_base);
    context.register_type(typeid(derived),dump_derived);

The typeid operator is built into C++ and yields a std::type_info object for a type or expression, from which the type name can be obtained as a char*. This is mapped internally onto a magic key which is an integer value unique to that subclass. The return value of the register_type method is the magic key for that type and is used in the dump to differentiate between the different classes. There's no real reason for capturing this key except maybe for debugging the data stream. Keys are allocated in the order of registration of class types. This is why class types must be registered in the same order for both the dump and restore operations.
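As a rough illustration of why registration order matters, here is a minimal sketch of key allocation. The type_registry class and its lookup method are hypothetical stand-ins written for this example, not the real dump_context, and starting the keys at 1 is an assumption.

```cpp
#include <map>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for the registration half of a dump context.
// It only shows how keys could be handed out in registration order.
class type_registry
{
public:
  // Register a type and return its key; registering the same type twice returns the same key
  unsigned short register_type(const std::type_info& info)
  {
    const std::string name = info.name();
    std::map<std::string, unsigned short>::const_iterator found = m_keys.find(name);
    if (found != m_keys.end()) return found->second;
    const unsigned short key = static_cast<unsigned short>(m_keys.size() + 1);
    m_keys[name] = key;
    return key;
  }

  // Look up the key for a type, returning 0 if it was never registered
  unsigned short lookup(const std::type_info& info) const
  {
    std::map<std::string, unsigned short>::const_iterator found = m_keys.find(info.name());
    return found == m_keys.end() ? 0 : found->second;
  }

private:
  std::map<std::string, unsigned short> m_keys;
};

// Two polymorphic classes to register
struct base { virtual ~base() {} };
struct derived : base {};
```

Looking up typeid of a base reference that actually refers to a derived object gives the key of derived, which is how a dump could record the real class of the object being written.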

For the restore operation it is necessary to register both a create callback and a restore callback with the restore context. The restore_context class provides the following registration function:

  unsigned short restore_context::register_type(create_callback,restore_callback);

The callbacks are registered in the same order as the types were registered into the dump context, because it is this ordering that provides the mapping from the unique key used in the dump to the correct create callback used in the restore. During the dump, the class base was registered first, then class derived. The callbacks are therefore registered in the same order for the restore:

    restore_context context(input);

    context.register_type(create_base,restore_base);
    context.register_type(create_derived,restore_derived);
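
The restore side can be sketched in the same spirit. The create_registry class below is a hypothetical stand-in, not the real restore_context. It assumes keys start at 1 and are allocated in registration order, and it leaves out the restore callback to keep the example short.

```cpp
#include <cstddef>
#include <vector>

struct base { virtual ~base() {} virtual int id() const { return 1; } };
struct derived : base { int id() const { return 2; } };

typedef void* (*create_callback)(void);

// Both callbacks return the address of the base part of the object,
// so the caller can safely cast the void* back to base*
void* create_base(void) { base* object = new base; return object; }
void* create_derived(void) { base* object = new derived; return object; }

// Hypothetical stand-in for the registration half of a restore context
class create_registry
{
public:
  // Keys are handed out in registration order, starting at 1
  unsigned short register_type(create_callback create)
  {
    m_creators.push_back(create);
    return static_cast<unsigned short>(m_creators.size());
  }

  // Create an object from a key read from the data stream; null if the key is unknown
  void* create(unsigned short key) const
  {
    if (key == 0 || key > m_creators.size()) return 0;
    return m_creators[key - 1]();
  }

private:
  std::vector<create_callback> m_creators;
};
```

If the two create callbacks were registered in the opposite order, key 2 would create a base rather than a derived, which is exactly the mismatch that consistent ordering avoids.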

An alternative way of registering these callbacks is to wrap their registration up in an installer function. This installer can then be used to install all callbacks in a single step.

In fact, two installer functions are required - one for dumping and one for restoring. It is up to you to check that these installer functions install their callbacks in the same order. The type profiles for these installer functions are:

void (*dump_context::installer)(dump_context&);
void (*restore_context::installer)(restore_context&);

In other words, the installer type for a dump_context is a pointer to a function that takes a dump_context& and returns void. Similarly the installer type for a restore_context is a pointer to a function that takes a restore_context& and returns void. For the above example they might look like this:

void make_base_persistent(dump_context& context)
{
  context.register_type(typeid(base),dump_base);
  context.register_type(typeid(derived),dump_derived);
}
void make_base_persistent(restore_context& context)
{
  context.register_type(create_base,restore_base);
  context.register_type(create_derived,restore_derived);
}

The functions can be called whatever you like, but I prefer to give them the same name and use overload resolution to pick the right one according to the type profile. In use, after creating a dump or restore context, call the method register_all with the above installer as the argument. For example, using the earlier example again, rewritten to use an installer:

    dump_context context(output);
    context.register_all(make_base_persistent);

Now that the callbacks are registered, the actual dump and restore of a superclass pointer is handled by the following functions:

template<typename T>void dump_polymorph(dump_context& str, const T*& data);
template<typename T>void restore_polymorph(restore_context& str, T*& data);

For example, given the above example using classes base and derived, specific dump and restore functions can be written that simply call the above template functions:

void dump(dump_context& context, const base*& ptr)
{
  dump_polymorph(context,ptr);
}

void restore(restore_context& context, base*& ptr)
{
  restore_polymorph(context,ptr);
}

Note: since polymorphic types are handled in C++ via pointers, the same behaviour is implemented for multiple pointers to the same object as was implemented for simple pointers. When two pointers to the same object are dumped, they will be restored as pointers to the same object.
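
One way this kind of pointer sharing can be preserved is sketched below. This is not the library's implementation: pointer_dumper and pointer_restorer are hypothetical names, the "stream" is just a vector of int, and the scheme of writing an id followed by the value on first sight only is an assumption made for the example.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Dump side: each pointer is written as an id; the value follows only the first time
class pointer_dumper
{
public:
  void dump(const int* ptr)
  {
    if (!ptr) { m_stream.push_back(0); return; }   // id 0 means a null pointer
    std::map<const int*, int>::const_iterator found = m_ids.find(ptr);
    if (found != m_ids.end()) { m_stream.push_back(found->second); return; }
    const int id = static_cast<int>(m_ids.size()) + 1;
    m_ids[ptr] = id;
    m_stream.push_back(id);
    m_stream.push_back(*ptr);
  }
  const std::vector<int>& stream() const { return m_stream; }

private:
  std::map<const int*, int> m_ids;
  std::vector<int> m_stream;
};

// Restore side: an id seen before yields the object already created for it
class pointer_restorer
{
public:
  explicit pointer_restorer(const std::vector<int>& stream) : m_stream(stream), m_next(0) {}

  int* restore()
  {
    const int id = m_stream[m_next++];
    if (id == 0) return 0;
    std::map<int, int*>::const_iterator found = m_objects.find(id);
    if (found != m_objects.end()) return found->second;
    int* object = new int(m_stream[m_next++]);   // the caller owns the new object
    m_objects[id] = object;
    return object;
  }

private:
  std::vector<int> m_stream;
  std::size_t m_next;
  std::map<int, int*> m_objects;
};
```

Dumping the same address twice writes the value once, and restoring twice returns the same newly created object both times.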

Shortcut Functions

There is a set of template functions defined in the persistent.hpp header that encapsulate a common use of persistence. The functions assume that you have built up your family of dump and restore functions so that an entire data structure can be dumped by simply calling a function called dump at the top level. Similarly the data structure can be restored by simply calling restore at the top level. The shortcut functions also support the use of an installer function as described in the section on polymorphic types. This reduces the process of dumping to common targets to a one-line function call.

File-Based Persistence

Probably the most useful shortcut functions are the pair dump_to_file/restore_from_file:

template<typename T>
void dump_to_file(const T& source, const std::string& filename, dump_context::installer installer)
  throw(persistent_dump_failed,persistent_illegal_type);
template<typename T>
void restore_from_file(const std::string& filename, T& result, restore_context::installer installer)
  throw(persistent_restore_failed,persistent_illegal_type);

To dump a data structure to a file, simply call dump_to_file with the first argument being the source data structure to be dumped, the second argument being the name of the file to dump to and the final argument being an installer function for registering any polymorphic types. The last argument can be null if there are no polymorphic types to register.

Similarly, to restore the same data structure, simply call restore_from_file with the name of the file as the first argument (conceptually, the first argument is the source and the second the destination) and the data structure to be restored as the second. Again the third argument is an installer function for restoring polymorphic types and may be null.

Here's an example that dumps and restores a vector of string to and from a file. First, I need to write a dump/restore pair of functions that make a vector of string persistent:

void dump(dump_context& context, const vector<string>& data)
{
  dump_vector(context, data);
}

void restore(restore_context& context, vector<string>& data)
{
  restore_vector(context, data);
}

Now here's a trivial application that takes the command-line arguments represented by argv and puts them into a vector of strings, then dumps them to a file:

int main (int argc, char* argv[])
{
  if (argc == 1)
    ferr << "usage: " << argv[0] << " <strings>" << endl;
  else
  {
    vector<string> source;
    for (int i = 1; i < argc; i++)
      source.push_back(string(argv[i]));
    dump_to_file(source, "strings.dat", 0);
  }
  return 0;
}

Here's a complementary application that restores the file and prints the results to standard output:

int main (int argc, char* argv[])
{
  if (argc != 1)
    ferr << "usage: " << argv[0] << endl;
  else
  {
    vector<string> copy;
    restore_from_file("strings.dat", copy, 0);
    fout << "restored text: " << vector_to_string(copy, ",") << endl;
  }
  return 0;
}

String-Based Persistence

Sometimes you want to create an in-memory dump of a data structure rather than dumping to a file. For example, this would be a starting point for a routine for transferring a data structure across the internet using data persistence as the mechanism. This is done by dumping to and restoring from a string:

template<typename T>
void dump_to_string(const T& source, std::string& result, dump_context::installer installer)
  throw(persistent_dump_failed,persistent_illegal_type);
template<typename T>
void restore_from_string(const std::string& source, T& result, restore_context::installer installer)
  throw(persistent_restore_failed,persistent_illegal_type);

This is very similar to the previous section's file-based persistence, except that the target of the dump is the string itself. Note that the std::string class is capable of storing binary data since it does not rely on null termination to work properly. A C char* could not be used in this way (but it's obsolete anyway, so no worries mate).
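
To make the point about binary data concrete, here is a small standard C++ sketch. The function name make_binary is made up for the example and it does not use the persistence library at all.

```cpp
#include <cstring>
#include <string>

// Build a string holding four bytes, two of which are zero
std::string make_binary()
{
  const char bytes[4] = { 'a', '\0', 'b', '\0' };
  // The (pointer, length) constructor copies all four bytes, zeros included
  return std::string(bytes, sizeof(bytes));
}
```

The std::string reports a size of 4 and still holds 'b' at index 2, whereas strlen on the same data stops at the first zero byte and reports 1.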

To dump a data structure to a string, simply call dump_to_string with the first argument being the source data structure to be dumped, the second argument being the string to dump to and the final argument being an installer function for registering any polymorphic types. The last argument can be null if there are no polymorphic types to register.

Similarly, to restore the same data structure, simply call restore_from_string with the string containing the dumped data as the first argument and the data structure to be restored as the second. Again the third argument is an installer function for restoring polymorphic types and may be null.

To illustrate this, I'll use the same example as above for file-based persistence. This example dumps and restores a vector of string to and from a string. Since I've already written the dump/restore pair of functions for the previous example, there's no need to do it again.

Now here's a trivial application that takes the command-line arguments represented by argv and puts them into a vector of strings, then dumps them to a string, restores them from that string and finally compares them to confirm that the two data structures are identical:

int main (int argc, char* argv[])
{
  if (argc == 1)
    ferr << "usage: " << argv[0] << " <strings>" << endl;
  else
  {
    vector<string> source;
    for (int i = 1; i < argc; i++)
      source.push_back(string(argv[i]));
    string binary;
    dump_to_string(source, binary, 0);
    vector<string> copy;
    restore_from_string(binary, copy, 0);
    if (source != copy)
      ferr << "ERROR - restored data is different" << endl;
    else
      ferr << "success - restored data is the same" << endl;
  }
  return 0;
}

TextIO-Based Persistence

The above two short-cuts are in fact specialisations of the most general short-cut functions that dump to and restore from any TextIO device. This more general form is useful if you want to use other I/O devices than the most common ones of files and in-memory strings.

The functions are:

template<typename T>
void dump_to_device(const T& source, otext& result, dump_context::installer installer)
  throw(persistent_dump_failed,persistent_illegal_type);
template<typename T>
void restore_from_device(itext& source, T& result, restore_context::installer installer)
  throw(persistent_restore_failed,persistent_illegal_type);

To dump a data structure to an output device, simply call dump_to_device with the first argument being the source data structure to be dumped, the second argument being the device to dump to and the final argument being an installer function for registering any polymorphic types. The last argument can be null if there are no polymorphic types to register.

Similarly, to restore the same data structure, simply call restore_from_device with the device containing the dumped data as the first argument and the data structure to be restored as the second. Again the third argument is an installer function for restoring polymorphic types and may be null.

To illustrate this, here are the bodies of the dump_to_file and restore_from_file functions described earlier to show how they have in fact been implemented as calls to these two general-purpose functions:

template<typename T>
void dump_to_file(const T& source, const std::string& filename, dump_context::installer installer)
  throw(persistent_dump_failed,persistent_illegal_type)
{
  oftext output(filename);
  dump_to_device(source, output, installer);
}

template<typename T>
void restore_from_file(const std::string& filename, T& result, restore_context::installer installer)
  throw(persistent_restore_failed,persistent_illegal_type)
{
  iftext input(filename);
  restore_from_device(input, result, installer);
}

So, dump_to_file is implemented by creating a file output device (class oftext) and then calling dump_to_device. You can implement your own dump_to_xxx and restore_from_xxx functions by simply implementing TextIO devices for input from and output to xxx.

Include Files

The persistence functions for basic C types and STL containers are defined in persistent.hpp, as are the infrastructure classes dump_context/restore_context. The STLplus container classes have the persistence functions built-in.

The following table details which types have persistence, where the persistence functions are defined (i.e. which header to include) and the names of the functions. In this table, uppercase characters such as T are used to represent template argument types, so that vector<T> means any vector, with type T as its template parameter.

Persistence Functions
Type                   Library  Include         Function Names
char                   C        persistent.hpp  dump/restore
signed char            C        persistent.hpp  dump/restore
unsigned char          C        persistent.hpp  dump/restore
short                  C        persistent.hpp  dump/restore
unsigned short         C        persistent.hpp  dump/restore
int                    C        persistent.hpp  dump/restore
unsigned               C        persistent.hpp  dump/restore
long                   C        persistent.hpp  dump/restore
unsigned long          C        persistent.hpp  dump/restore
inf                    STLplus  inf.hpp         dump/restore
enum{}                 C        persistent.hpp  dump_enum/restore_enum
float                  C        persistent.hpp  dump/restore
double                 C        persistent.hpp  dump/restore
T* (simple)            C/C++    persistent.hpp  dump_pointer/restore_pointer
T* (polymorphic)       C++      persistent.hpp  dump_polymorph/restore_polymorph
smart_ptr<T>           STLplus  smart_ptr.hpp   dump_smart_ptr/restore_smart_ptr
smart_ptr_clone<T>     STLplus  smart_ptr.hpp   dump_smart_ptr_clone/restore_smart_ptr_clone
char*                  C        persistent.hpp  dump/restore
string                 STL      persistent.hpp  dump/restore
basic_string<T>        STL      persistent.hpp  dump_basic_string/restore_basic_string
bitset<N>              STL      persistent.hpp  dump_bitset/restore_bitset
complex<T>             STL      persistent.hpp  dump_complex/restore_complex
deque<T>               STL      persistent.hpp  dump_deque/restore_deque
list<T>                STL      persistent.hpp  dump_list/restore_list
vector<T>              STL      persistent.hpp  dump_vector/restore_vector
pair<T1,T2>            STL      persistent.hpp  dump_pair/restore_pair
triple<T1,T2,T3>       STLplus  triple.hpp      dump_triple/restore_triple
foursome<T1,T2,T3,T4>  STLplus  foursome.hpp    dump_foursome/restore_foursome
hash<K,T,H,E>          STLplus  hash.hpp        dump_hash/restore_hash
map<K,T>               STL      persistent.hpp  dump_map/restore_map
multimap<K,T>          STL      persistent.hpp  dump_multimap/restore_multimap
set<T>                 STL      persistent.hpp  dump_set/restore_set
multiset<T>            STL      persistent.hpp  dump_multiset/restore_multiset
digraph<N,A>           STLplus  digraph.hpp     dump_digraph/restore_digraph
matrix<T>              STLplus  matrix.hpp      dump_matrix/restore_matrix
ntree<T>               STLplus  ntree.hpp       dump_ntree/restore_ntree

Excluded Classes

Note that I have not done the container adaptors queue, priority_queue and stack because their interfaces are too restricted to allow dump and restore routines to be written without burgling the data structure. This means that I will never do them because it is impossible!

When designing a data structure to be made persistent, you need to bear this in mind and use containers such as vector and list rather than queue or stack.

I also haven't implemented any STL iterators. The design of iterators makes it nearly impossible to do this without burgling the data structure, which wouldn't be portable.

Exceptions

The persistence subsystem uses exceptions to indicate errors, in line with the STLplus exceptions policy.

An exception is thrown since there is no conceivable recovery method that would allow the dump or restore to complete successfully.

The convention is that a dump function throws the persistent_dump_failed exception and the restore function throws the persistent_restore_failed exception if an error is detected in the file format, but that it should keep going where possible.

In addition, if you try to dump or restore a polymorphic type that hasn't had its callbacks registered in advance, the exception persistent_illegal_type will be thrown. The same exception is used for both dump and restore.

The first two exceptions (persistent_dump_failed and persistent_restore_failed) are subclasses of std::runtime_error. The exception persistent_illegal_type is a subclass of std::logic_error to reflect the fact that this can only happen due to a programming error. All are subclasses of std::exception so can be caught by catching this superclass.
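
The practical effect of this hierarchy can be sketched with standard C++ alone. The two classes below are stand-ins that only mirror the base classes described above; their constructors are assumptions, and the classify function is a hypothetical helper written for this example.

```cpp
#include <stdexcept>
#include <string>

// Stand-in mirroring the description above: a run-time failure
class persistent_restore_failed : public std::runtime_error
{
public:
  explicit persistent_restore_failed(const std::string& what) : std::runtime_error(what) {}
};

// Stand-in mirroring the description above: a programming error
class persistent_illegal_type : public std::logic_error
{
public:
  explicit persistent_illegal_type(const std::string& what) : std::logic_error(what) {}
};

// Throw one of the two exceptions and report which handler caught it
std::string classify(bool illegal_type)
{
  try
  {
    if (illegal_type) throw persistent_illegal_type("type not registered");
    throw persistent_restore_failed("bad data");
  }
  catch (const std::logic_error&)   { return "programming error"; }
  catch (const std::runtime_error&) { return "run-time failure"; }
  catch (const std::exception&)     { return "other"; }
}
```

A caller that does not care about the distinction can keep only the last handler, since both classes derive from std::exception.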

Incidental Issues

This section discusses issues that you don't need to know about but which might give useful insights into how persistence works.

Byte Order - Endian-ness

A problem that can occur when communicating between machines is the problem of byte-order. Different machine architectures store data using two different byte orders. This is referred to as Big- and Little-Endian Byte Ordering.

In both conventions, the address of an integer type points to the left end of the word but:

Big-Endian
The most significant byte is on the left end of a word
Little-Endian
The least significant byte is on the left end of a word

Bytes are addressed left to right, so in big-endian order byte 0 is the msB, whereas in little-endian order byte 0 is the lsB. For example, Intel-based machines store data in little-endian byte order so byte 0 is the lsB. Sun Sparc architectures are big-endian, so byte 0 is the msB.

The persistence functions solve the problem of inter-platform communication by always writing integers msB first so that the format is platform-independent.
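
Writing an integer msB first can be sketched as follows. This shows only the general technique on a fixed 32-bit value; the function names dump_u32 and restore_u32 are made up, and the actual layout used by the persistence functions may well differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append a 32-bit value to the buffer, most significant byte first.
// Shifting works on the value, not on memory, so the result is the
// same whatever the byte order of the machine.
void dump_u32(std::string& buffer, std::uint32_t value)
{
  for (int shift = 24; shift >= 0; shift -= 8)
    buffer += static_cast<char>((value >> shift) & 0xFF);
}

// Read a 32-bit value back from four bytes starting at offset
std::uint32_t restore_u32(const std::string& buffer, std::size_t offset)
{
  std::uint32_t value = 0;
  for (std::size_t i = 0; i < 4; i++)
    value = (value << 8) | static_cast<unsigned char>(buffer[offset + i]);
  return value;
}
```

Dumping 0x12345678 produces the bytes 0x12, 0x34, 0x56, 0x78 in that order on both big-endian and little-endian machines.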

File Format Versions

The concept of file format versions was added for STLplus 1.0. The file format version of a dump is written to the dump file. When the file is restored, the version is the first thing read from the file. The idea is that, if the persistent dump format changes, then the format number changes. This will mean that it is possible to either support old file formats by branching on the format number read from the file, or at least detect them and raise an error if the old format is no longer supported. Also, if an old program tries to read a new format, it will fail but in a way that makes it easy to diagnose the problem.

You do not need to know about these format numbers unless you are personally responsible for writing dump/restore routines and then only if you ever need to change the file format for a particular data type. For example, the introduction of format numbers coincided with a change in the way all integer types are dumped. The old integer format is not supported.

The format version applies to the persistence file format, not the particular layout of your own data structures. If you want that level of fine-grain control, then give your own data structures format numbers as well.
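
Giving your own data structure a format number might look something like the sketch below. Everything here is hypothetical: the settings structure, the idea that height was added in format 2, and the one-byte-per-field layout in a std::string are all assumptions chosen to keep the example short and free of library calls.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical structure: format 1 stored only width, format 2 added height
struct settings { int width; int height; };

const unsigned settings_format = 2;

// Dump always writes the current format number first.
// Fields are written as single bytes, so values must be in the range 0..255.
void dump_settings(std::string& buffer, const settings& data)
{
  buffer += static_cast<char>(settings_format);
  buffer += static_cast<char>(data.width);
  buffer += static_cast<char>(data.height);
}

// Restore branches on the format number read from the dump
settings restore_settings(const std::string& buffer)
{
  if (buffer.empty()) throw std::runtime_error("empty dump");
  const unsigned format = static_cast<unsigned char>(buffer[0]);
  if (format < 1 || format > settings_format)
    throw std::runtime_error("unsupported settings format");
  const std::size_t fields = (format == 1) ? 1 : 2;
  if (buffer.size() < 1 + fields) throw std::runtime_error("truncated dump");
  settings result;
  result.width = static_cast<unsigned char>(buffer[1]);
  // Old dumps have no height field, so fall back to a default of 0
  result.height = (format >= 2) ? static_cast<unsigned char>(buffer[2]) : 0;
  return result;
}
```

An old format 1 dump still restores, with the missing field defaulted, while a dump with an unknown format number is rejected with a clear error rather than being misread.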

Examples

Single-Level Structures

This example shows how to make a multimap persistent. This is a one-layer data structure because the multimap only contains the basic types int and string (conceptually a string is an atomic type, even if its implementation just happens to be quite complicated - don't confuse implementation with concept).

The example is based on a test program which is used to test the persistence functions. It creates a data structure, dumps it to a file, restores the file into another data structure and then confirms that the two structures are identical.

The following definition is used to define two data structures that map an int onto a string:

  multimap<int,string> data, restored;

The object called data will be used to store the data to be saved in a file, whilst the object called restored will be used to restore the data. It is then possible to compare the two to verify that they are the same.

First, I fill the map with a random amount of random data, just to demonstrate the data persistence:

#define MAX_SIZE 2877
#define MAX_NUM 15254
...
  // seed the random number generator with a different value each run (this is a common trick)
  srand(time(0));
  // select the random map size to generate
  const unsigned number = (unsigned)rand() % MAX_NUM;
  for (unsigned i = 0; i < number; i++)
  {
    // select a random key to add to the map
    int key = rand();
    // select random characters to add to the data string
    const unsigned size = (unsigned)rand() % MAX_SIZE;
    string value;
    for  (unsigned j = 0; j < size; j++)
    {
      char ch = (char)rand();
      value += ch;
    }
    // finally, add the key/data pair to the multimap
    data.insert(make_pair(key,value));
  }

So, the multimap contains random integer keys mapped onto random length strings of random data.

No functions need to be written to implement persistence of this data structure! The pre-defined persistence functions can do the whole job (see the table in the last section). The dump_multimap function dumps the map by calling dump on the key and data types. The key type is int, which already has a dump function defined. The data type is string, which also has a dump function defined.
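
The way a container template delegates to the dump functions of its element types can be sketched like this. It is a simplified stand-in, not the real dump_multimap: the context is a plain std::string (named sketch_context here), and a readable text layout is used instead of the real binary format purely so that the result is easy to inspect.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-in "context": a text buffer, unlike the real binary dump_context
typedef std::string sketch_context;

// dump for int: the number followed by a separator
void dump(sketch_context& context, int data)
{
  context += std::to_string(data) + ";";
}

// dump for string: the length, then the characters
void dump(sketch_context& context, const std::string& data)
{
  dump(context, static_cast<int>(data.size()));
  context += data;
}

// The container template writes the element count, then calls whichever
// dump overloads exist for the key type and the data type
template<typename K, typename T>
void dump_multimap(sketch_context& context, const std::multimap<K,T>& data)
{
  dump(context, static_cast<int>(data.size()));
  for (typename std::multimap<K,T>::const_iterator i = data.begin(); i != data.end(); ++i)
  {
    dump(context, i->first);
    dump(context, i->second);
  }
}
```

Because the template only ever calls dump, any key or data type that has a dump function of its own can be used without changing the container code.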

The first stage in saving this data structure to file is to create a dump_context which needs to be attached to a TextIO output device. In this case I'll choose to save the dump to file:

  oftext out ("test_map.tmp");
  dump_context dumper(out);

Now, the data structure can be dumped to this file:

  dump_multimap(dumper,data);
  out.close();

In this example, the output file is explicitly closed because I'm about to read it straight back in again. To read the file, a restore_context needs to be created:

  iftext in ("test_map.tmp");
  restore_context restorer(in);

Now the data structure can be restored, in this case to a different object:

  restore_multimap(restorer,restored);

The rest of the program just compares the two data structures to confirm they are identical. I don't need to go into that here.

In practice, it is clearer if you do in fact write a trivial pair of functions called dump and restore to hide the use of the template functions. This also means that you can always remember the name of the persistence functions for any type you have designed - because they are always called dump and restore. The functions are:

void dump(dump_context& context, const multimap<int,string>& data)
{
  dump_multimap(context, data);
}

void restore(restore_context& context, multimap<int,string>& data)
{
  restore_multimap(context, data);
}

Multi-level Structures

This example will show how to make a data structure with more than one level of structure persistent.

The example uses a vector of a user-defined class and makes it persistent. It shows how to add persistence functions to a user-defined class so that it can be used with the pre-defined vector persistence functions.

The example requires the following set of includes. The reason for each include will be explained as the example unfolds:

#include <string>
#include <vector>
#include "stlplus.hpp"
using namespace std;

The user-defined data structure is a class for storing email addresses. The class without persistence functions is:

class address
{
private:
  string m_name;
  string m_email;
  int m_age;
public:
  address(void) : m_age(0) { }
  address(const string& name, const string& email, int age) : m_name(name), m_email(email), m_age(age) {}

  const string& name(void) const {return m_name;}
  const string& email(void) const {return m_email;}
  int age(void) const {return m_age;}
};

To add persistence, it is only necessary to add a dump and restore function which use the pre-defined dump and restore for string and int. These are found in the header persistent.hpp. The functions are added to the class as friend functions so that they can access the private data fields directly:

class address
{
  ...
  friend void dump(dump_context& str, const address& data)
    {
      dump(str, data.m_name);
      dump(str, data.m_email);
      dump(str, data.m_age);
    }

  friend void restore(restore_context& str, address& data)
    {
      restore(str, data.m_name);
      restore(str, data.m_email);
      restore(str, data.m_age);
    }
};

The next stage is to define an address book, which is simply an unsorted vector of addresses:

typedef vector<address> address_book;

This type is already persistent - there is a pre-defined pair of template functions dump_vector and restore_vector defined in persistent.hpp. However, it is more consistent to provide overloaded non-template dump and restore functions for the address_book type:

void dump(dump_context& str, const address_book& data)
{
  dump_vector(str, data);
}

void restore(restore_context& str, address_book& data)
{
  restore_vector(str, data);
}

The following test program shows how an address book can be created and dumped, then restored to another address_book object:

int main(int argc, char* argv[])
{
  // create and populate an address book
  address_book addresses;
  addresses.push_back(address("Andy Rushton", "ajr1@ecs.soton.ac.uk", 40));
  addresses.push_back(address("Andrew Brown", "adb@ecs.soton.ac.uk", 85));
  addresses.push_back(address("Mark Zwolinski", "mz@ecs.soton.ac.uk", 21));

  // dump the address book
  oftext out ("test.tmp");
  dump_context dumper(out);
  dump(dumper,addresses);
  out.close();

  // restore the address book to a different object
  address_book restored;
  iftext in ("test.tmp");
  restore_context restorer(in);
  restore(restorer,restored);

  return 0;
}

In this case I'm using persistence to a file, so I've used the FileIO devices oftext and iftext defined in fileio.hpp.

It would be useful to be able to print out the contents of the address book before and after the dump/restore. To do this I'll use the family of print functions defined in the various utilities headers. These follow the same conventions as the persistence functions - there is a print function for each basic type and then template print_class functions for each template class. Like the dump_class and restore_class functions, these cannot be overloaded (VC++ cannot handle overloaded templates), so for example the print function for vector is called print_vector. It is declared in string_utilities.hpp. The print functions for basic types are also declared in string_utilities.hpp.

The following functions are added to the address class to make it printable:

class address
{
  ...
  friend otext& print(otext& str, const address& entry)
    {
      return str << entry.m_name << " <" << entry.m_email << "> aged " << entry.m_age;
    }

  friend otext& print(otext& str, const address& entry, unsigned indent)
    {
      print_indent(str, indent);
      print(str, entry);
      return str << endl;
    }
};

The convention with print functions is to supply two functions: one which prints inline - i.e. without line breaks - and a second with an extra indent parameter which prints the object indented on a line of its own. The second is typically written so that it calls the first, as in this case.

The address_book type is now printable by using the print_vector functions which simply call the print function for each element. However, as before, it is more consistent to provide a non-template function called just print:

otext& print(otext& str, const address_book& addresses, unsigned indent)
{
  return print_vector(str, addresses, indent);
}

It is now possible to print the address book before and after the dump/restore:

int main(int argc, char* argv[])
{
  ...
  ferr << "addresses:" << endl;
  print(ferr, addresses, 1);

  ferr << "restored addresses:" << endl;
  print(ferr, restored, 1);

  return 0;
}

Since this is a test program, it would be better if the program tested the equality of the before and after address books. The STL defines vector equality (operator==) in terms of the equality of the elements, so it is only necessary to give the address class an equality operator and the problem is solved:

class address
{
  ...
  friend bool operator == (const address& left, const address& right)
    {
      return (left.m_name == right.m_name) && (left.m_email == right.m_email) && (left.m_age == right.m_age);
    }
};

The test program can now have a test for success or failure added at the end:

int main(int argc, char* argv[])
{
  ...
  // verify that the address books are the same
  if (addresses != restored)
  {
    ferr << "restored addresses are different - Boo" << endl;
    return 3;
  }
  ferr << "restored addresses are the same - Hooray" << endl;
  return 0;
}

The output of this program when run is:

addresses:
  Andy Rushton <ajr1@ecs.soton.ac.uk> aged 40
  Andrew Brown <adb@ecs.soton.ac.uk> aged 85
  Mark Zwolinski <mz@ecs.soton.ac.uk> aged 21
restored addresses:
  Andy Rushton <ajr1@ecs.soton.ac.uk> aged 40
  Andrew Brown <adb@ecs.soton.ac.uk> aged 85
  Mark Zwolinski <mz@ecs.soton.ac.uk> aged 21
restored addresses are the same - Hooray