The STL+ C++ library
Andy Rushton
This is a set of functions that perform string conversions and printing routines. There are functions here for converting integers to/from string representations as well as floating-point types to/from string representations. There are template functions that make it easy to capture a string representation of a data structure such as a vector by simply defining a string conversion function for the element. There are supporting functions for printing the same types - i.e. printing integer, floating point types and template print functions for data structures.
The string formatting functions in string_utilities.hpp are controlled by a set of enumeration types which have been separated out into the header format_types.hpp. This is because they are also used to control formatting in the TextIO classes.
In addition to these formatting routines, there are some common string operations such as wildcard matching and the Perl-like split and join functions.
C is a bit limited in its ability to display different radix (base) numbers. Basically you can print in decimal (e.g. 1234), octal (e.g. 01234) or hexadecimal (e.g. 0x1234) only. Sometimes you need other radices - in particular, binary and base 13 (grin).
To improve on this situation, the string formatting functions for integer types (but not for floating point types) support all radices from base 2 to base 36 using the character set [0-9a-z]. They offer three different formatting options for showing the radix:
Hash-style format starts with the base in decimal, a '#', then the number in the specified base. For example, 16#ff is 255 in hexadecimal. The advantage of this format is that it can be applied to any base. Thus 36#zz is a valid hash-style number. Its value is left as an exercise for the reader (translation: I'm too lazy to work it out).
Hash-style format is a sign-magnitude format. A negative value has the sign after the # character: 16#-ff represents -255 in hexadecimal.
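To make the hash-style layout concrete, here is a minimal sketch of a formatter that follows the description above. The name hash_style and the namespace sketch are made up for illustration; this is not the library's to_string, just standard C++ that produces the same shape of output.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical helper: format value as <radix in decimal>#[-]<digits in radix>.
std::string hash_style(long value, unsigned radix)
{
  if (radix < 2 || radix > 36)
    throw std::invalid_argument("radix must be in the range 2-36");
  static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
  // Sign-magnitude: convert the magnitude and emit the sign separately.
  unsigned long magnitude = value < 0 ? 0UL - static_cast<unsigned long>(value)
                                      : static_cast<unsigned long>(value);
  std::string body;
  do {
    body.insert(body.begin(), digits[magnitude % radix]);
    magnitude /= radix;
  } while (magnitude != 0);
  std::string result = std::to_string(radix) + "#";
  if (value < 0) result += "-";  // the sign goes after the '#'
  return result + body;
}

// Usage, mirroring the examples in the text.
inline void hash_style_examples()
{
  assert(hash_style(255, 16) == "16#ff");
  assert(hash_style(-255, 16) == "16#-ff");
  assert(hash_style(1295, 36) == "36#zz");  // 35*36 + 35
}

} // namespace sketch
```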
Each integer formatting function takes as an argument an enumeration of type radix_display_t specifying which formatting style to use for the output. The type has the following values:
Note that the only styles that are guaranteed to give a value that can be correctly converted back to an integer again are: radix_hash_style, radix_hash_style_all and radix_c_style_or_hash. The last of these is the recommended style for all printing since it is the most natural combination - decimal is printed as a number (e.g. 1234), binary, octal and hex are in familiar C-style (e.g. 0b0100100, 01234 or 0x1234) and all other bases are in hash style (e.g. 4#3210). Indeed, radix_c_style_or_hash is the default format for all the string formatting functions.
When formatting real numbers as strings, there are three formats supported. These are controlled by the enumeration type real_display_t. It has the following values:
There is a whole family of functions called to_string which take an integer type and format it into a std::string. The parameter profile of these functions is:
std::string to_string(type i,
unsigned radix = 10,
radix_display_t display = radix_c_style_or_hash,
unsigned width = 0)
throw(std::invalid_argument);
In this case, type is any integer type - namely bool, short, unsigned short, int, unsigned, long and unsigned long.
The width parameter specifies the minimum number of digits to use to represent the value. The result may be larger than this if the value doesn't fit in the specified width. The default of 0 means use the minimum number of digits to represent the value. Any prefix that indicates the radix is in addition to this, so if you ask for, for example, zero in hexadecimal using C style with a width of 4, you will get 0x0000. Using hash style will give 16#0000.
The exception std::invalid_argument will be thrown if the radix is not in the range 2-36 or the display enumeration is illegal.
The default values mean that the functions can be used with just a single parameter:
string s = to_string(i);
In this case, the output will be in decimal with no formatting codes (since radix_c_style_or_hash prints decimal as just a simple number).
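The interaction between the radix, the display style and the width is easier to see in code. The following is a rough sketch, for unsigned values only, of the behaviour described for radix_c_style_or_hash. The function names and the namespace are hypothetical and the code uses only the standard library, so treat it as an illustration of the rules rather than the library's implementation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Digits of value in the given radix, zero-padded to at least width digits.
std::string digits_of(unsigned long value, unsigned radix, unsigned width)
{
  static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
  std::string body;
  do {
    body.insert(body.begin(), digits[value % radix]);
    value /= radix;
  } while (value != 0);
  if (body.size() < width)
    body = std::string(width - body.size(), '0') + body;
  return body;
}

// Hypothetical stand-in for to_string(value, radix, radix_c_style_or_hash, width).
std::string c_style_or_hash(unsigned long value, unsigned radix = 10, unsigned width = 0)
{
  if (radix < 2 || radix > 36)
    throw std::invalid_argument("radix must be in the range 2-36");
  std::string prefix;
  switch (radix) {
    case 10: break;                                        // plain decimal
    case 2:  prefix = "0b"; break;                         // C style
    case 8:  prefix = "0";  break;
    case 16: prefix = "0x"; break;
    default: prefix = std::to_string(radix) + "#"; break;  // hash style
  }
  // The prefix is added on top of the requested number of digits.
  return prefix + digits_of(value, radix, width);
}

inline void c_style_or_hash_examples()
{
  assert(c_style_or_hash(1234) == "1234");
  assert(c_style_or_hash(0, 16, 4) == "0x0000");
  assert(c_style_or_hash(228, 4) == "4#3210");
}

} // namespace sketch
```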
There is one last form of to_string in this set that is worth noting:
std::string to_string(const void*,
unsigned radix = 16,
radix_display_t display = radix_c_style_or_hash,
unsigned width = 0)
throw(std::invalid_argument);
This prints out an address as a number (any address, since in C any pointer can be treated as a void*). The default radix is set to 16 because most people expect addresses to be in hex.
These functions do the reverse conversion, taking a string as an argument and returning the integer value represented. They recognise the normal C-style formatting and the hash-style formatting so can read a string written in any base.
The integer conversion functions are of the form:
type to_type(const std::string& value, unsigned radix = 0) throw(std::invalid_argument);
where type is bool, short, unsigned short, int, unsigned int, long or unsigned long.
A radix of 0 means work out the radix from the string. The default is then 10. Any other radix will force the default to be that radix. Thus if you have a number which has been printed using radix_none but with a radix of 32, you can convert it back to integer by specifying a radix of 32. However, any number printed using the default radix_c_style_or_hash will be read correctly without specifying a conversion radix.
The exception std::invalid_argument will be thrown if a radix is specified outside the range 2-36.
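The reading rules can be sketched in the same way. The parser below is hypothetical (the name parse_integer is mine) and makes some simplifying assumptions that the text does not confirm: C-style prefixes are only looked for when the radix argument is 0, a sign is accepted at the start of a C-style number, and malformed input is reported with std::invalid_argument.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

namespace sketch {

// Value of a digit character in [0-9a-zA-Z], or 36 if it is not a digit.
unsigned digit_value(char ch)
{
  static const std::string digits = "0123456789abcdefghijklmnopqrstuvwxyz";
  const char lower = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
  const std::string::size_type pos = digits.find(lower);
  return pos == std::string::npos ? 36u : static_cast<unsigned>(pos);
}

// Hypothetical reader for plain, C-style and hash-style integers.
long parse_integer(const std::string& text, unsigned radix = 0)
{
  if (radix != 0 && (radix < 2 || radix > 36))
    throw std::invalid_argument("radix must be 0 or in the range 2-36");
  std::string body = text;
  unsigned base = radix == 0 ? 10 : radix;  // default when no prefix is found
  const std::string::size_type hash = body.find('#');
  if (hash != std::string::npos) {
    // Hash style: the radix is written in decimal before the '#'.
    base = static_cast<unsigned>(std::stoul(body.substr(0, hash)));
    if (base < 2 || base > 36)
      throw std::invalid_argument("hash-style radix out of range");
    body.erase(0, hash + 1);
  }
  bool negative = false;
  if (!body.empty() && (body[0] == '-' || body[0] == '+')) {
    negative = body[0] == '-';
    body.erase(0, 1);
  }
  if (hash == std::string::npos && radix == 0) {
    // C style: 0x for hex, 0b for binary, a leading 0 for octal.
    if (body.size() > 2 && body[0] == '0' && (body[1] == 'x' || body[1] == 'X')) {
      base = 16; body.erase(0, 2);
    } else if (body.size() > 2 && body[0] == '0' && (body[1] == 'b' || body[1] == 'B')) {
      base = 2; body.erase(0, 2);
    } else if (body.size() > 1 && body[0] == '0') {
      base = 8; body.erase(0, 1);
    }
  }
  if (body.empty()) throw std::invalid_argument("no digits");
  long result = 0;
  for (char ch : body) {
    const unsigned d = digit_value(ch);
    if (d >= base) throw std::invalid_argument("digit out of range for radix");
    result = result * static_cast<long>(base) + static_cast<long>(d);
  }
  return negative ? -result : result;
}

inline void parse_integer_examples()
{
  assert(parse_integer("0x1234") == 0x1234);
  assert(parse_integer("16#-ff") == -255);
  assert(parse_integer("zz", 36) == 1295);  // no prefix, so the radix must be given
}

} // namespace sketch
```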
And finally, there is the reverse conversion for pointers:
void* to_void_star(const std::string& value, unsigned radix = 0) throw(std::invalid_argument);
There are two to_string functions which format the C++ real types float and double to a string representation. These are:
std::string to_string(float f,
real_display_t display = display_mixed,
unsigned width = 0,
unsigned precision = 6);
std::string to_string(double f,
real_display_t display = display_mixed,
unsigned width = 0,
unsigned precision = 6);
The default values are chosen to give reasonable displays for most applications. The default format is display_mixed (equivalent to "%g") with a precision of 6 and no field width, which gives a minimum field width. See dprintf.hpp for the meanings of the precision and field width for floating point numbers.
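Since display_mixed is described as equivalent to "%g", the effect of the width and precision parameters can be previewed with plain snprintf. The helper below is a hypothetical stand-in (format_mixed is not a library function) and assumes the parameters are passed straight through to printf. Note that printf's %g counts the precision in significant digits rather than digits after the decimal point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical helper: format value with "%g" using a minimum field width and a precision.
std::string format_mixed(double value, unsigned width = 0, unsigned precision = 6)
{
  const int w = static_cast<int>(width);
  const int p = static_cast<int>(precision);
  // First call measures the output, second call writes it.
  const int needed = std::snprintf(nullptr, 0, "%*.*g", w, p, value);
  if (needed < 0) return std::string();
  std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
  std::snprintf(buffer.data(), buffer.size(), "%*.*g", w, p, value);
  return std::string(buffer.data(), static_cast<std::size_t>(needed));
}

inline void format_mixed_examples()
{
  assert(format_mixed(3.14159265) == "3.14159");  // six significant digits
  assert(format_mixed(0.5, 8) == "     0.5");     // padded to the field width
  assert(format_mixed(100.0) == "100");           // trailing zeros are dropped
}

} // namespace sketch
```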
Once again there are two conversions from string to real types, one for each C++ real type. These are:
float to_float(const std::string& value);
double to_double(const std::string& value);
These conversions will accept strings formatted in any of the formats which can be used by the real to_string functions, so there is symmetry here.
There is a set of functions which are also called xxx_to_string (where xxx is a type name) but which are templates. They give a convenient way of providing string formatting for the most commonly used STL and STL+ container classes. They rely on you writing a to_string function for the type contained within the container.
Since the element type is unknown, a simpler parameter profile is used for these functions, missing all the formatting parameters. However, most data types do have other formatting parameters such as separators for multi-element types. Here's the set of functions:
template <typename T>
std::string pointer_to_string(const T* value, const std::string& null_string, const std::string& prefix, const std::string& suffix);
template<size_t N>
std::string bitset_to_string(const std::bitset<N>& data);
template<typename T>
std::string list_to_string(const std::list<T>& values, std::string separator);
template<typename L, typename R>
std::string pair_to_string(const std::pair<L,R>& values, std::string separator);
template<typename K, typename T, typename P>
std::string map_to_string(const std::map<K,T,P>& values, const std::string& pair_separator, const std::string& separator);
template<typename K, typename T, typename P>
std::string multimap_to_string(const std::multimap<K,T,P>& values, const std::string& pair_separator, const std::string& separator);
template<typename K, typename P>
std::string set_to_string(const std::set<K,P>& values, const std::string& separator);
template<typename K, typename P>
std::string multiset_to_string(const std::multiset<K,P>& values, const std::string& separator);
template<typename T>
std::string vector_to_string(const std::vector<T>& values, std::string separator);
In addition to these functions, the STLplus containers define their own string conversion functions which are compatible with these.
Note: the reason these are not just called to_string is that the Microsoft VC++ compiler chokes on overloaded template functions.
The vector, list, set and multiset functions simply call to_string on each element and build up a composite result with each element separated by the separator (which defaults to a comma).
The map and multimap functions are similar, but each element is a pair, which is printed by calling the pair_to_string function. This in turn calls to_string on each of the elements of the pair.
The pointer conversion routines print the type within the following format: <prefix><value><suffix>. Typically, prefix = "*(" and suffix = ")", so the value is printed like this: "*(value)". This is meant to read as "pointer to value". If the pointer is null, the null_string is printed instead, with no prefix or suffix.
To use these functions, you need to write a to_string function for the element type of the template with one of the following profiles:
std::string to_string(type value);
std::string to_string(const type& value);
Due to the use of default values, the integer formatting to_string functions are perfectly suited to this, so for example, a vector of int will print without writing a single line of extra code.
vector<int> values = ...;
...
ferr << "the values are: " << vector_to_string(values,",") << endl;
To print a template container which contains another template container, write a to_string function simply calling the xxx_to_string function. For example, to write a to_string function for a list<vector<string>> you go through the following sequence:
First, there is already a to_string function for type string (with a pretty trivial implementation!), so you have nothing to do there. However, if the lowest level type did not already have a to_string function you would need to write one.
Now write a to_string function for the vector<string> type:
string to_string(const vector<string>& values)
{
return vector_to_string(values,":");
}
Note that this will create a colon-separated list.
Finally, it is possible to write the top-level to_string function for the list:
string to_string(const list<vector<string>>& values)
{
return list_to_string(values,",");
}
Note that this creates a comma-separated list. Thus, overall, the string will contain a comma-separated list of colon-separated strings.
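If you want to try this layering without the library to hand, the following self-contained sketch reproduces it with the standard library only. The templates here are simplified stand-ins written for this example (range_to_string is a made-up helper), not the STL+ versions. One detail worth noticing is that the element-level to_string functions are declared before the templates that call them, so that ordinary name lookup finds them.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

namespace sketch {

// Element-level conversions, declared before the templates that call them.
std::string to_string(const std::string& value);
std::string to_string(const std::vector<std::string>& values);

// Hypothetical helper: to_string every element of [first, last), separated by separator.
template<typename Iterator>
std::string range_to_string(Iterator first, Iterator last, const std::string& separator)
{
  std::string result;
  for (Iterator i = first; i != last; ++i) {
    if (i != first) result += separator;
    result += to_string(*i);
  }
  return result;
}

template<typename T>
std::string vector_to_string(const std::vector<T>& values, const std::string& separator)
{
  return range_to_string(values.begin(), values.end(), separator);
}

template<typename T>
std::string list_to_string(const std::list<T>& values, const std::string& separator)
{
  return range_to_string(values.begin(), values.end(), separator);
}

// Lowest level: a string converts to itself.
std::string to_string(const std::string& value) { return value; }

// Middle level: colon-separated strings.
std::string to_string(const std::vector<std::string>& values)
{
  return vector_to_string(values, ":");
}

// Top level: comma-separated list of colon-separated strings.
std::string to_string(const std::list<std::vector<std::string>>& values)
{
  return list_to_string(values, ",");
}

inline void nested_example()
{
  std::vector<std::string> first;
  first.push_back("a");
  first.push_back("b");
  std::vector<std::string> second(1, "c");
  std::list<std::vector<std::string>> all;
  all.push_back(first);
  all.push_back(second);
  assert(to_string(all) == "a:b,c");
}

} // namespace sketch
```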
In parallel with the set of string conversion routines, there is a set of print routines for the same set of types. The convention is to have a print function for printing in-line (i.e. the value is on one line) and to have an overloaded print function to print on a whole line, with indent before and newline after.
Indentation is controlled by the following static functions:
void set_indent_step(unsigned step);
unsigned indent_step(void);
otext& print_indent(otext& str, unsigned indent);
The default indent step is 2 characters, so that means that by increasing the indent value by one, you'll increase the indent by these two characters. This is useful for printing structure. For example, you might choose to print a vector by printing its size and then an indented set of values:
otext& print(otext& str, const vector<string>& values, unsigned indent)
{
print_indent(str, indent);
str << values.size() << endl;
print_vector(str, values, indent+1);
return str;
}
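The same idea can be tried out with a standard stream. The sketch below substitutes std::ostream for otext and uses simplified stand-ins for print_indent and print_vector. It assumes that the indented form of print_vector writes one element per line, which the text here does not spell out.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

const unsigned indent_step = 2;  // spaces per indent level, matching the stated default

// Write indent * indent_step spaces.
std::ostream& print_indent(std::ostream& str, unsigned indent)
{
  return str << std::string(indent * indent_step, ' ');
}

// Assumed layout: one element per line, each preceded by the indent.
std::ostream& print_vector(std::ostream& str, const std::vector<std::string>& values, unsigned indent)
{
  for (const std::string& value : values)
    print_indent(str, indent) << value << '\n';
  return str;
}

// The size on one line, then the elements indented one level further.
std::ostream& print(std::ostream& str, const std::vector<std::string>& values, unsigned indent)
{
  print_indent(str, indent) << values.size() << '\n';
  return print_vector(str, values, indent + 1);
}

inline void print_example()
{
  std::vector<std::string> values;
  values.push_back("x");
  values.push_back("y");
  std::ostringstream out;
  print(out, values, 0);
  assert(out.str() == "2\n  x\n  y\n");
}

} // namespace sketch
```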
In fact, only the inline version of the print functions is provided for the basic low-level C types because it is unlikely that they will be printed on a line of their own.
The integer print routines have the following profile:
otext& print(otext& str, type value,
unsigned radix = 10, radix_display_t display = radix_c_style_or_hash, unsigned width = 0)
throw(std::invalid_argument);
In this case, type is any integer type - namely bool, short, unsigned short, int, unsigned, long and unsigned long.
The extra parameters have the same meaning as for the to_string functions.
Similarly, floating-point types are handled:
otext& print(otext& str, type f,
real_display_t display = display_mixed, unsigned width = 0, unsigned precision = 6)
throw(std::invalid_argument);
There are print routines for the template classes which are compatible with the to_string functions for templates. These print functions have the name print_class where class is the container class to print, e.g. vector. The set of print functions is:
template <typename T>
otext& print_pointer(otext& str, const T* value,
const std::string& null_string, const std::string& prefix, const std::string& suffix);
template <typename T>
otext& print_pointer(otext& str, const T* value, unsigned indent,
const std::string& null_string, const std::string& prefix, const std::string& suffix);
template<size_t N>
otext& print_bitset(otext& str, const std::bitset<N>& value);
template<size_t N>
otext& print_bitset(otext& str, const std::bitset<N>& value, unsigned indent);
template<typename T>
otext& print_list(otext& str, const std::list<T>& values, const std::string& separator);
template<typename T>
otext& print_list(otext& str, const std::list<T>& values, unsigned indent);
template<typename L, typename R>
otext& print_pair(otext& str, const std::pair<L,R>& values, const std::string& separator);
template<typename L, typename R>
otext& print_pair(otext& str, const std::pair<L,R>& values, const std::string& separator, unsigned indent);
template<typename K, typename T, typename P>
otext& print_map(otext& str, const std::map<K,T,P>& values, const std::string& pair_separator, const std::string& separator);
template<typename K, typename T, typename P>
otext& print_map(otext& str, const std::map<K,T,P>& values, const std::string& pair_separator, unsigned indent);
template<typename K, typename T, typename P>
otext& print_multimap(otext& str, const std::multimap<K,T,P>& values, const std::string& pair_separator, const std::string& separator);
template<typename K, typename T, typename P>
otext& print_multimap(otext& str, const std::multimap<K,T,P>& values, const std::string& pair_separator, unsigned indent);
template<typename K, typename P>
otext& print_set(otext& str, const std::set<K,P>& values, const std::string& separator);
template<typename K, typename P>
otext& print_set(otext& str, const std::set<K,P>& values, unsigned indent);
template<typename K, typename P>
otext& print_multiset(otext& str, const std::multiset<K,P>& values, const std::string& separator);
template<typename K, typename P>
otext& print_multiset(otext& str, const std::multiset<K,P>& values, unsigned indent);
template<typename T>
otext& print_vector(otext& str, const std::vector<T>& values, const std::string& separator);
template<typename T>
otext& print_vector(otext& str, const std::vector<T>& values, unsigned indent);
std::string pad(const std::string& str, alignment_t alignment, unsigned width, char padch);
std::string trim_left(const std::string& val);
std::string trim_right(const std::string& val);
std::string trim(const std::string& val);
The pad function is in fact the one used to perform padding of the integer formats - it allows a string to be aligned in a fixed-width field.
This is controlled by an enumeration of type alignment_t which specifies how the string is to be aligned within the field. It has the following values:
If the field is not wide enough, the string is not truncated. It is simply printed in full with no padding.
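As a rough illustration of the padding rule, here is a standalone sketch. The enumeration and its value names (align_left, align_right, align_centre) are assumptions made for this example, as is the choice to put any odd leftover pad character on the right when centring.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-in for alignment_t; the enumerator names are assumed.
enum alignment { align_left, align_right, align_centre };

std::string pad(const std::string& str, alignment align, unsigned width, char padch = ' ')
{
  if (str.size() >= width) return str;  // too narrow: no truncation, no padding
  const std::string::size_type extra = width - str.size();
  switch (align) {
    case align_left:
      return str + std::string(extra, padch);
    case align_right:
      return std::string(extra, padch) + str;
    case align_centre: {
      const std::string::size_type before = extra / 2;
      return std::string(before, padch) + str + std::string(extra - before, padch);
    }
  }
  return str;
}

inline void pad_examples()
{
  assert(pad("ab", align_left, 5, '.') == "ab...");
  assert(pad("ab", align_right, 5, '.') == "...ab");
  assert(pad("abcdef", align_right, 3) == "abcdef");
}

} // namespace sketch
```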
The trim functions trim whitespace from the argument. The names are fairly self-explanatory - trim_left trims whitespace from the left of the string, trim_right from the right of the string and trim trims whitespace from both ends of the string. Whitespace is defined by the isspace function from <ctype.h>.
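For reference, trimming with isspace takes only a few lines of standard C++. This is a sketch of one way to do it, not necessarily how the library does. The cast to unsigned char is needed because passing a negative char value to isspace is undefined.

```cpp
#include <cassert>
#include <cctype>
#include <string>

namespace sketch {

// Drop leading whitespace as defined by isspace.
std::string trim_left(const std::string& val)
{
  std::string::size_type start = 0;
  while (start < val.size() && std::isspace(static_cast<unsigned char>(val[start])))
    ++start;
  return val.substr(start);
}

// Drop trailing whitespace as defined by isspace.
std::string trim_right(const std::string& val)
{
  std::string::size_type end = val.size();
  while (end > 0 && std::isspace(static_cast<unsigned char>(val[end - 1])))
    --end;
  return val.substr(0, end);
}

// Drop whitespace from both ends; whitespace in the middle is kept.
std::string trim(const std::string& val)
{
  return trim_left(trim_right(val));
}

inline void trim_examples()
{
  assert(trim_left("  x ") == "x ");
  assert(trim_right("  x ") == "  x");
  assert(trim("  a b\t\n") == "a b");
}

} // namespace sketch
```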
std::string lowercase(const std::string& val);
std::string uppercase(const std::string& val);
The lowercase and uppercase functions are pretty self evident. Note that they do not modify their arguments, but return a new string which has been case-converted.
std::string translate(const std::string& input, const std::string& from_set, const std::string& to_set);
This function was inspired by the 'tr' function from Unix. It processes the input string to generate the return string by replacing any character in the from_set with the character in the same position in the to_set. In other words, if a character in the input string is found at index 17 of the from_set, the returned string will contain the character in index 17 of the to_set. If the to_set is smaller than the from_set, then the extra characters represent characters to delete - in other words they map onto nothing. If a character is not present in the from_set, it will be copied to the output unchanged.
For example:
string result = translate("fred123.txt", "abcdefghijklmnopqrstuvwxyz01234567890", "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
This example will convert lowercase letters to uppercase letters. It will delete digits (the from_set is longer than the to_set) and copy anything else unchanged to the output. The result string will therefore be "FRED.TXT".
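The mapping rule is simple enough to sketch in a few lines. The version below is written from the description above using only std::string and is not taken from the library source, so any behaviour beyond what the text states (for instance with duplicate characters in the from_set, where this sketch uses the first occurrence) is an assumption.

```cpp
#include <cassert>
#include <string>

namespace sketch {

std::string translate(const std::string& input, const std::string& from_set, const std::string& to_set)
{
  std::string result;
  for (char ch : input) {
    const std::string::size_type pos = from_set.find(ch);
    if (pos == std::string::npos)
      result += ch;           // not in from_set: copy unchanged
    else if (pos < to_set.size())
      result += to_set[pos];  // replace with the character at the same index
    // otherwise the index is past the end of to_set, so the character is deleted
  }
  return result;
}

inline void translate_example()
{
  assert(translate("fred123.txt",
                   "abcdefghijklmnopqrstuvwxyz01234567890",
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ") == "FRED.TXT");
}

} // namespace sketch
```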
This function performs the kind of wildcard matching usually found in command-line tools for filename handling. Sometimes you need to do this kind of stuff yourself. Unfortunately, this kind of thing is not always provided by the operating system. For portability, then, this function should be used instead.
The function looks like this:
bool match_wildcard(const std::string& wild, const std::string& match);
The first argument is the wildcard expression and the second is the string to match against it. The function returns true (wow, surprise) if the match string does match the wild string.
The wildcard expression can contain any of the following:
Thus the wildcard expression "*.vhdl" matches any string ending in the sequence ".vhdl".
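To show the general technique, here is a small matcher with backtracking. It handles only '*' (any sequence of characters, including none) and '?' (any single character). The '*' behaviour follows the example above, but '?' is an assumption borrowed from common shell conventions, and whatever other syntax the library accepts is not covered.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Simplified stand-in supporting only '*' and '?'.
bool match_wildcard(const std::string& wild, const std::string& match)
{
  const std::string::size_type npos = std::string::npos;
  std::string::size_type w = 0, m = 0;
  std::string::size_type star = npos;  // position of the most recent '*' in wild
  std::string::size_type resume = 0;   // where in match that '*' started matching
  while (m < match.size()) {
    if (w < wild.size() && wild[w] == '*') {
      star = w++;      // remember the star and first try matching nothing
      resume = m;
    } else if (w < wild.size() && (wild[w] == '?' || wild[w] == match[m])) {
      ++w;             // single character matched
      ++m;
    } else if (star != npos) {
      w = star + 1;    // backtrack: let the last star absorb one more character
      m = ++resume;
    } else {
      return false;
    }
  }
  // Only trailing stars may remain unmatched in the wildcard expression.
  while (w < wild.size() && wild[w] == '*') ++w;
  return w == wild.size();
}

inline void match_wildcard_examples()
{
  assert(match_wildcard("*.vhdl", "top.vhdl"));
  assert(!match_wildcard("*.vhdl", "top.vhd"));
  assert(match_wildcard("a?c", "abc"));
}

} // namespace sketch
```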
There are two functions in the Perl language which are incredibly useful for string manipulation, and which I therefore wanted in C++. These are the split and join functions. Basically the split function converts a string into a vector of strings by splitting the string at every occurrence of a splitter string. For example, a PATH can be split into its elements by splitting with ":" on Unix or ";" on Windows (see the subprocesses subsystem for a platform-independent interface for this, but yes it does use split internally). The reverse function is join, which converts a vector of strings into a single string by interleaving with a joiner string.
The function interfaces are:
std::vector<std::string> split (const std::string& str, const std::string& splitter = "\n");
std::string join (const std::vector<std::string>&, const std::string& joiner = "\n", const std::string& prefix = "", const std::string& suffix = "");
Note that the split function considers the start and the end of the string to be split points. It searches from the current split point to the next split point and adds the intervening text to the vector. It follows that if the splitter appears at the beginning or end of the string, an empty string will be added to the vector. Similarly, if two instances of the splitter appear consecutively in the string, an empty string will be added to the vector. This is correct behaviour, not a bug!
Note also that the join function allows you to add a prefix and a suffix to the resulting string, so for example a vector of values could be turned into a parenthesised, comma-separated string by a single call which sets the joiner=",", prefix="(" and suffix=")".
Another neat use of these functions is in converting one separator into another by nesting the calls. For example, to convert a colon-separated string into a comma-separated string, simply split and then join:
string value = "a:b:c:d:e";
string result = join(split(value,":"),",");
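For anyone who wants to experiment with these rules outside the library, a minimal version of both functions can be written as below. It is a sketch based on the description above, not the library source. What it does with an empty splitter (returns the whole string as one element) is my own choice, and so is its handling of an empty input string.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

std::vector<std::string> split(const std::string& str, const std::string& splitter = "\n")
{
  std::vector<std::string> result;
  if (splitter.empty()) {  // sketch-only choice: nothing to split on
    result.push_back(str);
    return result;
  }
  std::string::size_type start = 0;
  for (;;) {
    const std::string::size_type found = str.find(splitter, start);
    if (found == std::string::npos) {
      result.push_back(str.substr(start));  // the end of the string is a split point
      break;
    }
    result.push_back(str.substr(start, found - start));
    start = found + splitter.size();
  }
  return result;
}

std::string join(const std::vector<std::string>& parts,
                 const std::string& joiner = "\n",
                 const std::string& prefix = "",
                 const std::string& suffix = "")
{
  std::string result = prefix;
  for (std::vector<std::string>::size_type i = 0; i < parts.size(); ++i) {
    if (i != 0) result += joiner;
    result += parts[i];
  }
  return result + suffix;
}

inline void split_join_examples()
{
  assert(join(split("a:b:c:d:e", ":"), ",") == "a,b,c,d,e");
  assert(join(split("1,2,3", ","), ",", "(", ")") == "(1,2,3)");
  assert(split(":a::b:", ":").size() == 5);  // "", "a", "", "b", ""
}

} // namespace sketch
```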
There are two functions for displaying either a byte-count or an elapsed time in seconds in a human-readable form.
std::string display_bytes(unsigned bytes);
This creates a string representation of the number of bytes, represented as a number in B, kB, MB or GB depending on the value. It is approximate in that the result is rounded to a sensible number of digits.
std::string display_time(unsigned seconds);
This function displays the parameter in seconds as a string representation in weeks, days, hours, minutes, seconds. For example, "4d 3:02:01" means 4 days, 3 hours, 2 minutes and 1 second.
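The arithmetic behind that format is straightforward, as the following sketch shows. It only reproduces the "4d 3:02:01" shape from the example. How the library renders weeks, or values shorter than a day, is not shown here, so this version simply keeps counting in days and drops the day field when it is zero. Both of those are assumptions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

namespace sketch {

// Simplified stand-in: days, then h:mm:ss. Weeks are not handled.
std::string display_time(unsigned seconds)
{
  const unsigned days = seconds / 86400;
  const unsigned hours = (seconds / 3600) % 24;
  const unsigned minutes = (seconds / 60) % 60;
  const unsigned secs = seconds % 60;
  char buffer[64];
  if (days > 0)
    std::snprintf(buffer, sizeof buffer, "%ud %u:%02u:%02u", days, hours, minutes, secs);
  else
    std::snprintf(buffer, sizeof buffer, "%u:%02u:%02u", hours, minutes, secs);
  return std::string(buffer);
}

inline void display_time_example()
{
  // 4 days, 3 hours, 2 minutes and 1 second
  assert(display_time(4 * 86400 + 3 * 3600 + 2 * 60 + 1) == "4d 3:02:01");
}

} // namespace sketch
```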