Part V. Strings


Chapter 7. Strings

Here are standard, simple, and portable ways to perform common transformations on a string instance, such as "convert to all upper case." The word "transformations" is especially apt, because the standard template function transform<> is used.

This code will go through some iterations. Here's a simple version:

    #include <string>
    #include <algorithm>
    #include <cctype>      // old <ctype.h>

    struct ToLower
    {
      // Cast first: passing a negative char to std::tolower is undefined.
      char operator() (char c) const
      { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }
    };

    struct ToUpper
    {
      char operator() (char c) const
      { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }
    };

    int main()
    {
      std::string  s ("Some Kind Of Initial Input Goes Here");

      // Change everything into upper case
      std::transform (s.begin(), s.end(), s.begin(), ToUpper());

      // Change everything into lower case
      std::transform (s.begin(), s.end(), s.begin(), ToLower());

      // Change everything back into upper case, but store the
      // result in a different string
      std::string  capital_s;
      capital_s.resize(s.size());
      std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
    }

Note that these calls all involve the global C locale through the use of the C functions toupper/tolower. This is absolutely guaranteed to work -- but only if the string contains only characters from the basic source character set, and there are only 96 of those. Which means that not even all English text can be represented (certain British spellings, proper names, and so forth). So, if all your input forevermore consists of only those 96 characters (hahahahahaha), then you're done.

Note that the ToUpper and ToLower function objects are needed because toupper and tolower are overloaded names (declared in <cctype> and <locale>), so the template arguments for transform<> cannot be deduced, as explained in this message. At minimum, you can write short wrappers like

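    char toLower (char c)
    {
       // The cast avoids undefined behaviour for negative char values.
       return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

(Thanks to James Kanze for assistance and suggestions on all of this.)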
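For input that goes beyond the global C locale, one option is to ask a specific std::locale for its ctype facet instead of calling the C functions. The following is only a minimal sketch under that assumption: the helper name upper_copy is hypothetical, and it defaults to the classic "C" locale so that its behaviour matches the examples above unless a different locale is passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not part of libstdc++): returns an upper-cased copy
// of `in`, using the ctype<char> facet of the given locale.
std::string upper_copy(const std::string& in,
                       const std::locale& loc = std::locale::classic())
{
    std::string out(in);
    if (!out.empty())
    {
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
        // ctype<char>::toupper(char*, const char*) converts the range in place.
        ct.toupper(&out[0], &out[0] + out.size());
    }
    return out;
}
```

Passing a named locale as the second argument changes which characters are considered letters, but it is still a per-char conversion and so cannot handle multibyte encodings.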

Another common operation is trimming off excess whitespace. Much like transformations, this task is trivial with the use of string's find family. These examples are broken into multiple statements for readability:

    std::string  str (" \t blah blah blah    \n ");

    // trim leading whitespace
    std::string::size_type  notwhite = str.find_first_not_of(" \t\n");
    str.erase(0,notwhite);

    // trim trailing whitespace
    notwhite = str.find_last_not_of(" \t\n");
    str.erase(notwhite+1);

Obviously, the calls to find could be inserted directly into the calls to erase, in case your compiler does not optimize named temporaries out of existence.
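The same two find calls can be wrapped in a small function that returns a trimmed copy and leaves its argument alone. This is a sketch only; the name trim_copy and the default whitespace set are assumptions chosen to match the snippet above.

```cpp
#include <string>

// Hypothetical helper: returns `in` without leading or trailing characters
// from `whitespace`. An all-whitespace or empty input yields an empty string.
std::string trim_copy(const std::string& in,
                      const char* whitespace = " \t\n")
{
    const std::string::size_type first = in.find_first_not_of(whitespace);
    if (first == std::string::npos)
        return std::string();          // nothing but whitespace

    const std::string::size_type last = in.find_last_not_of(whitespace);
    return in.substr(first, last - first + 1);
}
```

Checking for npos up front keeps the arithmetic on last and first straightforward, rather than relying on npos+1 wrapping around to zero as the erase-based version does.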

Case Sensitivity

The well-known-and-if-it-isn't-well-known-it-ought-to-be Guru of the Week discussions held on Usenet covered this topic in January of 1998. Briefly, the challenge was, "write a 'ci_string' class which is identical to the standard 'string' class, but is case-insensitive in the same way as the (common but nonstandard) C function stricmp()".

    ci_string s( "AbCdE" );

    // case insensitive
    assert( s == "abcde" );
    assert( s == "ABCDE" );

    // still case-preserving, of course
    assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
    assert( strcmp( s.c_str(), "abcde" ) != 0 );

The solution is surprisingly easy. The original answer was posted on Usenet, and a revised version appears in Herb Sutter's book Exceptional C++ and on his website as GotW 29.

See? Told you it was easy!
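For readers without the book at hand, the general shape of a traits-based approach can be sketched as follows. This is not a reproduction of the GotW text: the names ci_char_traits and fold are illustrative, and the folding goes through std::toupper in the global C locale, so it inherits every limitation discussed in this section.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch of a case-insensitive traits class. Only the comparison-related
// members are replaced; everything else comes from char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
    // Hypothetical helper: fold one character to upper case (C locale).
    static char fold(char c)
    { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }

    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool ne(char a, char b) { return fold(a) != fold(b); }
    static bool lt(char a, char b) { return fold(a) <  fold(b); }

    static int compare(const char* a, const char* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return  1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;
```

Because the traits class is part of the type, a ci_string is a different type from std::string; the two do not convert to each other implicitly, and ci_string cannot be streamed with the usual operator<< without going through c_str().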

Added June 2000: The May 2000 issue of C++ Report contains a fascinating article by Matt Austern (yes, the Matt Austern) on why case-insensitive comparisons are not as easy as they seem, and why creating a class is the wrong way to go about it in production code. (The GotW answer mentions one of the principal difficulties; his article mentions more.)

Basically, this is "easy" only if you ignore some things, things which may be too important to your program to ignore. (I chose to ignore them when originally writing this entry, and am surprised that nobody ever called me on it...) The GotW question and answer remain useful instructional tools, however.

Added September 2000: James Kanze provided a link to a Unicode Technical Report discussing case handling, which provides some very good information.

Arbitrary Character Types

The std::basic_string is tantalizingly general, in that it is parameterized on the type of the characters which it holds. In theory, you could whip up a Unicode character class and instantiate std::basic_string<my_unicode_char>, or, assuming that integers are wider than characters on your platform, maybe just declare variables of type std::basic_string<int>.

That's the theory. Remember, however, that basic_string has additional type parameters, which take default arguments based on the character type (called CharT here):

    template <typename CharT,
              typename Traits = char_traits<CharT>,
              typename Alloc = allocator<CharT> >
    class basic_string { .... };

Now, allocator<CharT> will probably Do The Right Thing by default, unless you need to implement your own allocator for your characters.

But char_traits takes more work. The char_traits template is declared but not defined. That means there is only

    template <typename CharT>
      struct char_traits
      {
          static void foo (type1 x, type2 y);
          ...
      };

and functions such as char_traits<CharT>::foo() are not actually defined anywhere for the general case. The C++ standard permits this, because writing such a definition to fit all possible CharT's cannot be done.

The C++ standard also requires that char_traits be specialized for char and wchar_t, and it is these template specializations that permit entities like basic_string<char,char_traits<char>> to work.

If you want to use character types other than char and wchar_t, such as unsigned char and int, you will need suitable specializations for them. For a time, in earlier versions of GCC, there was a mostly-correct implementation that let programmers be lazy, but it broke in many situations, so it was removed. GCC 3.4 introduced a new implementation that mostly works and can be specialized even for int and other built-in types.

If you want to use your own special character class, then you have a lot of work to do, especially if you wish to use i18n features (facets require traits information but don't have a traits argument).

Another example of how to specialize char_traits was given on the mailing list and at a later date was put into the file include/ext/pod_char_traits.h. We agree that the way it's used with basic_string (scroll down to main()) doesn't look nice, but that's because the nice-looking first attempt turned out to not be conforming C++, due to the rule that CharT must be a POD. (See how tricky this is?)
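To give a feel for how much a traits class has to supply, here is a minimal sketch for unsigned char. It deliberately does not specialize std::char_traits; instead it passes a separate class as the Traits argument, which sidesteps the question of specializing a standard template for a built-in type. The names uchar_traits and ustring are hypothetical, the choice of int as int_type with -1 as eof() is an assumption, and nothing here makes iostreams or locale facets work with the new type.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

// Hypothetical traits class for unsigned char (illustration only).
struct uchar_traits
{
    typedef unsigned char  char_type;
    typedef int            int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a <  b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    { return n == 0 ? 0 : std::memcmp(a, b, n); }

    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }

    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    {
        return n == 0 ? 0
             : static_cast<const char_type*>(std::memchr(s, c, n));
    }

    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    { if (n != 0) std::memmove(dst, src, n); return dst; }

    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    { if (n != 0) std::memcpy(dst, src, n); return dst; }

    static char_type* assign(char_type* dst, std::size_t n, char_type c)
    { if (n != 0) std::memset(dst, c, n); return dst; }

    static int_type  eof() { return -1; }
    static int_type  not_eof(int_type i) { return i == eof() ? 0 : i; }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool      eq_int_type(int_type a, int_type b) { return a == b; }
};

typedef std::basic_string<unsigned char, uchar_traits> ustring;
```

The byte-oriented memcmp, memchr, and friends are only appropriate because the character type is exactly one byte wide; a wider CharT would need explicit loops.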

Tokenizing

The Standard C (and C++) function strtok() leaves a lot to be desired in terms of user-friendliness. It's unintuitive, it destroys the character string on which it operates, and it requires you to handle all the memory problems. But it does let the client code decide what to use to break the string into pieces; it allows you to choose the "whitespace," so to speak.

A C++ implementation lets us keep the good things and fix those annoyances. The implementation here is more intuitive (you only call it once, not in a loop with varying argument), it does not affect the original string at all, and all the memory allocation is handled for you.

It's called stringtok, and it's a template function. Sources are as below, in a less-portable form than it could be, to keep this example simple (for example, see the comments on what kind of string it will accept).

    #include <string>

    // Accepts only std::string input; Container must support
    // push_back(std::string), e.g. std::list<std::string>.
    template <typename Container>
    void
    stringtok(Container &container, std::string const &in,
              const char * const delimiters = " \t\n")
    {
        const std::string::size_type len = in.length();
        std::string::size_type i = 0;

        while (i < len)
        {
            // Eat leading whitespace
            i = in.find_first_not_of(delimiters, i);
            if (i == std::string::npos)
                return;   // Nothing left but white space

            // Find the end of the token
            std::string::size_type j = in.find_first_of(delimiters, i);

            // Push token
            if (j == std::string::npos)
            {
                container.push_back(in.substr(i));
                return;
            }
            else
                container.push_back(in.substr(i, j-i));

            // Set up for next loop
            i = j + 1;
        }
    }

The author uses a more general (but less readable) form of it for parsing command strings and the like. If you compiled and ran this code using it:

    std::list<std::string>  ls;
    stringtok (ls, " this  \t is\t\n  a test  ");
    for (std::list<std::string>::const_iterator i = ls.begin();
         i != ls.end(); ++i)
    {
        std::cerr << ':' << (*i) << ":\n";
    }

You would see this as output:

    :this:
    :is:
    :a:
    :test:

with all the whitespace removed. The original input string is left unmodified, ls will clean up after itself, and ls.size() will return how many tokens there were.

As always, there is a price paid here, in that stringtok is not as fast as strtok. The other benefits usually outweigh that, however.

Added February 2001: Mark Wilden pointed out that the standard std::getline() function can be used with standard istringstreams to perform tokenizing as well. Build an istringstream from the input text, and then use std::getline with varying delimiters (the three-argument signature) to extract tokens into a string.
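The getline approach might look like the sketch below. The function name split_on is hypothetical, and for simplicity it uses one fixed delimiter for the whole input, although each call to std::getline could be given a different one.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` at every occurrence of `delimiter`
// using the three-argument form of std::getline.
std::vector<std::string> split_on(const std::string& text, char delimiter)
{
    std::vector<std::string> tokens;
    std::istringstream input(text);
    std::string token;

    // getline extracts up to (and discards) the delimiter on each call.
    while (std::getline(input, token, delimiter))
        tokens.push_back(token);

    return tokens;
}
```

Unlike stringtok, this accepts only a single delimiter character at a time and does not collapse runs of delimiters: two adjacent delimiters produce an empty token between them.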

Shrink to Fit

From GCC 3.4, calling s.reserve(res) on a string s with res < s.capacity() will reduce the string's capacity to std::max(s.size(), res).

This behaviour is suggested, but not required, by the standard. Prior to GCC 3.4 the following alternative can be used instead:

    std::string(str.data(), str.size()).swap(str);

This is similar to the idiom for reducing a vector's memory usage (see this FAQ entry) but the regular copy constructor cannot be used because libstdc++'s string is Copy-On-Write.

In C++11 mode you can call s.shrink_to_fit() to achieve the same effect as s.reserve(s.size()).
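The two approaches can be combined behind one function, as in this sketch. The name release_excess is hypothetical, and the __cplusplus check is just one assumed way of choosing between them. Note that shrink_to_fit() is a non-binding request, so the resulting capacity is not guaranteed to equal the size.

```cpp
#include <string>

// Hypothetical helper: asks `s` to give back unused capacity.
// The contents and size of `s` are unchanged either way.
void release_excess(std::string& s)
{
#if __cplusplus >= 201103L
    s.shrink_to_fit();                        // C++11 and later
#else
    std::string(s.data(), s.size()).swap(s);  // swap idiom for older modes
#endif
}
```

Code that needs to know whether anything was actually released has to compare s.capacity() before and after the call.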

CString (MFC)

A common lament seen in various newsgroups deals with the Standard string class as opposed to the Microsoft Foundation Class called CString. Often programmers realize that a standard portable answer is better than a proprietary nonportable one, but in porting their application from a Win32 platform, they discover that they are relying on special functions offered by the CString class.

Things are not as bad as they seem. In this message, Joe Buck points out a few very important things:

The Standard string supports all the operations that CString does, with three exceptions.
Two of those exceptions (whitespace trimming and case conversion) are trivial to implement. In fact, we do so on this page.
The third is CString::Format, which allows formatting in the style of sprintf. This deserves some mention:

The old libg++ library had a function called form(), which did much the same thing. But for a Standard solution, you should use the stringstream classes. These are the bridge between the iostream hierarchy and the string class, and they operate with regular streams seamlessly because they inherit from the iostream hierarchy. A quick example:

    #include <iostream>
    #include <string>
    #include <sstream>

    std::string f (const std::string& incoming)     // incoming is "foo  N"
    {
        std::istringstream   incoming_stream(incoming);
        std::string          the_word;
        int                  the_number;

        incoming_stream >> the_word        // extract "foo"
                        >> the_number;     // extract N

        std::ostringstream   output_stream;
        output_stream << "The word was " << the_word
                      << " and 3*N was " << (3*the_number);

        return output_stream.str();
    }

A serious problem with CString is a design bug in its memory allocation. Specifically, quoting from that same message:

    CString suffers from a common programming error that results in
    poor performance.  Consider the following code:

    CString n_copies_of (const CString& foo, unsigned n)
    {
        CString tmp;
        for (unsigned i = 0; i < n; i++)
            tmp += foo;
        return tmp;
    }

    This function is O(n^2), not O(n).  The reason is that each +=
    causes a reallocation and copy of the existing string.  Microsoft
    applications are full of this kind of thing (quadratic performance
    on tasks that can be done in linear time) -- on the other hand,
    we should be thankful, as it's created such a big market for high-end
    ix86 hardware. :-)

    If you replace CString with string in the above function, the
    performance is O(n).
+

Joe Buck also pointed out some other things to keep in mind when comparing CString and the Standard string class:

CString permits access to its internal representation; coders who exploited that may have problems moving to string.
Microsoft ships the source to CString (in the files MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation bug and rebuild your MFC libraries. Note: It looks like the CString shipped with VC++6.0 has fixed this, although it may in fact have been one of the VC++ SPs that did it.
string operations like this have O(n) complexity if the implementors do it correctly. The libstdc++ implementors did it correctly. Other vendors might not.
While parts of the SGI STL are used in libstdc++, their string class is not. The SGI string is essentially vector<char> and does not do any reference counting like libstdc++'s does. (It is O(n), though.) So if you're thinking about SGI's string or rope classes, you're now looking at four possibilities: CString, the libstdc++ string, the SGI string, and the SGI rope, and this is all before any allocator or traits customizations! (More choices than you can shake a stick at -- want fries with that?)
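For reference, the quoted n_copies_of function written against the Standard string might look like this sketch. The reserve() call is an addition that is not in the quoted message; it is there only to make the single up-front allocation explicit rather than relying on the implementation's growth policy.

```cpp
#include <string>

// Standard-string version of the quoted function. Appending with += is
// amortized linear when capacity grows geometrically; reserving the final
// size first (an optional addition) avoids reallocation altogether.
std::string n_copies_of(const std::string& foo, unsigned n)
{
    std::string tmp;
    tmp.reserve(foo.size() * n);
    for (unsigned i = 0; i < n; ++i)
        tmp += foo;
    return tmp;
}
```

Without the reserve() call the function still works, but its complexity then depends on how the library grows the buffer, which is exactly the point made in the list above.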
