X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=libstdc%2B%2B-v3%2Fdoc%2Fhtml%2Fmanual%2Fstrings.html;h=c0d80e3d11fb905a41b2b98ce2077445f2664f43;hb=f25481f470c2810f6af2a7fcd76e2a0804b5f738;hp=99d8cbed0e38bf4cd16bf3d2cf07bb71a4e3b369;hpb=46abada07fd5354741fa2d12147b0ff22b858fb4;p=gcc.git diff --git a/libstdc++-v3/doc/html/manual/strings.html b/libstdc++-v3/doc/html/manual/strings.html index 99d8cbed0e3..c0d80e3d11f 100644 --- a/libstdc++-v3/doc/html/manual/strings.html +++ b/libstdc++-v3/doc/html/manual/strings.html @@ -1,3 +1,366 @@ - -
Table of Contents
+ Here are Standard, simple, and portable ways to perform common
+ transformations on a string
instance, such as
+ "convert to all upper case." The word transformations
+ is especially apt, because the standard template function
+ transform<>
is used.
+
+ This code will go through some iterations. Here's a simple + version: +
+   #include <string>
+   #include <algorithm>
+   #include <cctype>      // old <ctype.h>
+
+   struct ToLower
+   {
+     // The <cctype> functions need a value representable as unsigned char
+     char operator() (char c) const  { return std::tolower(static_cast<unsigned char>(c)); }
+   };
+
+   struct ToUpper
+   {
+     char operator() (char c) const  { return std::toupper(static_cast<unsigned char>(c)); }
+   };
+
+   int main()
+   {
+     std::string  s ("Some Kind Of Initial Input Goes Here");
+
+     // Change everything into upper case
+     std::transform (s.begin(), s.end(), s.begin(), ToUpper());
+
+     // Change everything into lower case
+     std::transform (s.begin(), s.end(), s.begin(), ToLower());
+
+     // Change everything back into upper case, but store the
+     // result in a different string
+     std::string  capital_s;
+     capital_s.resize(s.size());
+     std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
+   }
+
+ Note that these calls all
+ involve the global C locale through the use of the C functions
+ toupper/tolower
. This is absolutely guaranteed to work --
+ but only if the string contains only characters
+ from the basic source character set, and there are only
+ 96 of those. Which means that not even all English text can be
+ represented (certain British spellings, proper names, and so forth).
+ So, if all your input forevermore consists of only those 96
+ characters (hahahahahaha), then you're done.
+
Note that the
+ ToUpper
and ToLower
function objects
+ are needed because toupper
and tolower
+ are overloaded names (declared in <cctype>
and
+ <locale>
) so the template-arguments for
+ transform<>
cannot be deduced, as explained in
+ this
+ message.
+
+ At minimum, you can write short wrappers like
+
+   char toLower (char c)
+   {
+      // Cast first so that negative char values are not passed to tolower
+      return std::tolower(static_cast<unsigned char>(c));
+   }
(Thanks to James Kanze for assistance and suggestions on all of this.) +
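The examples above all go through the global C locale. As a minimal sketch of the alternative, the conversion can be done through the ctype facet of an explicitly chosen locale. The helper name to_upper_copy is hypothetical (it is not part of libstdc++), and the classic "C" locale is assumed as the default purely for illustration:

```cpp
#include <locale>
#include <string>

// Hypothetical helper, not part of libstdc++: returns an upper-cased copy
// of s, using the ctype facet of the locale passed in instead of the
// global C locale.  The default argument (the classic "C" locale) is an
// assumption made for this sketch.
inline std::string
to_upper_copy(std::string s, const std::locale& loc = std::locale::classic())
{
  const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
  if (!s.empty())
    ct.toupper(&s[0], &s[0] + s.size());   // converts the range in place
  return s;
}
```

This only changes which locale is consulted. It still converts one char at a time, so the limitations described above for text outside the basic character set continue to apply.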
Another common operation is trimming off excess whitespace. Much
+ like transformations, this task is trivial with the use of string's
+ find
family. These examples are broken into multiple
+ statements for readability:
+
+   std::string  str (" \t blah blah blah    \n ");
+
+   // trim leading whitespace
+   std::string::size_type  notwhite = str.find_first_not_of(" \t\n");
+   str.erase(0,notwhite);
+
+   // trim trailing whitespace
+   notwhite = str.find_last_not_of(" \t\n");
+   str.erase(notwhite+1);
Obviously, the calls to find
could be inserted directly
+ into the calls to erase
, in case your compiler does not
+ optimize named temporaries out of existence.
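The two trimming steps can also be wrapped in one function that leaves its argument untouched. This is only a sketch: the name trim_copy and its default whitespace set are assumptions made for illustration, not anything provided by the library.

```cpp
#include <string>

// Hypothetical helper: returns a copy of 'in' with leading and trailing
// characters from 'whitespace' removed.  Uses the same find family as the
// statements above.
inline std::string
trim_copy(const std::string& in, const char* whitespace = " \t\n")
{
  const std::string::size_type first = in.find_first_not_of(whitespace);
  if (first == std::string::npos)
    return std::string();            // empty, or nothing but whitespace

  const std::string::size_type last = in.find_last_not_of(whitespace);
  return in.substr(first, last - first + 1);
}
```

The explicit npos check makes the all-whitespace case visible, where the in-place version relies on erase(0, npos) and on npos+1 wrapping around to zero.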
+
+
The well-known-and-if-it-isn't-well-known-it-ought-to-be + Guru of the Week + discussions held on Usenet covered this topic in January of 1998. + Briefly, the challenge was, "write a 'ci_string' class which + is identical to the standard 'string' class, but is + case-insensitive in the same way as the (common but nonstandard) + C function stricmp()". +
+   ci_string s( "AbCdE" );
+
+   // case insensitive
+   assert( s == "abcde" );
+   assert( s == "ABCDE" );
+
+   // still case-preserving, of course
+   assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
+   assert( strcmp( s.c_str(), "abcde" ) != 0 );
The solution is surprisingly easy. The original answer was + posted on Usenet, and a revised version appears in Herb Sutter's + book Exceptional C++ and on his website as GotW 29. +
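For readers without the book to hand, here is a rough sketch of the general technique: a traits class that overrides the comparison members of std::char_traits<char>. This is an illustration written for this page and not the GotW text itself. The names ci_char_traits and up are assumptions, and the comparison goes through std::toupper in the global C locale, with all the caveats discussed earlier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch only: traits that compare characters without regard to case.
// Everything not overridden here is inherited from std::char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
  // Hypothetical helper: fold one character to upper case.
  static int up(char c)
  { return std::toupper(static_cast<unsigned char>(c)); }

  static bool eq(char a, char b) { return up(a) == up(b); }
  static bool lt(char a, char b) { return up(a) <  up(b); }

  static int compare(const char* a, const char* b, std::size_t n)
  {
    for (std::size_t i = 0; i < n; ++i)
      {
        if (lt(a[i], b[i])) return -1;
        if (lt(b[i], a[i])) return 1;
      }
    return 0;
  }

  static const char* find(const char* s, std::size_t n, char c)
  {
    for (std::size_t i = 0; i < n; ++i)
      if (eq(s[i], c))
        return s + i;
    return 0;
  }
};

typedef std::basic_string<char, ci_char_traits> ci_string;
```

Note that ci_string is a different type from std::string, so the two do not interoperate directly (for example with the stream operators).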
See? Told you it was easy!
+ Added June 2000: The May 2000 issue of C++ + Report contains a fascinating article by + Matt Austern (yes, the Matt Austern) on why + case-insensitive comparisons are not as easy as they seem, and + why creating a class is the wrong way to go + about it in production code. (The GotW answer mentions one of + the principal difficulties; his article mentions more.) +
Basically, this is "easy" only if you ignore some things, + things which may be too important to your program to ignore. (I chose + to ignore them when originally writing this entry, and am surprised + that nobody ever called me on it...) The GotW question and answer + remain useful instructional tools, however. +
Added September 2000: James Kanze provided a link to a + Unicode + Technical Report discussing case handling, which provides some + very good information. +
+
The std::basic_string
is tantalizingly general, in that
+ it is parameterized on the type of the characters which it holds.
+ In theory, you could whip up a Unicode character class and instantiate
+ std::basic_string<my_unicode_char>
, or assuming
+ that integers are wider than characters on your platform, maybe just
+ declare variables of type std::basic_string<int>
.
+
That's the theory. Remember however that basic_string has additional
+ type parameters, which take default arguments based on the character
+ type (called CharT
here):
+
+      template <typename CharT,
+                typename Traits = char_traits<CharT>,
+                typename Alloc = allocator<CharT> >
+      class basic_string { .... };
Now, allocator<CharT>
will probably Do The Right
+ Thing by default, unless you need to implement your own allocator
+ for your characters.
+
But char_traits
takes more work. The char_traits
+ template is declared but not defined.
+ That means there is only
+
+      template <typename CharT>
+        struct char_traits
+        {
+            static void foo (type1 x, type2 y);
+            ...
+        };
and functions such as char_traits<CharT>::foo() are not + actually defined anywhere for the general case. The C++ standard + permits this, because writing such a definition to fit all possible + CharT's cannot be done. +
The C++ standard also requires that char_traits be specialized for
+ instantiations of char
and wchar_t
, and it
+ is these template specializations that permit entities like
+ basic_string<char,char_traits<char>>
to work.
+
If you want to use character types other than char and wchar_t,
+ such as unsigned char
and int
, you will
+ need suitable specializations for them. For a time, in earlier
+ versions of GCC, there was a mostly-correct implementation that
+ let programmers be lazy but it broke under many situations, so it
+ was removed. GCC 3.4 introduced a new implementation that mostly
+ works and can be specialized even for int
and other
+ built-in types.
+
If you want to use your own special character class, then you have + a lot + of work to do, especially if you wish to use i18n features + (facets require traits information but don't have a traits argument). +
Another example of how to specialize char_traits was given on the
+ mailing list and at a later date was put into the file
+ include/ext/pod_char_traits.h
. We agree
+ that the way it's used with basic_string (scroll down to main())
+ doesn't look nice, but that's because the
+ nice-looking first attempt turned out to not
+ be conforming C++, due to the rule that CharT must be a POD.
+ (See how tricky this is?)
+
+
The Standard C (and C++) function strtok()
leaves a lot to
+ be desired in terms of user-friendliness. It's unintuitive, it
+ destroys the character string on which it operates, and it requires
+ you to handle all the memory problems. But it does let the client
+ code decide what to use to break the string into pieces; it allows
+ you to choose the "whitespace," so to speak.
+
A C++ implementation lets us keep the good things and fix those + annoyances. The implementation here is more intuitive (you only + call it once, not in a loop with a varying argument), it does not + affect the original string at all, and all the memory allocation + is handled for you. +
It's called stringtok, and it's a template function. Sources are + as below, in a less-portable form than it could be, to keep this + example simple (for example, see the comments on what kind of + string it will accept). +
+#include <string>
+template <typename Container>
+void
+stringtok(Container &container, std::string const &in,
+          const char * const delimiters = " \t\n")
+{
+    const std::string::size_type len = in.length();
+          std::string::size_type i = 0;
+
+    while (i < len)
+    {
+        // Eat leading whitespace
+        i = in.find_first_not_of(delimiters, i);
+        if (i == std::string::npos)
+          return;   // Nothing left but white space
+
+        // Find the end of the token
+        std::string::size_type j = in.find_first_of(delimiters, i);
+
+        // Push token
+        if (j == std::string::npos)
+        {
+          container.push_back(in.substr(i));
+          return;
+        }
+        else
+          container.push_back(in.substr(i, j-i));
+
+        // Set up for next loop
+        i = j + 1;
+    }
+}
+
+ The author uses a more general (but less readable) form of it for + parsing command strings and the like. If you compiled and ran this + code using it: +
+   std::list<std::string>  ls;
+   stringtok (ls, " this  \t is\t\n  a test  ");
+   for (std::list<std::string>::const_iterator i = ls.begin();
+        i != ls.end(); ++i)
+   {
+       std::cerr << ':' << (*i) << ":\n";
+   }
You would see this as output: +
+   :this:
+   :is:
+   :a:
+   :test:
with all the whitespace removed. The original input string
is still
+ available for use, ls
will clean up after itself, and
+ ls.size()
will return how many tokens there were.
+
As always, there is a price paid here, in that stringtok is not + as fast as strtok. The other benefits usually outweigh that, however. +
Added February 2001: Mark Wilden pointed out that the
+ standard std::getline()
function can be used with standard
+ istringstreams
to perform
+ tokenizing as well. Build an istringstream from the input text,
+ and then use std::getline with varying delimiters (the three-argument
+ signature) to extract tokens into a string.
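A minimal sketch of that approach follows. The function name split_on is hypothetical, and a single delimiter character is assumed for simplicity; varying the delimiter between calls to std::getline works the same way.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split 'in' at each occurrence of 'delim' by
// reading from an istringstream with the three-argument std::getline.
inline std::vector<std::string>
split_on(const std::string& in, char delim)
{
  std::vector<std::string> tokens;
  std::istringstream stream(in);
  std::string token;

  while (std::getline(stream, token, delim))
    tokens.push_back(token);

  return tokens;
}
```

Unlike stringtok above, this does not collapse runs of delimiters: two adjacent delimiters produce an empty token between them.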
+
+
From GCC 3.4 calling s.reserve(res)
on a
+ string s
with res < s.capacity()
will
+ reduce the string's capacity to std::max(s.size(), res)
.
+
This behaviour is suggested, but not required, by the standard. Prior + to GCC 3.4 the following alternative can be used instead: +
+ std::string(str.data(), str.size()).swap(str); +
This is similar to the idiom for reducing
+ a vector
's memory usage
+ (see this FAQ
+ entry) but the regular copy constructor cannot be used
+ because libstdc++'s string
is Copy-On-Write.
+
In C++11 mode you can call
+ s.shrink_to_fit()
to achieve the same effect as
+ s.reserve(s.size())
.
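As a small sketch, the swap idiom shown above can be wrapped in a function. The name shrink_capacity is an assumption made for illustration. How much capacity is actually released depends on the implementation.

```cpp
#include <string>

// Hypothetical helper wrapping the swap idiom: build a temporary that
// holds just the current contents, then swap its buffer into str.  The
// old, larger buffer is released when the temporary is destroyed.
inline void
shrink_capacity(std::string& str)
{
  std::string(str.data(), str.size()).swap(str);
}
```

The contents of the string are unchanged by the call, but pointers, references and iterators into the old buffer should be treated as invalidated.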
+
+
A common lament seen in various newsgroups deals with the Standard + string class as opposed to the Microsoft Foundation Class called + CString. Often programmers realize that a standard portable + answer is better than a proprietary nonportable one, but in porting + their application from a Win32 platform, they discover that they + are relying on special functions offered by the CString class. +
Things are not as bad as they seem. In + this + message, Joe Buck points out a few very important things: +
The Standard string
supports all the operations
+ that CString does, with three exceptions.
+
Two of those exceptions (whitespace trimming and case + conversion) are trivial to implement. In fact, we do so + on this page. +
The third is CString::Format
, which allows formatting
+ in the style of sprintf
. This deserves some mention:
+
+ The old libg++ library had a function called form(), which did much + the same thing. But for a Standard solution, you should use the + stringstream classes. These are the bridge between the iostream + hierarchy and the string class, and they operate with regular + streams seamlessly because they inherit from the iostream + hierarchy. A quick example: +
+   #include <iostream>
+   #include <string>
+   #include <sstream>
+
+   std::string f (const std::string& incoming)     // incoming is "foo  N"
+   {
+       std::istringstream   incoming_stream(incoming);
+       std::string          the_word;
+       int                  the_number;
+
+       incoming_stream >> the_word        // extract "foo"
+                       >> the_number;     // extract N
+
+       std::ostringstream   output_stream;
+       output_stream << "The word was " << the_word
+                     << " and 3*N was " << (3*the_number);
+
+       return output_stream.str();
+   }
A serious problem with CString is a design bug in its memory + allocation. Specifically, quoting from that same message: +
+      CString suffers from a common programming error that results in
+      poor performance.  Consider the following code:
+
+      CString n_copies_of (const CString& foo, unsigned n)
+      {
+              CString tmp;
+              for (unsigned i = 0; i < n; i++)
+                      tmp += foo;
+              return tmp;
+      }
+
+      This function is O(n^2), not O(n).  The reason is that each +=
+      causes a reallocation and copy of the existing string.  Microsoft
+      applications are full of this kind of thing (quadratic performance
+      on tasks that can be done in linear time) -- on the other hand,
+      we should be thankful, as it's created such a big market for high-end
+      ix86 hardware.  :-)
+
+      If you replace CString with string in the above function, the
+      performance is O(n).
+
Joe Buck also pointed out some other things to keep in mind when + comparing CString and the Standard string class: +
CString permits access to its internal representation; coders
+ who exploited that may have problems moving to string
.
+
Microsoft ships the source to CString (in the files + MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation + bug and rebuild your MFC libraries. + Note: It looks like the CString shipped + with VC++6.0 has fixed this, although it may in fact have been + one of the VC++ SPs that did it. +
string
operations like this have O(n) complexity
+ if the implementors do it correctly. The libstdc++
+ implementors did it correctly. Other vendors might not.
+
While parts of the SGI STL are used in libstdc++, their
+ string class is not. The SGI string
is essentially
+ vector<char>
and does not do any reference
+ counting like libstdc++'s does. (It is O(n), though.)
+ So if you're thinking about SGI's string or rope classes,
+ you're now looking at four possibilities: CString, the
+ libstdc++ string, the SGI string, and the SGI rope, and this
+ is all before any allocator or traits customizations! (More
+ choices than you can shake a stick at -- want fries with that?)
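For comparison with the quoted CString function, a sketch of the same function written against the Standard string might look as follows. The reserve() call is an optional addition for this illustration and is not part of the quoted message. It asks for the final size up front so that the appends need not reallocate.

```cpp
#include <string>

// Standard-string counterpart of the quoted n_copies_of (a sketch).
inline std::string
n_copies_of(const std::string& foo, unsigned n)
{
  std::string tmp;
  tmp.reserve(foo.size() * n);   // optional: request the final size up front
  for (unsigned i = 0; i < n; ++i)
    tmp += foo;
  return tmp;
}
```

Even without the reserve() call, repeated appends are expected to be linear overall when the implementation grows the buffer geometrically.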
+