It seems that a popular way of converting a string to uppercase or lowercase is to do it letter by letter.
std::wstring name;
std::transform(name.begin(), name.end(), name.begin(), std::tolower);
This is wrong for many reasons.
First of all, std:: is not an addressable function. This means, among other things, that you are not allowed to take the function’s address,¹ like we’re doing here when we pass a pointer to the function to std::. So we’ll have to use a lambda.
std::wstring name;
std::transform(name.begin(), name.end(), name.begin(), [](auto c) { return std::tolower(c); });
The next mistake is a copy-pasta: The code is using std:: to convert wide characters (wchar_t) even though std:: works only for narrow characters (even more restrictive than that: it works only for unsigned narrow characters, unsigned char). There is no compile-time error because std:: accepts an int, and on most systems, wchar_t is implicitly promotable to int, so the compiler accepts the value without complaint even though over 99% of the potential values are out of range.
Even if we fix the code to use std::
:
std::wstring name;
std::transform(name.begin(), name.end(), name.begin(), [](auto c) { return std::towlower(c); });
it’s still wrong because it assumes that case mapping can be performed char-by-char or wchar_t-by-wchar_t in a context-free manner.
If the wchar_t encoding is UTF-16, then characters outside the basic multilingual plane (BMP) are represented by pairs of wchar_t values. For example, the Unicode character OLD HUNGARIAN CAPITAL LETTER A² (U+10C80) is represented by two UTF-16 code units: D803 followed by DC80.
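To make the arithmetic behind those two code units concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair split. The helper name to_surrogate_pair is hypothetical and is not part of any library.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 surrogate pair, returned as {high surrogate, low surrogate}.
std::array<char16_t, 2> to_surrogate_pair(char32_t code_point)
{
    // Only supplementary-plane code points need (or have) a surrogate pair.
    assert(code_point >= 0x10000 && code_point <= 0x10FFFF);
    const char32_t offset = code_point - 0x10000;         // 20 significant bits
    return {
        static_cast<char16_t>(0xD800 + (offset >> 10)),   // top 10 bits
        static_cast<char16_t>(0xDC00 + (offset & 0x3FF))  // bottom 10 bits
    };
}
```

Calling to_surrogate_pair(0x10C80) produces D803 followed by DC80, the pair quoted above.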
Passing these two code units to towlower one at a time prevents towlower from understanding how they interact with each other. If you call towlower with DC80, it recognizes that you passed only half of a character, but it doesn’t know what the other half is, so it has to just shrug its shoulders and say, “Um, DC80?” Too bad, because the lowercase version of OLD HUNGARIAN CAPITAL LETTER A (U+10C80) is OLD HUNGARIAN SMALL LETTER A (U+10CC0), so it should have returned DCC0. Of course towlower doesn’t have psychic powers, so you can’t really expect it to have known that the DC80 was the partner of an unseen D803.
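As a rough illustration of why the two halves have to be seen together, the sketch below pairs up surrogates before consulting a case mapping, then re-encodes the result. Both function names are hypothetical, and lower_code_point is a stand-in that knows only the single Old Hungarian pair from the text. A real implementation would consult full Unicode case-mapping data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a real case-mapping table: it knows only the
// one pair mentioned in the text (U+10C80 lowercases to U+10CC0).
char32_t lower_code_point(char32_t cp)
{
    return cp == 0x10C80 ? char32_t{0x10CC0} : cp;
}

// Lowercase UTF-16 text by joining surrogate pairs before mapping.
std::u16string lower_utf16(const std::u16string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the pair into one code point and consume the low half.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        cp = lower_code_point(cp);
        if (cp >= 0x10000) {
            // Re-encode a supplementary code point as a surrogate pair.
            out += static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
            out += static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);  // BMP or unpaired surrogate
        }
    }
    return out;
}
```

A lone DC80 passes through unchanged here too, because without its partner there is no code point to look up.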
Another problem (which applies even if wchar_t is UTF-32) is that the uppercase and lowercase versions of a character might have different lengths. For example, LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE, and LATIN SMALL LIGATURE FL (“ﬂ” U+FB02) uppercases to the two-character sequence “FL”. In both examples, converting the string to uppercase causes the string to get longer. And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS. If the accented character à were encoded as LATIN SMALL LETTER A (U+0061) followed by COMBINING GRAVE ACCENT (U+0300), then converting to uppercase causes the string to get shorter.
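The length change is easy to see in a toy model. The following sketch uppercases UTF-32 text with a deliberately tiny, illustrative table covering only ASCII letters and the two expanding cases named above. The function name upper_sketch is hypothetical, and the table is nowhere near complete. The point is only that the output cannot be written back in place over the input.

```cpp
#include <string>

// Uppercase UTF-32 text using a tiny illustrative table.
// Only ASCII letters and two expanding special cases are handled.
std::u32string upper_sketch(const std::u32string& in)
{
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x00DF) {                     // LATIN SMALL LETTER SHARP S
            out += U"SS";                       // one code point becomes two
        } else if (cp == 0xFB02) {              // LATIN SMALL LIGATURE FL
            out += U"FL";                       // one code point becomes two
        } else if (cp >= U'a' && cp <= U'z') {
            out += static_cast<char32_t>(cp - U'a' + U'A');
        } else {
            out += cp;                          // everything else is left alone
        }
    }
    return out;
}
```

With this sketch, the six code points of Straße come back as the seven code points of STRASSE, which is why an element-by-element transform into the same buffer cannot express the mapping.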
Similar issues apply to the std::string version:
std::string name;
std::transform(name.begin(), name.end(), name.begin(), [](auto c) { return std::tolower(c); });
If the string potentially contains characters outside the 7-bit ASCII range, then this triggers undefined behavior when those characters are encountered. And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
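For reference, the undefined behavior on its own can be avoided by routing each byte through unsigned char before it reaches std::tolower. The sketch below does that, with lower_bytes being a hypothetical name. It is not a fix for the larger problem: it is still a per-byte mapping, so it leaves multibyte UTF-8 sequences unconverted and cannot handle mappings that change the string length.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercase a narrow string without passing negative values to std::tolower.
// This removes only the undefined behavior. It is still byte-by-byte.
std::string lower_bytes(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

Taking the lambda parameter as unsigned char means a char with the high bit set arrives as a value in the range 0 to 255 rather than as a negative int.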
Okay, so those are the problems. What’s the solution?
If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_ or LCMAP_, possibly with other flags like LCMAP_. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
¹ The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard.
² I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something.
³ Under the pre-1996 rules, the ß can capitalize under certain conditions to “SZ”: Maßen ⇒ MASZEN. And in 2017, the Council for German Orthography (Rat für deutsche Rechtschreibung) permitted LATIN CAPITAL LETTER SHARP S (“ẞ” U+1E9E) to be used as a capital form of ß.