Correctly reversing a string in C# with Example




When people need to reverse a string, they usually do it more or less like this:
char[] a = s.ToCharArray(); 
System.Array.Reverse(a); 
string r = new string(a); 
However, what these people don't realize is that this is actually wrong.
And I don't mean because of the missing null check.
It is wrong because a glyph/grapheme cluster can consist of several code points (and therefore of several C# char values).
To see why, we first have to be clear about what the term "character" actually means.
Reference: 
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a 
number which is given meaning by the Unicode standard. 
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a 
reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, 
but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a 
followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this 
grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or 
directional overrides). 
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes 
or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the 
above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. 
For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this 
work. A font may contain multiple alternative glyphs for the same grapheme, too. 
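So in C#, a char is not a grapheme. It is a single UTF-16 code unit: most code points fit into one char, code points outside the Basic Multilingual Plane need two (a surrogate pair), and a grapheme may span several code points.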
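To make the difference concrete, here is a minimal sketch that compares string.Length (which counts char values, i.e. UTF-16 code units) with System.Globalization.StringInfo.LengthInTextElements. The class and method names (CharVersusGrapheme, CodeUnitCount, TextElementCount) are made up for this illustration and are not part of the original example.

```csharp
using System;
using System.Globalization;

public static class CharVersusGrapheme
{
    // Number of char values (UTF-16 code units) in the string.
    public static int CodeUnitCount(string s)
    {
        return s.Length;
    }

    // Number of text elements as reported by StringInfo.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string decomposed = "a\u0308";  // 'a' followed by COMBINING DIAERESIS
        string precomposed = "\u00E4";  // the legacy single code point for ä

        Console.WriteLine(CodeUnitCount(decomposed));     // 2
        Console.WriteLine(TextElementCount(decomposed));  // 1
        Console.WriteLine(CodeUnitCount(precomposed));    // 1
        Console.WriteLine(TextElementCount(precomposed)); // 1
    }
}
```

Both strings are displayed as ä, but only the text element count agrees with what a reader would call "one character".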
Which means, if you just reverse a valid string like Les Misérables, which can look like this
string s = "Les Mise\u0301rables";
as a sequence of characters, you will get:
selbaŕesiM seL
As you can see, the accent is now on the r instead of the e.
Reversing the char array twice does give you back the original string, but a single char-wise reversal is 
definitely NOT the reverse of the original string.
You need to reverse the order of the grapheme clusters and leave each cluster itself intact.
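If you want to check this yourself, the following sketch wraps the naive approach in a helper and looks at where the combining accent (U+0301) ends up. The names NaiveReverseDemo, NaiveReverse and CharBeforeAccent are hypothetical and only exist for this illustration.

```csharp
using System;

public static class NaiveReverseDemo
{
    // The char-by-char reversal shown at the top of the article (the wrong way).
    public static string NaiveReverse(string s)
    {
        char[] a = s.ToCharArray();
        Array.Reverse(a);
        return new string(a);
    }

    // Returns the char that directly precedes the combining acute accent,
    // i.e. the letter the accent is rendered on. Assumes the accent is
    // present and is not the first char.
    public static char CharBeforeAccent(string s)
    {
        int i = s.IndexOf('\u0301');
        return s[i - 1];
    }

    public static void Main()
    {
        string s = "Les Mise\u0301rables";
        string r = NaiveReverse(s);

        Console.WriteLine(CharBeforeAccent(s));           // e
        Console.WriteLine(CharBeforeAccent(r));           // r
        Console.WriteLine(r == "selbar\u0301esiM seL");   // True
    }
}
```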
So, if done correctly, you reverse a string like this: 
// Splits a string into its text elements (grapheme clusters).
private static System.Collections.Generic.List<string> GraphemeClusters(string s)
{
    System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
    System.Globalization.TextElementEnumerator enumerator =
        System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while (enumerator.MoveNext())
    {
        ls.Add((string)enumerator.Current);
    }
    return ls;
}

// Reverses the order of the grapheme clusters; each cluster itself stays intact.
private static string ReverseGraphemeClusters(string s)
{
    if (string.IsNullOrEmpty(s) || s.Length == 1)
        return s;
    System.Collections.Generic.List<string> ls = GraphemeClusters(s);
    ls.Reverse();
    return string.Join("", ls.ToArray());
}

public static void TestMe()
{
    string s = "Les Mise\u0301rables";
    // s = "noël";
    string r = ReverseGraphemeClusters(s);
    // r is now "selbare\u0301siM seL": the accent stays on the e.
    // This would be wrong:
    // char[] a = s.ToCharArray();
    // System.Array.Reverse(a);
    // string r = new string(a);
    System.Console.WriteLine(r);
}
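And - oh joy - if you do it correctly like this, it also works for text in Asian/South-Asian/East-Asian languages 
(and French/Swedish/Norwegian, etc.), as far as the runtime's text element detection goes. As far as I know, .NET 5 and 
later split text elements according to the Unicode extended grapheme cluster rules, while the older .NET Framework only 
keeps surrogate pairs and combining marks together.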
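The same idea also protects characters outside the Basic Multilingual Plane, which C# stores as two char values (a surrogate pair). The sketch below is a self-contained variant of the approach above, not the exact code from this article; the names SurrogateReverseDemo, NaiveReverse and ReverseTextElements are assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SurrogateReverseDemo
{
    // Char-by-char reversal: swaps the two halves of a surrogate pair.
    public static string NaiveReverse(string s)
    {
        char[] a = s.ToCharArray();
        Array.Reverse(a);
        return new string(a);
    }

    // Reverses the order of text elements and keeps each element intact.
    public static string ReverseTextElements(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        List<string> elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        string s = "a\uD83D\uDE00b"; // 'a', U+1F600 (an emoji), 'b'

        string wrong = NaiveReverse(s);
        string right = ReverseTextElements(s);

        // The naive result starts its "pair" with a low surrogate, which is invalid UTF-16.
        Console.WriteLine(char.IsLowSurrogate(wrong[1]));   // True
        // The text element version keeps the pair in the correct order.
        Console.WriteLine(right == "b\uD83D\uDE00a");       // True
    }
}
```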
string s = "Foo"; 
string paddedLeft = s.PadLeft(5); // paddedLeft = "  Foo" (pads with spaces by default) 
string paddedRight = s.PadRight(6, '+'); // paddedRight = "Foo+++" 
