String Manipulation

#1 Curtis Rutland

Posted 16 September 2013 - 07:36 PM

Learning C# Series
String Manipulation

What can we do with strings? More importantly, what can't we do with strings? These are two extremely important questions that anyone who wants to be a good C# programmer should know the answers to.

This tutorial will cover the techniques for basic string manipulation, as well as some of the more important "gotchas" that go along with working with strings.

Why is this important to learn about?

Strings are probably the most common representation of data, if you stop and think about it. Of all the programs you've written or worked with, how much of the data they dealt with was in a text format? That's why strings are so important: they are a computer's representation of text. Learning how to properly manipulate text will greatly benefit all your future work.

Definitions of terms used.

string: "In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and/or the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. A string may also denote more general arrays or other sequence (or list) data types and structures."
Concatenation: "...string concatenation is the operation of joining two character strings end-to-end"


Notes:
  • All examples were created using Visual Studio 2012, targeting the .NET Framework 4.5. We'll do our best to point out anything that might not work in older versions.
  • The definition mentions two kinds of string: a literal constant and a variable. A previous tutorial on built-in types covers the difference between literals and variables, and gives examples of string literals. If you are still confused on this concept, please leave a comment and we will cover it more deeply. From this point on, the tutorial will assume you understand how to use variables and string literals.


String Manipulation

So, on to the tutorial. We'll start by listing a few useful operations. Then we'll list methods and discuss their use, in order of what we feel is most used/useful. For a full reference of the String type that includes all methods, follow this link: http://msdn.microsof...ng_methods.aspx

Basic Strings

Strings are all of type System.String. You might have noticed that many online resources use the lower-case string rather than the capital String. Don't worry about the difference. C# uses some keywords as aliases for built-in types. So you can use them interchangeably (just like int and Int32), though I'd suggest picking one and sticking with it.

Strings are delimited by the double-quote mark, whereas characters are delimited by single-quotes. Important note: if you're ever writing code in something like a word processor, or you've copied code from certain web sites, you might run into serious issues. Word processors and some sites automatically replace straight quotes ( ' , " ) with curved quotes ( ‘ ’ , “ ” ). The C# compiler doesn't honor those as string or char delimiters, so make sure to avoid them or replace them with the correct characters.

Chars and strings are not interchangeable. Strings are, at their very core, an array of chars. So, individual pieces of strings can be represented as chars, and any char can be made into a string (basically an array with one element), but not all strings can fit into a char.
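
To make the relationship concrete, here is a minimal sketch of converting between the two types. The class and method names (CharStringDemo, CharToString, StringToChars) are made up for illustration; ToString and ToCharArray are the real framework calls.

```csharp
using System;

public static class CharStringDemo
{
    // Any single char can be turned into a one-character string.
    public static string CharToString(char c)
    {
        return c.ToString();
    }

    // A string can be copied out into an array of its chars.
    public static char[] StringToChars(string s)
    {
        return s.ToCharArray();
    }

    public static void Main()
    {
        string asString = CharToString('h');
        char[] pieces = StringToChars("hello");

        Console.WriteLine(asString);       // h
        Console.WriteLine(pieces.Length);  // 5

        // char tooBig = "hello";  // does not compile: a string is not a char
    }
}
```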

You can access any character in a string by treating the string as an array. String defines a read-only indexer, so you can read each character directly from the string itself (but you cannot assign to it). Here are a few examples showing how to access individual characters of a string:

string stringExample = "hello";
char third = stringExample[2];

for (int i = 0; i < stringExample.Length; i++)
{
    Console.WriteLine(stringExample[i]);
}

foreach (char c in "test")
{
    Console.WriteLine(c);
}




Strings can be declared as literals, references to other strings, or by calling the string constructor with a character array. Examples:

string s1 = "hello";
//using a char array
char[] chars = {'w', 'o', 'r', 'l', 'd'};
string s2 = new string(chars);
//using a char array without storing it separately
string s3 = new string(new[] {'F','o','o'});
string s4 = s1;


Concatenation

Strings can be concatenated (joined) by using the + operator. When you concatenate two strings, you end up with a third, completely new string. Example:

string s1 = "this is";
string s2 = "a test";
string combined = s1 + " " + s2; 
//combined now contains "this is a test"


You can also use the += operator to concatenate a string with another string and store it in the first variable:

string s1 = "this is";
s1 += " a test";
//s1 now contains "this is a test"


note: see section on string immutability

String Equality

If you've used Java before, you're probably used to seeing things like string1.equals(string2). They do that because Java does not support operator overloading, so == on strings keeps its default behavior, which is to determine whether two references point to the same object.

In .NET, Object uses the same behavior. However, the C# developers overloaded the == and != operators to actually compare string contents rather than string references.

The upshot of this is that you can compare strings the same way you compare most other types, using the == and != operators. To achieve the Java behavior, you can cast your strings to objects, or use Object.ReferenceEquals:

string s1 = "test";
string s2 = new StringBuilder("te").Append("st").ToString();
Console.WriteLine(s1 == s2);                      //true
Console.WriteLine((object)s1 == (object)s2);      //false
Console.WriteLine(Object.ReferenceEquals(s1, s2));//false



note: for more information, see the StringBuilder section. For more information as to why that is necessary for this example, see the String Interning section.

Common String Methods

So, we've covered the basics of strings: what they are, how to create them, how to join them, etc. Now let's delve into some of the more advanced topics, like using the methods that string exposes.

Split

Splitting strings is probably one of the most common questions we've seen on this forum. Which makes sense, really. It's easy to load text in big chunks, like lines of a file, or the entire file itself. But sometimes you need to operate on individual parts of a string. For that, we use the Split method defined on string.

String.Split can split strings around either single chars, or by other strings. The benefit of using another string is that you can split against a sequence of characters. The downside is that, in the .NET Framework, there are no overloads that let you give just a string, or a list of strings (.NET Core 2.0 and later add overloads that take a single string). If you need to split a string around a substring, you must pass a string array and a StringSplitOptions value (None or RemoveEmptyEntries). This is less complicated than it sounds, as you can see in the examples provided.

When you split a string, you get an array of strings back. We call the pieces of the original string "tokens".

string test1 = "this.is.a.test";
string test2 = "another, test! really.";
string test3 = "split around || double pipes";

//using a single char
string[] tokens1 = test1.Split('.');

//using many chars
string[] tokens2 = test2.Split(',', ' ', '!');

//using many chars with StringSplitOptions
string[] tokens3 = test2.Split(new[] {',', ' ', '!'}, StringSplitOptions.None);

//using strings and StringSplitOptions
//note, there is no method that splits around a single string
//also, all overloads that use strings as splitters require a StringSplitOptions value
string[] tokens4 = test3.Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries);



Reader exercise: write loops that will print the value of each element. This will help you learn how each overload behaves, as well as the difference between the StringSplitOptions.
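
As a starting point for that exercise, here is a small sketch on a different input that shows what the two StringSplitOptions values do when two separators sit next to each other. The class and method names are hypothetical; only Split and StringSplitOptions come from the framework.

```csharp
using System;

public static class SplitOptionsDemo
{
    // None keeps the empty token found between two adjacent separators.
    public static string[] SplitKeepingEmpties(string input)
    {
        return input.Split(new[] { ',' }, StringSplitOptions.None);
    }

    // RemoveEmptyEntries drops empty tokens from the result.
    public static string[] SplitDroppingEmpties(string input)
    {
        return input.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string input = "a,,b";

        // Brackets make the empty token visible: [a] [] [b]
        foreach (string token in SplitKeepingEmpties(input))
        {
            Console.WriteLine("[" + token + "]");
        }

        // Only the non-empty tokens remain: [a] [b]
        foreach (string token in SplitDroppingEmpties(input))
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```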

Aggregate

Not technically a string method, but one that operates on a collection of strings to return a single string. What we like to think of as the logical opposite of splitting a string. To fully understand the example, you need to understand Lambdas, and possibly LINQ, but we still think it's an important tool to know.

Aggregate is not defined on String, nor is it defined on Array. It's an extension method defined in the System.Linq namespace, and operates on IEnumerable<T>. However, for this tutorial, we'll limit it to just strings.

What the Aggregate method lets you do is basically stitch together an array (or other collection) of strings. You provide a function that tells the Aggregate method how to combine the individual elements. We usually provide this method in the form of a Lambda expression. Here's an example:

string test = "this is a test";
string[] tokens = test.Split(' ');
//we're going back to where we started!
string test2 = tokens.Aggregate((aggregate, current) => aggregate + " " + current);



Without going too deeply into what's happening, we've told the code to take all the strings in the array, and join them with spaces in between. Behind the scenes, the code loops through the array, and performs the operation you specified on each element. The aggregate parameter is the result of each previous operation, and the current parameter represents the current string in the loop. The result of the operation is stored in the aggregate variable and passed back in again on the next round of the loop.
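
If the lambda is hard to picture, the following sketch places the Aggregate call next to a hand-written loop that does roughly the same work. It is an approximation of the behavior described above, not the actual LINQ implementation, and the class and method names are made up. Both versions assume the array has at least one element.

```csharp
using System;
using System.Linq;

public static class AggregateDemo
{
    // The LINQ version from the tutorial.
    public static string JoinWithAggregate(string[] tokens)
    {
        return tokens.Aggregate((aggregate, current) => aggregate + " " + current);
    }

    // A hand-written loop that mirrors it: the first element seeds the
    // running result, and every later element is combined into it.
    public static string JoinWithLoop(string[] tokens)
    {
        string aggregate = tokens[0];
        for (int i = 1; i < tokens.Length; i++)
        {
            string current = tokens[i];
            aggregate = aggregate + " " + current;
        }
        return aggregate;
    }

    public static void Main()
    {
        string[] tokens = "this is a test".Split(' ');
        Console.WriteLine(JoinWithAggregate(tokens)); // this is a test
        Console.WriteLine(JoinWithLoop(tokens));      // this is a test
    }
}
```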

Substring

Sometimes, you need to get part of a string, but you do not have an easy pivot to split against. That's no big deal, you can always grab any part of a string if you know where to start (and optionally, how big of a part you want). A piece of a string is called a substring. Very simple:

string test = "this is a test";
string sub1 = test.Substring(5);    //"is a test"
string sub2 = test.Substring(5, 2); //"is"



Format

Format is probably one of the most important string methods for you to learn, if not the most important. Format lets you take a "format string" and some arguments, and it will replace the format items in the format string with a formatted representation of each argument in the correct position. Sounds way more complicated than it is. Basic example:

int one = 1;
string test = string.Format("The number {0}", one);
//test now contains "The number 1"



Format items are represented by a number surrounded by curly brackets ( "{0}" ). You can include multiple format items in your format string:

DateTime date = DateTime.Today;
int number = 5;
string hi = "Hello!";
string formatString = "{0}{3}Today is the {1}th day of the month. {2}{3}See?";
string result = string.Format(formatString, hi, number, date, Environment.NewLine);


Notice how each format item has a number. Each number corresponds to one of the parameters that follows the format string, starting from zero. Notice how we've used the format items out of order, and some more than once.
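
Because the result above depends on today's date, here is a smaller sketch with a fixed output that shows the same two ideas: format items used out of order, and one item used twice. The Greet method and its parameter names are hypothetical.

```csharp
using System;

public static class FormatDemo
{
    // {0} refers to name and {1} to greeting, whatever order they appear in.
    public static string Greet(string name, string greeting)
    {
        return string.Format("{1}, {0}! {1}?", name, greeting);
    }

    public static void Main()
    {
        Console.WriteLine(Greet("world", "Hello")); // Hello, world! Hello?
    }
}
```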

We could (and indeed likely will) write an entire tutorial on formatting strings. It's extremely flexible, and you can create your own custom formatters to further extend its capabilities.

Also note that Console.WriteLine has overloads that match string.Format's overloads. This is because under the covers, Console.WriteLine will call string.Format, so you can just use Console.WriteLine if you're only outputting to the Console:

Console.WriteLine("This is {0}!", "awesome");


Last note: String.Format is a static method. It cannot be called on a string instance, only on the string type itself.

Further reading:
http://blog.stevex.n...ting-in-csharp/

IndexOf

Sometimes, you need to find where in a string a particular character (or sequence of characters) occur(s). To do this, we use the IndexOf method:

string test = "test";
int index1 = test.IndexOf("e");            //1
int index2 = test.IndexOf("st");           //2
int index3 = test.IndexOf("somethingelse");//-1


Pretty self-explanatory. If the search value is not found, it will return -1. Otherwise, it will return the index of the first instance it finds. If you search for a string, it will return the index of the first character of the first occurrence.

One particularly useful use for IndexOf is when you need to get a substring, but you don't know exactly what index you need to start at. Use IndexOf to find the first match, then use that index for the start parameter of Substring.
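
A minimal sketch of that combination follows. The ValueAfter helper is hypothetical, and returning null when the separator is missing is just one possible choice; the point is that the -1 case has to be handled before calling Substring.

```csharp
using System;

public static class IndexOfSubstringDemo
{
    // Returns everything after the first occurrence of separator,
    // or null if the separator does not appear at all.
    public static string ValueAfter(string input, char separator)
    {
        int index = input.IndexOf(separator);
        if (index == -1)
        {
            return null; // not found, so there is nothing to cut
        }
        return input.Substring(index + 1);
    }

    public static void Main()
    {
        Console.WriteLine(ValueAfter("name=Curtis", '=')); // Curtis
        Console.WriteLine(ValueAfter("no separator", '=') == null); // True
    }
}
```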

Replace

Often, we need to take a string, swap out some of its characters, and replace them with something else. To do this, we use the Replace method. Replace can take either a char or a string, and replace all instances of it with another char or string (respectively). Also note that you can (and really should) use Replace to remove characters from a string, by replacing with String.Empty (which is a static representation of an empty string). Don't confuse this with the Remove method, which removes by index rather than character. If you need to remove all instances of a particular string or character in another string, use Replace rather than Remove.

string bad1 = "s0mething";
string bad2 = "b@#d";
string bad3 = "is. happening.";

string fixed1 = bad1.Replace('0', 'o');
string fixed2 = bad2.Replace("@#", "a");
string fixed3 = bad3.Replace(".", String.Empty);
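
To show the difference between Remove and Replace side by side, here is a short sketch using the same input as bad3 above. The wrapper method names are made up; Remove(startIndex, count) and Replace are the framework methods.

```csharp
using System;

public static class RemoveVersusReplaceDemo
{
    // Remove works by position: delete one character at the given index.
    public static string RemoveAt(string input, int index)
    {
        return input.Remove(index, 1);
    }

    // Replace works by content: delete every occurrence of unwanted.
    public static string RemoveAll(string input, string unwanted)
    {
        return input.Replace(unwanted, String.Empty);
    }

    public static void Main()
    {
        string bad = "is. happening.";
        Console.WriteLine(RemoveAt(bad, 2));    // is happening.
        Console.WriteLine(RemoveAll(bad, ".")); // is happening
    }
}
```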



Other useful string methods

We won't go into each of these as deeply.

Trim

You can clip whitespace (or specific characters) from the ends of strings using Trim. If you only want to trim one side, use TrimStart and TrimEnd:

string bad1 = "   this is a string   ";
string bad2 = "abthis is a stringab";
string bad3 = "  this is a string";
string bad4 = "this is a string   ";

string fixed1 = bad1.Trim();
string fixed2 = bad2.Trim('a', 'b');
string fixed3 = bad3.TrimStart();
string fixed4 = bad4.TrimEnd();
//all fixed strings contain "this is a string"


Cases

Converting a string to all upper-case or all lower-case is a simple matter:

string up = "abcde".ToUpper();
string down = "ABCDE".ToLower();


Please use this responsibly. People love to think that they can do a case-insensitive comparison with .ToLower or .ToUpper, but there are better tools for the job.

Equals

We mentioned earlier that you don't have to use the Equals method to compare strings. However, you don't always want to compare apples to apples. Sometimes, we want to compare strings regardless of case, or even culture. There is an overload on String.Equals that allows you to do these comparisons. Read up here:

http://msdn.microsof...y/c64xh8f9.aspx
http://stackoverflow...son-enumeration
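
As a quick sketch of one such overload, the static String.Equals method accepts a StringComparison value. OrdinalIgnoreCase is used here as an assumption about what you want; other values (such as the culture-aware ones) may suit your case better, so check the links above. The helper name is hypothetical.

```csharp
using System;

public static class CaseInsensitiveDemo
{
    // Compares two strings while ignoring case, without creating
    // upper-case or lower-case copies. The static form also accepts nulls.
    public static bool SameIgnoringCase(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(SameIgnoringCase("Hello", "HELLO")); // True
        Console.WriteLine(SameIgnoringCase("Hello", "World")); // False
        Console.WriteLine(SameIgnoringCase(null, "Hello"));    // False
    }
}
```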

Checking Null and Empty

There's a static method on the string class called String.IsNullOrEmpty. This returns true if a string is either null or a completely empty string. Frequently useful. In .NET 4.0, another method was added, String.IsNullOrWhiteSpace. This behaves similarly to the previous method, but will also return true if the string is nothing but whitespace. In effect, it's the equivalent of this code: String.IsNullOrEmpty(str) || str.Trim().Length == 0, though MSDN claims IsNullOrWhiteSpace performs better.
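
Here is a small sketch that runs both checks over a few inputs so you can see where they disagree. The Describe helper is made up for illustration.

```csharp
using System;

public static class NullOrEmptyDemo
{
    // Reports what each check says about the same input.
    public static string Describe(string input)
    {
        return string.Format("empty: {0}, whitespace: {1}",
            String.IsNullOrEmpty(input),
            String.IsNullOrWhiteSpace(input));
    }

    public static void Main()
    {
        Console.WriteLine(Describe(null));   // empty: True, whitespace: True
        Console.WriteLine(Describe(""));     // empty: True, whitespace: True
        Console.WriteLine(Describe("   "));  // empty: False, whitespace: True
        Console.WriteLine(Describe("text")); // empty: False, whitespace: False
    }
}
```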

Escape Sequences

We know that there are some characters that cannot be typed or displayed as ordinary text, but they do exist. We call these Non-Printable Characters. Some of these are more useful than others, like newline characters, but how do we use them?

In C# and many other C-like languages, we can represent certain characters with "Escape Sequences", which are a backslash ( \ ) followed by a sequence of characters. For example, to enter a "tab" character into a string, you use '\t'. \r matches the carriage return (CR) whereas \n matches the line-feed character (LF). Since Windows likes to use CRLF line endings, we can represent that with "\r\n". On UNIX systems, newlines are just line feeds ("\n").

Another notable escape sequence is the Unicode escape, which follows the general pattern of '\uXXXX', where XXXX is the hex code of the character in the Unicode tables. For example, set a string to "\u263A" and print it to the console. See what comes out?

Lastly, how do we represent characters that strings themselves use? How can we make a string with a backslash, or with a quotation mark? Simple, the escape sequence for a backslash is '\\' (two backslashes). If you set a string to "\\" then print it, you will see that only one backslash is printed. The escape sequence for quote marks is '\"'. Set a string to "\"", and you'll see that it contains one quote mark.
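
One way to convince yourself that each escape sequence stands for a single character is to check string lengths. The sketch below does that; the class and method names are hypothetical, and it prints lengths rather than the characters themselves because how the characters display depends on your console.

```csharp
using System;

public static class EscapeSequenceDemo
{
    // The length counts the characters the escapes produce,
    // not the characters typed in the source code.
    public static int LengthOf(string value)
    {
        return value.Length;
    }

    public static void Main()
    {
        Console.WriteLine(LengthOf("a\tb"));    // 3: a, tab, b
        Console.WriteLine(LengthOf("\r\n"));    // 2: CR and LF
        Console.WriteLine(LengthOf("\\"));      // 1: one backslash
        Console.WriteLine(LengthOf("\""));      // 1: one quote mark
        Console.WriteLine(LengthOf("\u263A"));  // 1: one Unicode character
    }
}
```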

Verbatim Strings

Sometimes, the last thing we want is to escape a string. For example, a file path. We know that Windows uses backslashes for file paths, so naturally a path might be "C:\test.txt". Unfortunately, that will not find your file, because \t is read as the tab character, and there is no file whose name starts with a tab! So, we either make our file paths like this: "C:\\test.txt", or we use Verbatim Strings. Verbatim strings do not honor escape sequences, and can span line breaks (which normal strings cannot).

To create a verbatim string, simply prefix a normal string with an at symbol ( @ ) :

string path = @"c:\test.txt";


Note that the at symbol is outside of the quote marks.

There is one problem: if you can't use escape sequences, how can you embed a quotation mark in a verbatim string? Use two:

string s = @"this is a ""test""";
Console.WriteLine(s); //output: this is a "test"
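
The sketch below checks that a verbatim string and its escaped equivalent really are the same text, and shows a verbatim string spanning a line break. The path is a made-up example and the class and method names are hypothetical.

```csharp
using System;

public static class VerbatimStringDemo
{
    // Both literals describe the same characters, so this returns true.
    public static bool PathsMatch()
    {
        return @"C:\temp\test.txt" == "C:\\temp\\test.txt";
    }

    // The same holds for doubled quotes versus escaped quotes.
    public static bool QuotesMatch()
    {
        return @"say ""hi""" == "say \"hi\"";
    }

    public static void Main()
    {
        Console.WriteLine(PathsMatch());  // True
        Console.WriteLine(QuotesMatch()); // True

        // A verbatim string can continue onto the next source line.
        // The line break (and any indentation) becomes part of the string.
        string twoLines = @"line one
line two";
        Console.WriteLine(twoLines.Contains("\n")); // True
    }
}
```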


Gotchas and other things you may not know about strings

Immutability

By this point in the tutorial, you may have noticed something: all the string methods seem to return a value, rather than changing the string in place. That's because strings in C# are immutable, which means that once a string is created, it cannot be modified in any way. This throws lots of people off, because they don't really understand what that means. Here's a common question: "If strings are immutable, why can I do this?"

string example = "test";
example = "test2";


"That's mutating a string!" Actually, it isn't. Remember that strings are reference types. So what you're actually changing there is a memory address, not a string. You've said that the variable example now points to "test2" (which will be somewhere in the intern pool, read the next section for more info), but the old "test" string didn't change. It just doesn't have anyone pointing at it anymore.

There are several justifications for this behavior, such as thread safety, but like it or not, you have to get used to it. There's officially no way to modify strings in-place. The implications of this, as previously mentioned, are that all the string methods return a new string (or a reference to a string in the intern pool, again, see next section) rather than operating directly on the string in question.
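
A common mistake that follows from this is calling a string method and throwing away the result. Here is a minimal sketch of the mistake and the fix, plus a second variable that keeps pointing at the old string after a +=. The method names are made up for illustration.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Mistake: ToUpper returns a new string, which is discarded here,
    // so the input comes back unchanged.
    public static string ShoutIncorrectly(string input)
    {
        input.ToUpper();
        return input;
    }

    // Fix: use the string that the method returns.
    public static string ShoutCorrectly(string input)
    {
        return input.ToUpper();
    }

    public static void Main()
    {
        Console.WriteLine(ShoutIncorrectly("test")); // test
        Console.WriteLine(ShoutCorrectly("test"));   // TEST

        string example = "test";
        string alias = example;      // both variables refer to the same string
        example += "2";              // builds a new string; "test" is untouched
        Console.WriteLine(example);  // test2
        Console.WriteLine(alias);    // test
    }
}
```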

Interning

Further reading: http://en.wikipedia....tring_interning , http://msdn.microsof...ing.intern.aspx

In C# (and many other languages, Java included), string literals are pooled to reduce memory overhead. This has some benefits and downsides, and you could go your entire programming career and never run into any of them. For most programmers, this is just something that happens behind the scenes and keeps your application's memory footprint smaller than it would be. For some extremely performance/memory intensive applications, you might need to understand string interning and how it works.

The simple explanation is that if I re-use the same string literal, I'm not creating a new string. I'm re-using the same string from the last time I used it. There are gotchas here too, because interning doesn't always behave exactly like you might expect. Consider the following example:

string s1 = "test";
string s2 = "test";
string s3 = "te" + "st";
string s4 = new StringBuilder("te").Append("st").ToString();
string p1 = "te";
String p2 = "st";
string s5 = p1 + p2;
Console.WriteLine(ReferenceEquals(s1, s2));  //true
Console.WriteLine(ReferenceEquals(s1, s3));  //true
Console.WriteLine(ReferenceEquals(s1, s4));  //false
Console.WriteLine(ReferenceEquals(s1, s5));  //false



s3 is one you might expect not to match s1, because it is composed from two other literals. However, if you check the IL, you'll notice that the compiler has optimized away the concatenation:

  IL_0001:  ldstr      "test"
  IL_0006:  stloc.0
  IL_0007:  ldstr      "test"
  IL_000c:  stloc.1
  IL_000d:  ldstr      "test"
  IL_0012:  stloc.2



For those of you that don't read IL, just notice that all three ldstr instructions load the same "test" literal; there is no separate "te" or "st". Then, when the runtime gets instructions to load three strings onto the stack, it will push the same reference three times.

As we said, it's entirely possible to go through a business programming career without ever understanding this concept. Not advisable, but you rarely are required to know this. Side note: string interning is possible because of string immutability. This is also why it is necessary; if you can never change strings, you might end up with a big pile of them, so might as well re-use them for efficiency.
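
If you do need a string built at runtime to share the pooled copy, the String.Intern method linked above is the tool for it. The sketch below follows the pattern from this section's example; the class and method names are hypothetical, and the results shown assume the default runtime behavior where string literals are interned.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    // Builds "test" at runtime so the compiler cannot fold it into a literal.
    public static string BuildAtRuntime()
    {
        return new StringBuilder("te").Append("st").ToString();
    }

    // A string built at runtime is a separate object from the literal.
    public static bool BuiltMatchesLiteral()
    {
        return ReferenceEquals("test", BuildAtRuntime());
    }

    // String.Intern returns the pooled reference for the same contents.
    public static bool InternedMatchesLiteral()
    {
        return ReferenceEquals("test", String.Intern(BuildAtRuntime()));
    }

    public static void Main()
    {
        Console.WriteLine(BuiltMatchesLiteral());    // False
        Console.WriteLine(InternedMatchesLiteral()); // True
    }
}
```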

String Building, and StringBuilder

You may have noticed two things at this point: 1) We've used a class called StringBuilder in our examples a few times, but haven't explained it, and 2) if strings are immutable, what do you do if you really need a mutable string?

Use a StringBuilder, of course. But what is it, and why would we use one? A StringBuilder represents a mutable string of characters. Sounds simple enough. You can almost think about it like a List<char> compared to a char[]. It's not, but the comparison is useful. You can add characters to a string builder, delete them, insert them at specific indexes; all the things you can't do to a string.

When to use it and why? I'll let Microsoft do the explaining. From the last link:

Quote

Consider using the String class under these conditions:
  • When the number of changes that your app will make to a string is small. In these cases, StringBuilder might offer negligible or no performance improvement over String.
  • When you are performing a fixed number of concatenation operations, particularly with string literals. In this case, the compiler might combine the concatenation operations into a single operation.
  • When you have to perform extensive search operations while you are building your string. The StringBuilder class lacks search methods such as IndexOf or StartsWith. You'll have to convert the StringBuilder object to a String for these operations, and this can negate the performance benefit from using StringBuilder. For more information, see the Searching the text in a StringBuilder object section.


Consider using the StringBuilder class under these conditions:
  • When you expect your app to make an unknown number of changes to a string at design time (for example, when you are using a loop to concatenate a random number of strings that contain user input).
  • When you expect your app to make a significant number of changes to a string.


Different people have tested the performance implications and come up with different results, so it's hard to give an actual value to "extensive" or "significant number of". Experiment with it, you'll understand when it's appropriate.

As to why we've used StringBuilder in the interning examples, that's because we needed an example where the compiler wouldn't optimize our "bad" code into good code. Since the compiler optimizes "te" + "st" to "test", we used a StringBuilder to concatenate the strings together.
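
Since the tutorial never shows StringBuilder doing its real job, here is a short sketch of the basic operations: Append, Insert, Remove, and a loop that builds a list. The class and method names are made up, and this is only meant to show the API, not to make a performance claim.

```csharp
using System;
using System.Text;

public static class StringBuilderDemo
{
    // Edits the same builder in place, then converts it to a string once.
    public static string EditInPlace()
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("this is");           // this is
        sb.Append(' ').Append("test");  // this is test
        sb.Insert(8, "a ");             // this is a test
        sb.Remove(0, 5);                // is a test
        return sb.ToString();
    }

    // Builds a comma-separated list in a loop, the kind of repeated
    // concatenation the quote above suggests a StringBuilder for.
    public static string BuildList(int count)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            if (i > 0)
            {
                sb.Append(", ");
            }
            sb.Append(i);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EditInPlace());  // is a test
        Console.WriteLine(BuildList(3));   // 0, 1, 2
    }
}
```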

In Conclusion

Strings are deeply important to .NET programming. We haven't covered all string methods, just the more common ones. More may be edited into this tutorial as need be. Please leave any questions you have in the comment section.

See all the C# Learning Series tutorials here!

More reading:
MSDN String Programming Guide


Replies To: String Manipulation

#2 Mylo

Posted 16 September 2013 - 08:59 PM

I was actually reading up on strings (I'm learning C#) as I noticed this Thread. It definitely will have saved some time. Thank you.

Also, an error:
string[] tokens4 = test1.Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries);

should be
string[] tokens4 = test3.Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries);

This post has been edited by Mylo: 16 September 2013 - 09:05 PM


#3 Curtis Rutland

Posted 17 September 2013 - 06:02 AM

Thanks for the catch, fixed.