Page 1 of 1

The difference between C-strings and string literals

#1 BlueGalaxy

Posted 29 May 2018 - 12:00 PM

Hello, programmers! You probably all know that a C-string is just a character array that ends with a '\0' char. There was a very good tutorial on this subject by v0rtex. If you haven't read that tutorial, I would recommend that you read and understand it first, then come back to my tutorial.

The link to v0rtex's tutorial:
https://www.dreaminc...-style-strings/

My tutorial is intended to be an extension to v0rtex's tutorial. I aim to answer the question, what exactly is a string literal, and how is it different from a C-string?
From reading the previous tutorial, you all know that to create a C-string we do:
char string[] = "hello";


What really happens behind the scenes is that the string literal "hello" is copied char by char into the array, including the terminating '\0'. If the size is not specified, the compiler automatically allocates just enough space in the char array to store every char plus the '\0', which is 6 chars here.

But what is the string literal itself? Where is it located in memory? A string literal is just a sequence of chars laid out one after another, ending with '\0', much like a char array. The difference is that a char array is explicitly defined by the programmer, while the storage for a string literal is set up by the compiler. Where does the compiler put it? The C standard does not say, but on typical desktop systems a string literal is stored in a read-only part of the program's memory (often called the read-only data or text segment). Data in this region can be read, but it can't be written to or modified. This read-only memory is completely separate from the stack, where local variables such as ints and char arrays are stored.

Since C-strings are a feature that C++ inherited from the C programming language, the code for this tutorial is in C. The same ideas apply to C++ too, although standard C++ requires const char* for a pointer to a string literal. Please look at the source code, then read my explanation below it.

#include <stdio.h>   // C standard input/output - for printf()
#include <stdlib.h>  // C standard library      - for EXIT_SUCCESS and size_t
#include <ctype.h>   // char type functions     - for toupper()

int main(void) {
  size_t i;

  // char* s stores the base address of the string literal
  // it points to the first char in the string literal, 'h'
  char* s = "hello";

  for (i = 0; *(s+i) != '\0'; i++) {
    printf("%c", *(s+i));
  }
  printf("\n");

  // Segmentation fault!
  /*
  s[0] = (char) toupper(s[0]);
  *(s+0) = (char) toupper(*(s+0));
  */

  printf("\n");

  // "Unroots" the char* s!
  for (; *s != '\0'; s++) {
    //printf("%c", *s);
    printf("%p\t%c\t%hhu\n", s, *s, *s);
  }
  printf("\n");

  printf("%p\t%c\t%hhu\n", s, *s, *s);

  printf("\n");

  char* const r = "hello";

  for (i = 0; *(r+i) != '\0'; i++) {
    printf("%c", *(r+i));
  }
  printf("\n");

  printf("\n");

  for (i = 0; *(r+i) != '\0'; i++) {
    printf("%p\t%c\t%hhu\n", (r+i), *(r+i), *(r+i));
  }

  /*
  for (; *r != '\0'; r++) {
    printf("%c", *r);
  }
  printf("\n");
  */

  printf("\n");

  char string[] = "hello";
  
  printf("%p\n", r);
  printf("%p\n", string);

  return EXIT_SUCCESS;
}



This is the output that is generated when the program is run:

hello

0x400854	h	104
0x400855	e	101
0x400856	l	108
0x400857	l	108
0x400858	o	111

0x400859		0

hello

0x400854	h	104
0x400855	e	101
0x400856	l	108
0x400857	l	108
0x400858	o	111

0x400854
0x7fff07f67090




Detailed Explanation:

Let's take a look at the main() function. There are a couple of constructs that I would like to point out.
int main(void) {
  size_t i;

  // more code here

  return EXIT_SUCCESS;
}


A function prototype specifies the number and types of the function's formal parameters. In C (before the C23 standard), a declaration written as f() says nothing about the parameters, so the compiler does not check the arguments you pass. To say that the function takes no arguments, use f(void). In C++, f() already means that the function takes no arguments, and C++ also supports the f(void) syntax for compatibility with C. Since we are writing C code, we use int main(void) to state that main() takes no arguments.

size_t is an unsigned integer type. It is the type of the result of the sizeof operator and the return type of strlen(). It is commonly used to store array indexes and the sizes of objects such as C-strings.

We usually write return 0; to indicate that the program finished successfully. The C standard guarantees that returning either 0 or EXIT_SUCCESS from main() reports success to the host environment, whatever value the operating system uses for that internally. The EXIT_SUCCESS macro is defined in <stdlib.h>, and its value is almost always 0. I like to use return EXIT_SUCCESS; because it is more readable than return 0;

The first line of code:
char* s = "hello";


This sets the char* s to point to the first char in the string literal, which on my system is stored in read-only memory.

I use the next loop to traverse the string literal and print each char one at a time.
  for (i = 0; *(s+i) != '\0'; i++) {
    printf("%c", *(s+i));
  }


You should already be familiar with pointer arithmetic. *(s+i) is just another way of writing s[i]. I used the pointer form to demonstrate a point. We are adding i to the char* s to compute a new address (pointer value) without changing s itself. The char* s stays at the base address of the string literal. It is not "unrooted" from this position.

The next two lines are commented out because they cause a segmentation fault!
  s[0] = (char) toupper(s[0]);
  *(s+0) = (char) toupper(*(s+0));


A segmentation fault occurs when a program tries to access memory in a way that it does not have permission for. Here we are trying to convert the first char in the string literal to upper case. Because toupper() returns an int, we cast the result back to a char. The problem occurs at the = assignment operator. The C standard says that modifying a string literal is undefined behavior. On my system the string literal lives in read-only memory, so the write is blocked and the program crashes with a segmentation fault. Another system might behave differently, so never rely on it.

The next for-loop has several interesting features:
  // "Unroots" the char* s!
  for (; *s != '\0'; s++) {
    //printf("%c", *s);
    printf("%p\t%c\t%hhu\n", s, *s, *s);
  }
  printf("\n");


Unlike the previous example, this one increments the char* s itself. Instead of sitting at the base address of the string literal, the char* s is "unrooted" and it travels along the string literal until it finds the '\0' char marking the end of the string literal. The initial statement of the for-loop is empty because we do not need the i.
"%p\t%c\t%hhu\n"


In this format string, %p prints the address (the pointer value) held in char* s, followed by a tab. %c then prints the char that s points to. Finally, %hhu prints the same char value as an unsigned number. "hh" is a length modifier that tells printf() how wide the value is. Without a modifier, %u expects an unsigned int, which is 32 bits on most current systems. "h" means short (usually 16 bits), and "hh" means char (8 bits on practically every system). You can remember it as a half and half of a half, but the exact sizes depend on the platform.

Look at the output on the screen! We can see the address at which each char of the string literal is stored. After the loop, char* s points to the '\0' char, directly after 'o'. '\0' is not a printable char, but it has the ASCII value 0. It is clear that char* s now points to a different memory location. Unless you saved the original address somewhere, you have lost access to the start of the string literal. How will you print it again? This is why you should prefer the first for-loop for printing the string literal char by char, because it does not "unroot" the pointer!

Another solution to this problem is to "glue" the pointer firmly at the base address. To do this we use a char* const, a char pointer that cannot itself be modified.
  char* const r = "hello";


Notice that the const qualifier is with the pointer. It means that the pointer itself is const: it cannot be "unrooted" from the base address.

You may also see code that looks like this:
  const char* r = "hello";


Notice that the const qualifier is with the char. It is with the base data type, not the pointer itself. This means that the data which the pointer points to cannot be modified, but the pointer itself may be set to point to another memory location. This is useful in the following scenario:
  char food[] = "apples";
  const char* r = food;


You want to create a char* to the base address of a C-string, but you do not want the pointer to be able to modify the array. String literals must never be modified at all. In C, however, the compiler does not stop you from pointing a plain char* at a string literal (as I did with s in the program above), so it is up to you to add the const. You should use the following syntax:
  const char* s = "hello";


Now, when you attempt to modify the string literal through s, you get an error at compile time, which is better than a segmentation fault at run time.

Let us return to the original example.
  for (i = 0; *(r+i) != '\0'; i++) {
    printf("%p\t%c\t%hhu\n", (r+i), *(r+i), *(r+i));
  }


The address, the char, and the ASCII number of each letter in the string literal are printed. Notice that we are offsetting i elements away from char* const r, which points to the first char in the string literal. Hmm... The address that r points to is the same address that s started at. When the same string literal appears more than once in your code, the compiler is allowed to store it only once and reuse it, and that is what happened here. The C standard leaves this choice to the compiler, so do not count on it.

Theoretically, if you were able to change string literals, changing the one that s points to would also change the one that r points to, because in this program both char pointers point at the exact same memory location.

The next piece of code is commented out.
  /*
  for (; *r != '\0'; r++) {
    printf("%c", *r);
  }
  printf("\n");
  */


Because r is a char* const, the pointer itself cannot be modified or "unrooted". It generates an error during compile time.
string_literals_tute.c: In function ‘main’:
string_literals_tute.c:45:23: error: increment of read-only variable ‘r’
   for (; *r != '\0'; r++) {
                       ^



At the end of the code, I create a C-string (char array) and initialize it with that same string literal "hello". This copies every char, including the terminating '\0', from the string literal into the C-string. The string literal itself remains unchanged.
  char string[] = "hello";
  
  printf("%p\n", r);
  printf("%p\n", string);


Now I print out the pointer value (address) of the first char in "hello", which r points to, and the pointer value of the first char in the C-string. As you can see, the addresses are different! This means that the C-string is a separate object with its own copy of the data. Unlike a char*, it does not point to the string literal itself.

Also notice that the address of the C-string is much larger than the address of the string literal.
0x400854
0x7fff07f67090


The pointer value, or address, of the string literal is a 6 digit hexadecimal number, and the pointer value of the C-string is a 12 digit hexadecimal number. The leading zeroes are not shown here. Even if you know nothing about the hexadecimal number system, you can clearly see that the C-string, being stored on the stack, is in a memory location very far away from the read-only segment where the string literal is stored. The exact addresses will be different on your machine and may even change from run to run.

Take a look at this diagram here.

Posted Image

As you can see, the stack, where the local variables (including arrays and structs) are stored, is far above the place where the string literals are stored. 2^N - 1 represents the largest possible address that some data could have. This is like looking at the entire memory of your computer as one byte-addressed array. If the array is 2^N in size, then the last element is located at index 2^N - 1. This is just a simplified diagram, but the general concept is the same.

In summary:
1. String literals are typically stored in a read-only part of the program's memory. They must not be modified.
2. C-strings are initialized by copying every char from the string literal into the C-string.
3. If you want to create a pointer to a string literal so that you can read or print it later:
  const char* const p = "hello";


The const with the char (base data type) prevents the internal data that the pointer points to from being modified (so you don't get a segmentation fault). The const with the pointer prevents the pointer from being "unrooted" from its position at the base of the string literal.
