
Obfuscated Code - A Simple Introduction primarily written in C, but most examples also work in C++

#1 jjhaag

Posted 25 November 2007 - 05:54 AM

A Simple Introduction to Obfuscated Code

Obfuscated code is code whose logic is intentionally difficult to follow and/or whose syntax is intentionally unclear. There are several reasons why one might move away from good programming practices to produce obfuscated code. These include intellectual property protection (preventing or slowing reverse engineering efforts on proprietary code), program security (making hacks and exploits difficult to find), and recreation.

In this tutorial, I'm focusing on a simple introduction to the fun kind of code obfuscation; this is not intended to help with code security or intellectual property protection. For some really cool examples of obfuscation, I recommend that you check out the International Obfuscated C Code Contest (IOCCC) website. Try looking at it before and after reading this tutorial; you'd be surprised at how much clearer some of the examples on that site are after a demonstration of a few simple techniques.

Note that many of the techniques for obfuscation are just straight-up horrible coding practices. They may be syntactically valid, and produce executables that run without errors, but for real-world situations (excluding the protective cases given above), code should be clear, easy to read, and eminently maintainable.

The code presented in this tutorial is in C, but the vast majority of obfuscatory techniques are equally applicable to C++ (one notable exception being the ability to call main() recursively in C but not in C++).



Some Simple Techniques for Obfuscation

The easiest way to get started in writing obfuscated code is to think back to when you were starting programming, right at the very beginning. What were the most difficult concepts to get working correctly? Pointers and arrays, function calls, loops, and recursive functions are the first things that come to my mind, both from my own experience and from what I've seen people struggle with on this site. So those are probably some good places to start when trying to get your code nice and tough to follow.



Identifier Names: Unlearn All Those Good Programming Practices
Before we get to those concepts, let's first consider identifier names (function, class, variable, etc. names). Identifiers in C are relatively straightforward, and most introductions to the language stress good programming practices when it comes to variable names. The names themselves should give a good idea of what they are supposed to be used for, and naming strategies should be consistent within your code. However, while this is good coding practice, it's not mandatory. The actual syntactic restrictions on identifier names are simple: they may contain only the 52 alphabetic characters (C is case-sensitive, remember), the 10 digits, and the underscore ( _ ) character, and they may not start with a digit. “Good” identifier names for obfuscated code include lots of similar identifiers, and lots of characters and digits that are easily confused, such as capital o's and zeros, and names that can be mistaken for hexadecimal literals (e.g. Ox3f, with a capital 'o' instead of a zero). Used judiciously with a few implicit conversions of character literals, this allows variable and function names that are difficult to distinguish from each other:
#include "stdio.h"

void _(int O, int _O, char _0) {
	for (; O<_O; ++O) {
		printf("%c\n",_0);
		_0++;
	}
}

int main() {
	_('O','o','O');
	return 0;
}

The above code uses a function named “_” to print all of the ASCII characters from 'O' (capital o) through 'n'. It also uses a for loop with an empty initialization clause, relying on the value passed to the function to supply the starting point. This is another great obfuscatory technique, especially if you've called many interdependent functions in sequence or are using recursion (more on recursion to follow).

Another (rather simplistic) way to use variable names in a deceptive way is to use the same names as are given by the argument list in a function prototype or definition, but to use them in a different order in the calling function than in the function header. The program presented below produces identical results to the one given above:
#include "stdio.h"

void _(int O, int _O, char _0) {
	for (; O<_O; ++O) {
		printf("%c\n",_0);
		_0++;
	}
}

int main() {
	int _0='O', O='o';
	char _O=79;
	_(_0,O,_O);
	return 0;
}


Just a quick note on identifiers in C++: they are more restrictive than in C. This perhaps explains why most of the obfuscated code you will find is written in C instead of C++. If you want to know what you can get away with in C++ in terms of identifiers, check out the standard.


Arrays and Indexes: Where Pointers Come to Get Freaky

C/C++ arrays are closely tied to pointers: in most expressions, the name of an array is converted to a pointer to its first element. There are situations in which a pointer to a particular datatype differs from an array of that same datatype, but I won't be getting into them here. The treatment of pointers and arrays presented here will deal with how the similarities between them can be exploited to produce code that is just plain weird.

Hopefully, you are familiar with the usage of the subscript operator [] for accessing a particular element of an array. The syntax array_name[index_variable] is used to access the (index_variable)th element of the array array_name. However, an equally valid way of accessing the same element involves recognizing that the name of an array, without subscripts, is basically an address, and can be used with pointer arithmetic to get to the element in question:
	int arr[]={0,1,2,3,4};	//declaration of the array arr
	
	arr[ind];		//element <ind> of arr
	
	arr;			//base address of arr
	
	arr + ind;		//pointer arithmetic - yields the address of the <ind>th
				//element of arr
	*(arr+ind);		//dereference the address of the <ind>th element;
				//equivalent to arr[ind]

This is (hopefully) nothing too new. However, this equivalence of arr[ind] and *(arr+ind) allows us to do the following fun little trick:
	arr[ind];
	//equivalent to:
	*(arr+ind);
	//...which, because addition is commutative, is equivalent to:
	*(ind+arr);
	//...which, finally, is equivalent to:
	ind[arr];

Oh, yes, those lines are all equivalent. It's pretty shoddy coding practice, but it's completely valid to access element ind of the array arr using either arr[ind] or ind[arr]. Interspersing normal array access calls with inverted array/index access and pointer arithmetic access in the same function or line of code can make a complete hash of the original intent of the expressions.

One more thing should be noted about pointers here. A pointer stores an address, but an address is really just a number, so it is entirely possible to store the value of a pointer (that is, the address of the pointee) in a normal integral datatype. The only restriction on this usage is that the data type chosen must be large enough to store the address. Traditionally, the size of a natural “word” on a machine has determined both the maximum addressable space on the system and the size of the int datatype. On most 32-bit machines, the maximum addressable memory space can be stored in a 4-byte number, and the size of an int on such machines is generally 4 bytes as well. However, this is not guaranteed (on typical 64-bit systems, pointers are 8 bytes while an int is still 4, so the conversion loses part of the address), and it shouldn't be counted on for code that is designed to be portable. Nonetheless, this is an interesting technique for obfuscation purposes, especially when passing “arrays” to functions. Using this technique requires passing the base address of the array as an implicitly or explicitly cast integral type (of sufficient size to hold the address), and then recasting it within the function to a pointer. The following snippet demonstrates this technique in printing an array of numbers:
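#include "stdio.h"

void myFunction(int integer) {
	int i;
	for (i=0; i<10; ++i) {
		//recast the integer to a pointer, then use pointer arithmetic
		printf("%d ",*((int*)integer+i));
	}
}

int main() {
	int array[]={0,1,2,3,4,5,6,7,8,9};
	//store the base address in an int; this assumes an int is wide
	//enough to hold an address, as on most 32-bit systems
	int integer=(int)array;
	myFunction(integer);
	return 0;
}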




The Conditional Operator ?: and the Comma Operator ,
The conditional operator ?: is frequently a source of confusion for those new to C/C++:
<variable_name> = (<test_condition>) ? (<value_if_true>) : (<value_if_false>);

Note that the parentheses are not obligatory in this case; they are included only to improve clarity. The same behavior can be achieved, though less succinctly, using an if-else statement:
if (<test_condition>) {
	<variable_name> = <value_if_true>;
}
else {
	<variable_name> = <value_if_false>;
}


The conditional operator is a very convenient little construct for writing compact code and reducing the number of if-else statements within code. It is also a great way to obfuscate code, because while it shares many properties with an if-else statement, the fact that it is made up of punctuation characters rather than English words means that it's one step further removed from human-intelligible instructions. Replacing any if or if-else statements in your code with the appropriate ?: expression is a great technique for obfuscating your code. However, the ?: operator takes exactly three operands: a test condition, an expression whose value is returned when the condition is true, and an expression whose value is returned when the condition is false. The blocks of an if-else statement, on the other hand, are not limited to a single expression. To get around this limitation, we can use the comma operator:
<variable_name> = (<test_condition>) ? (<expression_a1>, ..., <value_if_true>) : (<expression_b1>, ..., <value_if_false>);

The comma operator is used to evaluate multiple expressions where only a single expression is expected. The expressions are evaluated left to right, and every value except the last is discarded; the value of the right-most expression becomes the value of the whole. For instance, in the following code:
int a=0;
int b=3;
a = (b+=3, b-1);

the expression b+=3 is evaluated first, assigning b a value of 6, and then discarded. The variable a is then assigned a value of b-1, resulting in a=5. Combining the conditional operator and the comma operator often results in code that is quite difficult to follow. For instance, the following two snippets yield the same ultimate result, and follow the same chain of logic:
	int a=1;
	int b=2;
	int c=3;
	
	if (a<0) {
		b+=3;
		a=b-1;
	}
	else {
		c+=b;
		a=c+2;
	}


	int a=1;
	int b, c;
	a=(b=2, c=3, a<0) ? (b+=3, b-1) : (c+=b, c+2);

In both examples, the values of a, b, and c at the end of the code are 7, 2, and 5, respectively. But the chain of logic in the first is much clearer, unless you've had extensive experience with the ?: and , operators. Nest a few such operations (yes, conditional expressions can be nested), and you're well on your way to writing incomprehensible code.



Recursion

Recursive functions are perhaps one of the most difficult procedural programming concepts for new programmers to get their heads around. However, they can essentially be viewed as just another way of performing loops. Generally speaking, all types of loops in C/C++ can be interchanged, at least with the proper modifications or additions. Similarly, most looping structures can be rewritten as recursive functions; however, it is frequently necessary to either pass the index variable and termination condition as parameters, or to declare them as static variables within the function. The static variable approach works well enough if you only need to use a recursive function at a single point within a program; it becomes more difficult if you need to call it from, for example, several different locations within your main(), because the static state has to be reset between uses.

The following two snippets accomplish the same end result: they print the contents of an array to the screen. The first uses a traditional for loop; the second uses a recursive function that passes the array, the index variable, and the termination condition as parameters to the function. Additionally, it uses some of the techniques mentioned in the array/pointer section above, by casting the address of the array as an int in main, recasting it as a pointer in the function, and using pointer arithmetic for array access:
#include "stdio.h"

int main() {
	int array[]={0,1,2,3,4,5,6,7,8,9};
	
	int i;
	for (i=0; i<10; ++i) {
		printf("%d ",array[i]);
	}
	
	return 0;
}

#include "stdio.h"

void recursiveFunction(int array, int i, int stop) {
	printf("%d ",*((int*)array+i));
	i++;
	if (i < stop) {
		recursiveFunction(array, i, stop);
	}
}

int main() {
	int array[]={0,1,2,3,4,5,6,7,8,9};
	recursiveFunction((int)array, 0, 10);
	return 0;
}


Another way to obfuscate code with recursion is to call the main() function itself recursively. This is legal in C, but is expressly illegal in C++. In C, however, it can make for some really weird looking code. Until you've read over a few examples and understand what they are doing, nothing messes with your head quite like seeing calls to main() strewn all over a program. The following program uses recursive calls to main() itself to print the prime numbers between 0 and 20, in descending order. It also uses the conditional operator ?: in three places instead of conditional statements (if-statements):
#include "stdio.h"
#include "math.h"

int n=20;
int i, isPrime;

int main() {
	
	isPrime=1;
	for (i=2; i<=sqrt(n); i++) {
		isPrime *= (n%i == 0) ? 0 : 1;
	}
	
	(isPrime==1) ? printf("%d\n",n) : 0;
	n--;
	
	return (n>1 ? main() : 0);
}




Other Techniques for Obfuscation
This is a pretty basic introduction to obfuscation, though the techniques presented here can produce some downright mean-looking code. However, there are a couple of other techniques that I will mention briefly, before moving on to a full-blown (though necessarily simple) example of the process of obfuscation.

The first is formatting. C and C++ don't really care about whitespace for the most part, as long as it doesn't fall in the middle of a keyword or identifier (there are, however, potential limits on line length and similar factors, which stem from the minimum requirements that the relevant standard places on compilers rather than from absolute rules). So you can basically make a program look like anything you want (remember those shape poems from elementary school?) without affecting how it functions.

A more advanced technique is the use of the preprocessor. Most people don't give the preprocessor much consideration, aside from making sure that they have the proper #include directives for the headers they need, or have #ifdef / #endif in their custom header files to prevent multiple inclusion. However, the preprocessor can be a powerful (though sometimes finicky) ally for a coder, whether the goal is to produce beautiful, clean, easy-to-read code, or a giant pile of apparent garbage in the shape of a teddy bear. I won't cover the preprocessor here, as its usage can be fairly complicated, and there is already a tutorial on the preprocessor on this site. Maybe someone will write a tutorial on obfuscating code focussing on the preprocessor.



Putting It All Together: A Simple Example
Chances are, the first program you ever wrote was the classic “Hello World!” - regardless of which language you started out in. Here is (one) C-version of this classic program:
#include "stdio.h"

int main() {
	printf("Hello World!");
	return 0;
}

As a first step, let's convert the C-string literal “Hello World!” into an array of numbers, and do the printing from a function other than main:
#include "stdio.h"

void myFunction(int array[], int arraySize) {
	int i;
	for (i=0; i<arraySize; ++i) {
		printf("%c", array[i]);
	}
}

int main() {
	int array[]={72,101,108,108,111,32,87,111,114,108,100,33};
	myFunction(array,12);
	return 0;
}

Well, that's a little “better”. Unless you happen to know the ASCII table by heart, you're not likely to immediately know what the program is going to output. However, it should still be pretty obvious that it's going to print out an array of 12 elements. So let's play around with the array a bit. Instead of an array, we'll pass a number that stores the address of the array, and instead of using array subscripting, we'll try some pointer arithmetic:
#include "stdio.h"

void myFunction(int integer, int arraySize) {
	int i;
	for (i=0; i<arraySize; ++i) {
		printf("%c", *((int*)integer+i));
	}
}

int main() {
	
	int array[]={72,101,108,108,111,32,87,111,114,108,100,33}, integer=(int)array;
	myFunction(integer,12);
	return 0;
}

Now let's try making the printing function recursive, instead of using a for loop:
#include "stdio.h"

void myFunction(int integer, int arraySize, int i) {
	(i<arraySize) ? printf("%c", *((int*)integer + i++ )), myFunction(integer, arraySize, i) : 0;
}

int main() {
	
	int array[]={72,101,108,108,111,32,87,111,114,108,100,33};
	myFunction((int)array,12, 0);
	return 0;
}

We also needed to add the index variable to the argument list of the recursive function, so that we can keep track of where we are in the array. And because the function is recursive, we need a conditional that determines whether the function will get called again. So a comparison of the current index against the array size is used as the test condition of a conditional operator. The recursive call is placed as the “important” expression (i.e. right-most expression) in the <value_if_true> position, in a comma-separated list with the printf call. This allows the printf call to execute, followed by the recursive call. A zero is included as the <value_if_false>, though it could really be any value we wanted. Note that the conditional expression yields a result; in this case, we don't care about it and simply throw it away after each function call. (Strictly speaking, myFunction returns void, so the two branches have mismatched types; many compilers accept this as an extension, but may warn about it when asked to be pedantic.)

That array is still a little obvious, and anyone with an ASCII table could still figure this one out pretty easily. One thing that stands out to me is that the string contains 3 lower-case L's; this seems ripe for some manipulation. Let's take those out of the array, and insert a boolean test to print an 'l' if the index variable is at the proper positions in the string. Of course, if we do this thoughtlessly, we'll wind up either in an infinite loop or skipping some characters, because we'll still be incrementing the index variable each time. So let's add a second index variable to keep track of the number of 'l's that we've printed, and subtract that number from the index variable to access the proper array element:
#include "stdio.h"

void myFunction(int integer, int arraySize, int i, int k) {
	(i<arraySize) ? 
			(i==2)||(i==3)||(i==9) ?
					k++, i++, printf("%c",108) :
	printf("%c", *((int*)integer + i++ - k)), myFunction(integer,arraySize, i, k) :
			0;
}

int main() {
	
	int array[]= { 72, 101, 111, 32, 87, 111, 114, 100, 33 };
	myFunction((int)array, 12, 0, 0);
	return 0;
}

Note that we still need to increment the i variable even when we're not printing a character from the array; otherwise the recursion would never reach its stopping condition and we would be stuck in an infinite loop. Subtracting the value of k from i then keeps us from skipping over array elements. We're also now passing the second index variable along with each recursive call, so that we can keep track of the number of 'l's that we've printed out.

Now let's combine the two printf calls into a single call, by shifting around the location of the conditional operators:
#include "stdio.h"

void myFunction(int integer, int arraySize, int i, int k) {
	(i<arraySize) ? 
			printf("%c",(i==2)||(i==3)||(i==9) ?
					k++, i++, 108 :
			*((int*)integer + i++ - k)), myFunction(integer,arraySize, i, k) : 0;
}

int main() {
	
	int array[]= { 72, 101, 111, 32, 87, 111, 114, 100, 33 };
	myFunction((int)array, 12, 0, 0);
	return 0;
}

Now that array still stands out quite a bit, but I'm fine with that. Maybe we can switch some of the integers for character literals, and use the octal or hexadecimal equivalents in a couple of places. And while we're at it, let's rename a few of our variables:
#include "stdio.h"

void myFunction(int __, int _0, int O_, int ___) {
	(O_<_0) ? 
			printf("%c",(O_==2)||(O_==3)||(O_==9) ?
					___++, O_++, 108 :
			*((int*)__ + O_++ - ___)), myFunction(__,_0, O_, ___) : 0;
}

int main() {
	
	int array[]= { 72, 'e', 111, 040, 0127, 0x6F, 'r', 0x64, 041 };
	myFunction((int)array, 12, 0, 0);
	return 0;
}

Now (we're almost at the end, I promise), let's rename our function to something similar to at least one of our variables, and get rid of all of the unnecessary type specifiers. Under the old “implicit int” rule, omitted types (both function return types and parameter types) default to the int data type. C99 removed this rule, so not all compilers will let you do this: you'll probably get warnings, and some newer compilers reject the code outright unless told to compile an older dialect of C. But it will generally work:
#include "stdio.h"

_(__, _0, O_, ___) {
	(O_<_0) ? 
			printf("%c",(O_==2)||(O_==3)||(O_==9) ?
					___++, O_++, 108 :
			*((int*)__ + O_++ - ___)), _(__,_0, O_, ___) : 0;
}

main() {
	int array[]= { 72, 'e', 111, 040, 0127, 0x6F, 'r', 0x64, 041 };
	_(array, 12, 0, 0);
}


And finally, let's manipulate the formatting to obscure the flow of the conditional operators, and to mask the distinction between the function definition and main():
#include "stdio.h"
_(__,_0,O_,___){(O_<_0)?printf("%c",(O_==2)
||(O_==3)||(O_==9)?___++,O_++,108:*((int*)__
+O_++-___)),_(__,_0, O_, ___):0;}main(){int array
[]={72,'e',111,040,0127,0x6F,'r',0x64,041 };_(array
,
12
,
0
,
0
)
;
}

So there you have it. Pretty tough to tell that it's just that decades-old standby of newbie coders everywhere. It's not a great obfuscation (it certainly won't win you any contests), but it works, and you certainly can't tell at a glance what it does, so at least we've accomplished something.



Conclusions

Some of the examples given here have involved increasing the complexity of the code to a fair degree. This is fine; that's a solid approach to obfuscation, especially given the simplicity of the original program. However, clean, elegant, easily comprehensible code is almost always desirable from an artistic perspective. The only drawback is that tweaking your code to reduce its size is an intensive process, which is made all the more difficult by the whole process of obfuscation.

And, as with all coding techniques, the only way to REALLY get better at it is to write a lot of code. So go forth and mash up some simple functions into complete garbage. And remember, if you post obfuscated code to the forums, you may find help a little hard to come by, not because we dislike it, but because the whole point of this is to make the code difficult to read :)


Replies To: Obfuscated Code - A Simple Introduction

#2 Louisda16th

Posted 29 November 2007 - 06:35 AM

Interesting. I should try this if I ever have to write a C/C++ exam :P. Nice tutorial jjhaag! :)

#3 no2pencil

Posted 29 November 2007 - 08:00 AM

I did something similar to this in a program written in Assembler. In order to hide the URL link from someone armed with a hex editor, I put out a dummy variable for them to edit all day long. Meanwhile the real variable is built on the fly.

web_head		  db "http"
web_pre			   db "://"
web_serv		   db "www"
web_domain0		db ".ak"
web_domain1		db "ro"
web_domain2		db "nc"
web_domain3		db "dn"
web_domain4		db "r.c"
web_tail			 db "om",0
FakePage 		   db "http://www.akroncdnr.com",0; garbage variable



	.if eax==1001
		invoke	wsprintf,OFFSET szWebBuffer,OFFSET szCharFormat,OFFSET web_head,OFFSET web_pre,
		OFFSET web_domain0,OFFSET web_domain1,OFFSET web_domain2,OFFSET web_domain3,
		OFFSET web_domain4,OFFSET web_tail
		push	SW_SHOWDEFAULT
		push	0
		push	0
		push	OFFSET szWebBuffer; Built the url above
		push	OFFSET command
		push	hwnd
		call	ShellExecute



I will definitely use this tutorial for future software development! Thank you!

This post has been edited by no2pencil: 29 November 2007 - 08:01 AM


#4 jjhaag

Posted 02 December 2007 - 03:57 AM

Thanks for the kind words, both of you. I've also been looking into doing another one focusing on the preprocessor, but that may have to wait for a while.

Louisda16th, remind me not to take any classes from you :D

#5 born2c0de

Posted 03 December 2007 - 02:49 AM

Nice Work :^:
Although this Tutorial is about making the compiled code harder to disassemble, there's another branch of code Obfuscation that deals with jumbling up the source code itself using preprocessor directives. (I believe jjhaag's working on that since he mentioned the preprocessor ;) )

Check this link out. It's the Winners of the International Obfuscated C Code Contest.

Quote

I did something similar to this in a program written in Assembler. In order to hide the URL link from someone armed with a hex editor, I put out a dummy variable for them to edit all day long. Meanwhile the real variable is built on the fly.

That works for n00b reversers, but anyone armed with a debugger can find out the URL within seconds.

#6 NickDMax

Posted 05 December 2007 - 09:14 AM

born2c0de, on 3 Dec, 2007 - 02:49 AM, said:

That works for n00b reversers, but anyone armed with a debugger can find out the URL within seconds.


I once made a program that was encoded using a simple xor encoding... worked great against those nasty hex editors. Worked pretty good against disassembly (well mostly, it added a few extra steps). But those darn debuggers!!!

I have since learned a number of techniques that can frustrate the snooper (usually at the cost of my own sanity), but it is very hard to get past a code guru and his debugger. I remember reading about some hardware support for programs that should allow them to detect the presence of a debugger (and take alternate execution paths or something like that), but I can't recall ever hearing if they actually work.

BTW: Great job jjhaag! I was hoping that someone would start such a tutorial. Awesome work.

#7 born2c0de

Posted 05 December 2007 - 09:37 AM

Quote

I remember reading about some hardware support for programs that should allow them to detect the presence of a debugger (and take alternate execution paths or something like that), but I can't recall ever hearing if they actually work.

In fact, there is an API function called IsDebuggerPresent() which returns true if the calling program is running under a debugger.

One of the simplest hardware (and software) techniques for detecting a debugger is to check the value of the next byte after the instruction pointed to by the instruction pointer.
Debuggers set a 0xCC byte after each instruction to instruct the processor to stop and hand over control to the debugger (0xCC is the opcode for INT 3, the breakpoint interrupt, which calls the debugger).

Such tricks however, can be easily bypassed. There are plenty more such tricks which are debugger specific which have bypass techniques as well.

Unfortunately, anti-reversing tricks are not as advanced as their adversaries (and I doubt they ever will be) ;)