Page 1 of 1

Fundamentals of Parsing Provides insight into parsing source using a working example Rate Topic: ***** 3 Votes

#1 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 540
  • View blog
  • Posts: 1,406
  • Joined: 22-August 09

Posted 03 February 2010 - 10:47 AM

*
POPULAR

Parsing - A tutorial on Source Decomposition

Parse (defn.)To analyze or separate (input, for example) into more easily processed components.

A Brief History

The parsing of input into more easily processed components has always been a necessary task for computer systems. Back in the days when I started my programming career, input and output for computer programs was achieved through punched cards or paper tape. Both of these primitive mechanisms required a key-punch operator (girl sat at a special punch machine), to type the input. The punch machine would translate each key pressed into punched holes onto either the card or the paper tape. These girls were very good at their job but to get them to count more than a few spaces was difficult.

Imagine a job that required five fields, a customer number, a name, the first and second line of an address and an amount field. The customer number was 10 characters in length, the name was 20 characters in length, the first ans second line of the address was 20 characters in length and the amount field was right justified and 10 characters in length. Each field started at a fixed place so customer number started at column 1, the name field started at column 11, the first line of the address at column 31, the second line at column 51, and the amount field at column 71. If the name was Susan, then you would need 15 spaces entered before the key-punch operator could enter the first line of the address. Obviously, that was unacceptable.

What was done was to provide a format such that each field was terminated (normally by a '[' character), so the input "0123456789[Susan Jones[21 Somewhere St.[Somewhere[2160" punched onto a card or paper tape would be parsed by the input program into the correct fields consisting of 0123456789 for the customer number, Susan Jones padded out to 20 characters, Somewhere St and Somewhere as the first and second address line fields both padded out to 20 characters and finally the 2160 right justified in a numeric amount field.

The first computer program that I ever wrote in COBOL demanded that columns 1-6 contained a line number, column 7 mas reserved for an asterisk that meant the rest of the line was a comment, column 8-12 was for the start of a label (although the label could spill over the column 12 position), column 13 - 71 was for the COBOL statement and column 72-80 was reserved for comments (usually date and initials) for an amendment.

Today we as programmers have a much easier life in that our input to the compiler is fundamentally free format providing we adhere to certain syntactic rules. In C++, it does not matter whether there is zero, one or one hundred spaces between the start of line and say the C++ reserved word break.

Writing a Parser

Overview

Before I explain to you anything about the example we are going to look at, it is crutial that you understand a very important point. In order to implement any parser, no matter how simple, you need to understand the input that you are required to write the parser for. This may sound rather condescending to some of you ... after all if you are reading this tutorial I must give your programming ability some credit! Let me say this. Of the six compilers and interpreters that I have written and sucessfully handed over to customers, once and only once have I fallen into the trap of not understanding the input. That cost me six months of not only my and the customers time, but also a lot of very sleepless nights correcting my mistake.

My choice of input parser for this tutorial was not easy, but I decided that as this tutorial was being placed in the C++ section, that a C++ preprocessor token parser was meaty enough to discuss some of the more interesting features that a parser must possess, without being a gigantuous task taking me months to write (I don't get paid for this ... just Kudos points!). Don't think for one minute, this is going to be way to advanced to understand as it is not.

The preprocessor parser is not sophisticated, and lacks many of the finer nuances necessary for a proper implementation. It is not intended to be directly incorporated into your own C++ compiler, although it does provide you with a start :o)

C++ Preprocessor Tokens

We need to know exactly what tokens our parser is going to produce from the code we parse. A token represents a piece of the code and consists normally of two pieces of information, namely the token type and the token value. I say normally, because some tokens can be described simply by their type (e.g. EOL).

The ISO C++ standards clearly define the tokens we need to process in a C++ preprocessor, and consist of symbol, integer, literal, punctuation and whitespace. In the example I have included, I have extended that to include invalid, const_literal, EOL, and EOF giving a grand total of nine tokens. (These additional tokens have been added for clarity, and to reduce the overall complexity of the code without detracting from the primary purpose, namely that of providing a tutorial).

Look-ahead Mechanism

Parsers will generally need to provide a look-ahead mechanism which permits then to check to see if the token is complete. An example of this would be the sequence 0x12AB. This as we all know would need to be parsed as an integer:0x12AB token, and not four separate tokens (i.e integer:0, symbol:x, integer:12symbol:AB). To do this, if we read a zero as the lead character for an integer sequence, then we need to peek one character forwards to determine if the next character is a 'X' or a 'x'. If the peeked character is a 'x' or 'X' then we can continue processing the sequence as hexadecimal otherwise we simply have either a zero or the start of an octal declaration. The preprocessor does not need to convert such declarations to binary. We are just dealing in strings.

C++ Preprocessor Header File

The header file for my C++ preprocessor parser is listed below.

#ifndef __TOKEN_DEFINED__
#define __TOKEN_DEFINED__

#pragma once

// All tokens must derive from this token type

class base_token
{
	public:
		typedef enum { 	t_invalid_token=0, t_symbol,
				t_integer, t_literal, 
				t_const_literal, t_punctuation,
				t_whitespace, t_eol, t_eof
	             } type_of_token;
	private:
		type_of_token token_type;
	public:
		base_token(type_of_token token) : token_type(token) { };
		virtual int parse_token(fstream& stream, int input_char) = 0;
		virtual void print_token() = 0;
};

// A token that may contain a symbol

class symbol_token : public base_token
{
	private:
		string symbol;
	public:
		symbol_token() : base_token(t_symbol) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents an integer

class integer_token : public base_token
{
	private:
		string integer_string;
	public:
		integer_token() : base_token(t_integer) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents a literal

class literal_token : public base_token
{
	private:
		string literal_string;
	public:
		literal_token() : base_token(t_literal) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents a constant literal (e.g. 'A')

class const_literal_token : public base_token
{
	private:
		string const_literal_string;
	public:
		const_literal_token() : base_token(t_const_literal) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents a punctuation or separator

class punctuation_token : public base_token
{
	private:
		string punctuation_string;
	public:
		punctuation_token() : base_token(t_punctuation) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents whitespace

class whitespace_token : public base_token
{
	public:
		whitespace_token() : base_token(t_whitespace) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents an eol

class eol_token : public base_token
{
	public:
		eol_token() : base_token(t_eol) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents an eof

class eof_token : public base_token
{
	public:
		eof_token() : base_token(t_eof) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// A token that represents an illegal character

class invalid_token : public base_token
{
	private:
		int invalid_character;
	public:
		invalid_token() : base_token(t_invalid_token) { };
		int parse_token(fstream& stream, int input_char);
		void print_token();
};

// The C++ token parser

class token_parser
{
	private:
		fstream& source_stream;
		list<base_token *> token_list;
	public:
		token_parser(fstream& stream) : source_stream(stream) { };
		bool parse_tokens();
		void print_tokens();
};

#endif



You will notice that all of my tokens are derived from base_token which simply declares the various token types as an enumerator, the token type that this token represents, the constructor and two pure virtual methods. In the spirit of OOP, it is the token itself that performs the parse of the source once the token has been identified (more about that later), and the token that provides a print so we can check it's doing what it's supposed to. This latter method would not appear in a release version, but it is very useful to incorporate such a routine whilst coding, debugging and testing. Notice also that the whitespace_token, eol_token and eof_token derived classes do not carry any additional information other that the token type held by the base_token class.

The class token_parser declared at the end of the header file is simply the driving mechanism for the creation of the token classes. It's task is to indentify tokens based upon their leading character, create the correct token and then call that tokens parse_token method.

C++ Preprocessor Implemetation File

The following code represents the implementation file for our parser. Move to the end of this code block and we will look at it in more detail.

/*
	The following code parses C/C++ source and prints out a list of tokens.
	The code is intended to be used as an example of parsing, and nothing
	else. Having said that, it must be pointed out that whilst this parser
	does correctly identify and process C/C++ tokens, there are tokens
	parsed that would not exist when a true C++ compiler tokenised it's
	input. Preprocessor tokenization is implemented in phase III of the 
        compiler front-end and such tokens as #, ## would not be seen (i.e. they
        would have been processed in phase III). Furthermore, tokens enclosed in
        single quotes would be converted directly into their integer equivalents.
        Also, unicode characters and unicode sequences are not processed by this
	example.

	I have endeavoured to put a few checks in the parser to ensure that 
	double quoted literals and block comments are terminated before the end
	of file, but this code assumes that the source provided is actually 
	compilable!!

	The code has been compiled and tested under the Microsoft VC++ version
	10, B2 Release and no guarantee can be made as to it's compatibility with
	either other versions of VC++, or any other C++ compiler.

	If you have any problems, with this code please do not hesitate to ask.
*/

#include <iostream>
#include <fstream>
#include <string>
#include <list>
#include <cctype>

using namespace std;
#include "C++ Parser.h"

// parse the rest of a symbol
int symbol_token::parse_token(fstream& stream, int input_char) {
	symbol = input_char;
	while ( true ) {
		input_char = stream.get();
		if ( isalpha(input_char) || isdigit(input_char) || input_char == '_' ) {
			symbol += input_char;
			continue;
		}
		return input_char;
	}
}

// print the token to cout
void symbol_token::print_token() {
	cout << "TOKEN[\"symbol\" , \"" << symbol << "\"]" << endl;
}

// parse the rest of an integer
int integer_token::parse_token(fstream& stream, int input_char) {
	integer_string = input_char;
	if ( input_char == '0' )
	{
		input_char = stream.peek();
		if ( input_char == 'X' || input_char == 'x' ) {
			integer_string += input_char;
			input_char = stream.get();
			while ( true ) {
				input_char = stream.get();
				if ( isxdigit(input_char) ) {
					integer_string += input_char;
					continue;
				}
				return input_char;
			}
		}
	}
	while ( true ) {
		input_char = stream.get();
		if ( isdigit(input_char) ) {
			integer_string += input_char;
			continue;
		}
		return input_char;
	}
}

// print the token to cout
void integer_token::print_token() {
	cout << "TOKEN[\"integer\" , " << integer_string << "]" << endl;
}

// parse the rest of a literal
int literal_token::parse_token(fstream& stream, int input_char) {
	literal_string.clear();
	while ( true ) {
		input_char = stream.get();
		if ( input_char == '\\' ) {
			input_char = stream.peek();
			if ( input_char == '\"' || input_char == '\\' ) {
				literal_string += '\\';
				input_char = stream.get();
				literal_string += input_char;
				continue;
			}
			if ( input_char == 0x0A ) {
				cout << "error: EOL encountered before closing literal quotes" << endl;
				exit(0);
			}
			if ( input_char == -1 ) {
				cout << "error: EOF encountered before closing literal quotes" << endl;
				exit(0);
			}
			literal_string += input_char;
			continue;
		}
		if ( input_char != '\"' && input_char != -1 ) {
			literal_string += input_char;
			continue;
		}
		if ( input_char == -1 ) {
			cout << "error: EOF encountered before closing literal quotes" << endl;
			exit(0);
		}
		input_char = stream.get();
		return input_char;
	}
}

// print the token to cout
void literal_token::print_token() {
	cout << "TOKEN[\"literal\" , \"" << literal_string << "\"]" << endl;
}

// parse the rest of a literal
int const_literal_token::parse_token(fstream& stream, int input_char) {
	const_literal_string.clear();
	while ( true ) {
		input_char = stream.get();
		if ( input_char == '\\' ) {
			input_char = stream.peek();
			if ( input_char == '\'' || input_char == '\\' ) {
				const_literal_string += '\\';
				input_char = stream.get();
				const_literal_string += input_char;
				continue;
			}
			const_literal_string += input_char;
			continue;
		}
		if ( input_char != '\'' ) {
			const_literal_string += input_char;
			continue;
		}
		input_char = stream.get();
		return input_char;
	}
}

// print the token to cout
void const_literal_token::print_token() {
	cout << "TOKEN[\"constant literal\" , \"" << const_literal_string << "\"]" << endl;
}

// parse the rest of a punctuation sequence - this consists of
// ending up with either one, two or three characters in the
// punctuation string. NB: The sequence .. is accepted as a 
// punctuation token, but must be rejected by the compiler at
// some later stage.
int punctuation_token::parse_token(fstream& stream, int input_char) {
	punctuation_string = input_char;
	switch ( input_char ) {
	case '!': // Looking for either ! or !=
		input_char = stream.peek();
		if ( input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // != token
		}
		break;
	case '#': // Looking for either # or ##
		input_char = stream.peek();
		if ( input_char == '#' ) {
			input_char = stream.get();
			punctuation_string += input_char; // ## token
		}
		break;
	case '%': // Looking for either % or %=
		input_char = stream.peek();
		if ( input_char == '=' ) {			
			input_char = stream.get();
			punctuation_string += input_char; // %= token
		}
		break;
	case '&': // Looking for either &, && or &=
		input_char = stream.peek();
		if ( input_char == '&' || input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // && token
		}
		break;
	case '*': // Looking for either * or *=
		input_char = stream.peek();
		if ( input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // *= token
		}
		break;
	case '+': // Looking for either +, ++, or +=
		input_char = stream.peek();
		if ( input_char == '+' || input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // ++ or += token
		}
		break;
	case '-': // Looking for either -, --, -=, ->, ->*
		input_char = stream.peek();
		if ( input_char == '-' || input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // -- or -= token
		}
		if ( input_char == '>' ) {
			input_char = stream.get();
			punctuation_string += input_char; // -> token
			input_char = stream.peek();
			if ( input_char == '*' ) {
				input_char = stream.get();
				punctuation_string += input_char; // ->* token
			}
		}
		break;
	case '.': // Looking for either ., .. or ...
		input_char = stream.peek();
		if ( input_char == '.' ) {
			input_char = stream.get();
			punctuation_string += input_char; // .. token (illegal!)
			input_char = stream.peek();
			if ( input_char == '.' ) {
				input_char = stream.get();
				punctuation_string += input_char; // ... token
			}
		}
		break;
	case '/': // Looking for either / or /=
		input_char = stream.peek();
		if ( input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // /= token
		}
		break;
	case ':': // Looking for either : or ::
		input_char = stream.peek();
		if ( input_char == ':' ) {
			input_char = stream.get();
			punctuation_string += input_char; // :: token
		}
		break;
	case '<': // Looking for either < or <=, <<, or <<=
		input_char = stream.peek();
		if ( input_char == '='  ) {
			input_char = stream.get();
			punctuation_string += input_char; // <= token
			break;
		}
		if ( input_char == '<'  ) {
			input_char = stream.get();
			punctuation_string += input_char; // << token
			input_char = stream.peek();
			if ( input_char == '=' ) {
				input_char = stream.get();
				punctuation_string += input_char; // <<= token
			}
		}
		break;
	case '=': // Looking for either = or ==
		input_char = stream.peek();
		if ( input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // == token
		}
		break;
	case '>': // Looking for either >, >=, >>, or >>=
		input_char = stream.peek();
		if ( input_char == '='  ) {
			input_char = stream.get();
			punctuation_string += input_char; // >= token
			break;
		}
		if ( input_char == '>'  ) {
			input_char = stream.get();
			punctuation_string += input_char; // >> token
			input_char = stream.peek();
			if ( input_char == '=' ) {
				input_char = stream.get();
				punctuation_string += input_char; // >>= token
			}
		}
		break;
	case '|': // Looking for either |, |=, or || 
		input_char = stream.peek();
		if ( input_char == '|' || input_char == '=' ) {
			input_char = stream.get();
			punctuation_string += input_char; // || or |= token
		}
		break;
	}
	input_char = stream.get();
	return input_char;
}

// print the token to cout
void punctuation_token::print_token() {
	cout << "TOKEN[\"punctuation\" , \"" << punctuation_string << "\"]" << endl;
}

// parse the whitespace characters
int whitespace_token::parse_token(fstream& stream, int input_char) {
	while ( true ) {
		input_char = stream.get();
		if ( input_char == ' ' || input_char == 0x09 || input_char == 0x0B || input_char == 0x0D ) {
			continue;
		}
		return input_char;
	}
}

// print the token to cout
void whitespace_token::print_token() {
	cout << "TOKEN[\"whitespace\" , \" \"]" << endl;
}

// parse the eol character
int eol_token::parse_token(fstream& stream, int input_char) {
	while ( true ) {
		input_char = stream.get();
		return input_char;
	}
}

// print the token to cout
void eol_token::print_token() {
	cout << "TOKEN[\"EOL\"]" << endl;
}

// parse the eof character
int eof_token::parse_token(fstream& stream, int input_char) {
	return 0;
}

// print the token to cout
void eof_token::print_token(void) {
	cout << "TOKEN[\"EOF\"]" << endl;
}

// parse the invalid character
int invalid_token::parse_token(fstream& stream, int input_char) {
	invalid_character = input_char;
	input_char = stream.get();
	return input_char;
}

// print the token to cout
void invalid_token::print_token(void) {
	cout << "TOKEN[\"INVALID\"" << invalid_character << endl;
}

// parse the input source
bool token_parser::parse_tokens() {
	base_token * token;

	while ( !source_stream.eof() ) {
		int input_char = source_stream.get();

		// Determine what the leading character is of the sequence,
		// create an appropriate token and get the actual token
		// class to parse the rest of it (if any)
		
		while ( !source_stream.eof() ) {
			// The following do loop is there only because I hate seeing
			// if () ... else if () ... else if () ... code!!!
			// Hence it's a do ... while ( false ) - single shot
			do
			{
				// Remove any comments from the source
				if ( input_char == '/' ) {
					int peek_character = source_stream.peek();
					if ( peek_character == '/' ) {
						// Remove the line comment
						while ( peek_character != 0x0A && !source_stream.eof() ) {
							peek_character = source_stream.get();
						}
						token = new(nothrow) eol_token;
						break;
					}
					if ( peek_character == '*' ) {
						// Remove a block comment
						while ( true ) {
							peek_character = source_stream.get();
							if ( peek_character == -1 ) {
								cout << "error: block comment not terminated before EOF" << endl;
								exit(0);
							}
							if ( peek_character == 0x0A ) {
								token = new(nothrow) eol_token;
								// Add the token to the end of the list
								token_list.push_back(token);
								continue;
							}
							if ( peek_character == '*' ) {
								peek_character = source_stream.get();
								if ( peek_character == -1 ) {
									cout << "error: block comment not terminated before EOF" << endl;
									exit(0);
								}
								if ( peek_character == '/' ) {
									// We need to ensure that a whitespace token
									// is created to ensure /* */ in the middle
									// of a source line is processed correctly.
									input_char = source_stream.get();
									input_char = ' ';
									token = new(nothrow) whitespace_token;
									break;
								}
							}
						}
					}
				}
				if ( isalpha(input_char) || input_char == '_' ) {
					// Start of a symbol sequence
					token = new(nothrow) symbol_token;
					break;
				}
				if ( input_char == 0x0A ) {
					// EOL
					token = new(nothrow) eol_token;
					break;
				}
				if ( isspace(input_char) ) {
					// Start of whitespace sequence
					token = new(nothrow) whitespace_token;
					break;
				}
				if ( input_char == '\"' ) {
					// Start of literal sequence
					token = new(nothrow) literal_token;
					break;
				}
				if ( input_char == '\'' ) {
					// Start of constant literal sequence
					token = new(nothrow) const_literal_token;
					break;
				}
				if ( isdigit(input_char) ) {
					// Start of number sequence
					token = new(nothrow) integer_token;
					break;
				}
				if ( ispunct(input_char) ) {
					// Start of punctuation sequence
					token = new(nothrow) punctuation_token;
					break;
				}
			}
			while ( false );
			if ( token == NULL ) return false;
			input_char = token->parse_token(source_stream, input_char);
			// Add the token to the end of the list
			token_list.push_back(token);
			continue;
		}
	}
	// Add the EOF token to the end of the list
	token = new(nothrow) eof_token;
	token_list.push_back(token);
	return true;
}

// Simply iterate through the list of tokens and print them to cout
// Of course, get the token object to print itself :o)
void token_parser::print_tokens() {
	list<base_token *>::iterator iterator;
	iterator = token_list.begin();
	while( iterator != token_list.end() ) {
		(*iterator)->print_token();
		++iterator;
	}
}

// main program entry point
int main(int argc, char *argv[])
{
	// Check to see that we have at least a filename
	if ( argc < 2 ) {
		cout << "Invalid command line arguments: need filename" << endl;
		_exit(0);
	}
	string filename = argv[argc-1];

	fstream source;

	// ope the source file
	source.open(filename.c_str(), ios_base::in);
	if ( source.fail() ) {
		cout << "An error has occurred whilst opening "<< filename << endl;
		_exit(0);
	}

	// Create the token list
	token_parser parser(source);
	parser.parse_tokens();
	parser.print_tokens();
}



Parsing a symbol using the symbol_token class[/b]

A symbol within the context of the C++ preprocessor is any group of characters that begins with a character in the range [A..Z] or [a..z] or underscore (_). The method symbol_token::parse_token expects this first character to be provided in the method call.

// parse the rest of a symbol
int symbol_token::parse_token(fstream& stream, int input_char) {
	symbol = input_char;
	while ( true ) {
		input_char = stream.get();
		if ( isalpha(input_char) || isdigit(input_char) || input_char == '_' ) {
			symbol += input_char;
			continue;
		}
		return input_char;
	}
}



The method reads each character from the input stream and providing it is in the range [A..Z], [a..z], [0..9] or underscore, it is appended to the symbol string. When a character is read that cannot be part of a symbol name, it is simply returned to the caller.

Parsing an integer using the integer_token class

An integer within the context of the C++ preprocessor is any group of characters that begins with a character in the range [0..9]. The method integer_token::parse_token expects this first character to be provided in the method call. The code must use look-ahead in order that a '0x' or '0X' declaration does not end the integer sequence after the '0', but continues to parse the remainder of the hexadecimal declaration.

// parse the rest of an integer
int integer_token::parse_token(fstream& stream, int input_char) {
	integer_string = input_char;
	if ( input_char == '0' )
	{
		input_char = stream.peek();
		if ( input_char == 'X' || input_char == 'x' ) {
			integer_string += input_char;
			input_char = stream.get();
			while ( true ) {
				input_char = stream.get();
				if ( isxdigit(input_char) ) {
					integer_string += input_char;
					continue;
				}
				return input_char;
			}
		}
	}
	while ( true ) {
		input_char = stream.get();
		if ( isdigit(input_char) ) {
			integer_string += input_char;
			continue;
		}
		return input_char;
	}
}



The method reads each character from the input stream and providing it is in the range [0..9] or [0..9][A-F][a..f] (if we are parsing a hexadecimal), it is appended to the integer string. When a character is read that cannot be part of a integer, it is simply returned to the caller.

Parsing a literal using the literal_token class

Parsing a literal between quotes may at first seem simple enough. We just simply append each character to the literal string until we reach the closing quotation marks. To do that fine if we did not have escape sequences that can define double quotation marks using the backslash double quotation mark sequence (\"). If that was not enough, we also have to contend with the backslash backslash sequence. Consider the following "\"The cat sat on the mat\\". We have to deal with the '\\' sequence as a complete escape sequence before we terminate on the closing quotes (i.e we cannot mistake the characters as backslash followed by a \").

// parse the rest of a literal
int literal_token::parse_token(fstream& stream, int input_char) {
	literal_string.clear();
	while ( true ) {
		input_char = stream.get();
		if ( input_char == '\\' ) {
			input_char = stream.peek();
			if ( input_char == '\"' || input_char == '\\' ) {
				literal_string += '\\';
				input_char = stream.get();
				literal_string += input_char;
				continue;
			}
			if ( input_char == 0x0A ) {
				cout << "error: EOL encountered before closing literal quotes" << endl;
				exit(0);
			}
			if ( input_char == -1 ) {
				cout << "error: EOF encountered before closing literal quotes" << endl;
				exit(0);
			}
			literal_string += input_char;
			continue;
		}
		if ( input_char != '\"' && input_char != -1 ) {
			literal_string += input_char;
			continue;
		}
		if ( input_char == -1 ) {
			cout << "error: EOF encountered before closing literal quotes" << endl;
			exit(0);
		}
		input_char = stream.get();
		return input_char;
	}
}



Notice also the albeit rather harsh termination of the parser if we encounter either the end of file or a new line in the literal before we encounter the closing double quotes. I have written the example is this way, as I do not wish to cover error recovery and back-tracking (this may well appear as another tutorial at some later point in time).

Parsing a quoted constant using the const_literal_token class

This method follows pretty much the same principles as the literal_token class except that I have not included the error checking in the method. This is specifically, so you can do it when you are messing with the code. It would follow exactly the same lines as the literal_token class.

Parsing punctuation using the punctuation_token class

This is quite a lengthy method but follows the same peek rules as we have already seen. Hopefully, it speaks for itself, so I will not go into much more detail. One thing that does need to be pointed out though is the acceptance of the .. (double period). Did the coder intend a single period or a ... (triple period) known as ellipsis. A true compiler may accept the invalid sequence at this stage, and later complain about it, or produce an error now. I have chosen to ignore the error here.

Parsing whitespace, EOL, EOF and invalid characters

These routines speak for themselves.

Driving the parser using the token_parser class

The token_parser class method parse_tokens is the driving force for the preprocessor token parser. If you follow the code, you will notice that much of the code is straight forward, although I need to explain the removal of block comments. In order for subsequent phases of a C++ compiler to provide error messages including line numbers, we cannot simply ignore end of line's just because they are part of a block comment. Here, therefore I have written the code such that when we are in a block comment, any end of line characters are actually tokenised and appended to the list of tokens representing the source being parsed. In this way, we do not lose the correct number of end of line characters and error reporting for a line will have the correct line number.

Printing tokens

The token_parser method print_tokens, simply iterates through the list of tokens and asks each token in turn to print itself to the console, in the form TOKEN["token_type" , "token_contents"]. For example, here is a small sample output for the code

#include <iostream>
int main((int argc, char *argv[]) {
  cout << "Hello world!" << endl;
}



Token output is

TOKEN["punctuation" , "#"]
TOKEN["symbol" , "include"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<"]
TOKEN["symbol" , "iostream"]
TOKEN["punctuation" , ">"]
TOKEN["EOL"]
TOKEN["symbol" , "int"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "main"]
TOKEN["punctuation" , "("]
TOKEN["punctuation" , "("]
TOKEN["symbol" , "int"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "argc"]
TOKEN["punctuation" , ","]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "char"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "*"]
TOKEN["symbol" , "argv"]
TOKEN["punctuation" , "["]
TOKEN["punctuation" , "]"]
TOKEN["punctuation" , ")"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "{"]
TOKEN["EOL"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "cout"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<<"]
TOKEN["whitespace" , " "]
TOKEN["literal" , "Hello world!"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<<"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "endl"]
TOKEN["punctuation" , ";"]
TOKEN["EOL"]
TOKEN["punctuation" , "}"]
TOKEN["EOL"]
TOKEN["EOF"]



Taking the code further

You are free to copy this code and do with it as you please. You might like to modify the code such that it counts the number of individual token types in a given source file. You might like to have a go at extending the code to identify keywords, create a cross-reference program that tells you on which lines symbols are used. The possibilities are endless.

Writing your own parsers for other languages

If you have understood the fundamentals behind this parser, try writing your own parser for say Visual Basic or Java, or even your own language. I am always here to help should you run into problems.

Is This A Good Question/Topic? 13
  • +

Replies To: Fundamentals of Parsing

#2 athlon32  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 116
  • View blog
  • Posts: 363
  • Joined: 20-August 08

Posted 06 February 2010 - 01:08 PM

Great Tutorial! After briefly reading this, I went out and wrote a parser for a different language completely from scratch :)

This is kinda going out on a stretch, but could make some other tutorials on compilers/interpreters? I'm interested in the subject, and you're such a great author :)

This post has been edited by athlon32: 06 February 2010 - 01:09 PM

Was This Post Helpful? 0
  • +
  • -

#3 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 540
  • View blog
  • Posts: 1,406
  • Joined: 22-August 09

Posted 06 February 2010 - 07:43 PM

Your wish is my command!! :bigsmile:

I have already submitted two parts of a five part tutorial on Writing an Interpreter.
Was This Post Helpful? 2
  • +
  • -

#4 sm5312  Icon User is offline

  • D.I.C Head

Reputation: 2
  • View blog
  • Posts: 106
  • Joined: 09-April 09

Posted 08 May 2010 - 11:45 AM

Thank you Ray for this great tutorial.
but is this really a 'parser' or a 'lexical analyser'? isnt this more towards a LA ?
Was This Post Helpful? 0
  • +
  • -

#5 chance528  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 27
  • Joined: 02-September 10

Posted 06 September 2010 - 09:59 AM

View PostGuest, on 06 September 2010 - 08:57 AM, said:

View PostJackOfAllTrades, on 06 September 2010 - 05:37 AM, said:

Part of learning is trial and error...did you try what I suggested?


Yeah, I did. I recompiled and just received one error. It was a reference to WinMain@16, because I didn't really make a main() function yet.

Thanks.

chance


Oh! Sorry, JackOfAllTrades. I posted that. I forgot to log myself in.
Was This Post Helpful? 0
  • +
  • -

#6 dastick  Icon User is offline

  • D.I.C Head

Reputation: 0
  • View blog
  • Posts: 56
  • Joined: 18-March 11

Posted 28 March 2011 - 12:14 PM

Ive been learning c++ for 7 months now and am comfortable with it, but when I look at this it is realy daunting even when I look at your explanations. It's not you thats typed it bad it's me! by the way, is there any parsers simpler than this
Was This Post Helpful? 0
  • +
  • -

#7 vividexstance  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 651
  • View blog
  • Posts: 2,233
  • Joined: 31-December 10

Posted 10 April 2011 - 12:39 PM

The answer to your question would depend entirely on what exactly you're parsing. If it's going to be for a programming language, then it most likely will be complicated.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1