Parse (defn.)To analyze or separate (input, for example) into more easily processed components.
A Brief History
The parsing of input into more easily processed components has always been a necessary task for computer systems. Back in the days when I started my programming career, input and output for computer programs was achieved through punched cards or paper tape. Both of these primitive mechanisms required a key-punch operator (girl sat at a special punch machine), to type the input. The punch machine would translate each key pressed into punched holes onto either the card or the paper tape. These girls were very good at their job but to get them to count more than a few spaces was difficult.
Imagine a job that required five fields, a customer number, a name, the first and second line of an address and an amount field. The customer number was 10 characters in length, the name was 20 characters in length, the first ans second line of the address was 20 characters in length and the amount field was right justified and 10 characters in length. Each field started at a fixed place so customer number started at column 1, the name field started at column 11, the first line of the address at column 31, the second line at column 51, and the amount field at column 71. If the name was Susan, then you would need 15 spaces entered before the key-punch operator could enter the first line of the address. Obviously, that was unacceptable.
What was done was to provide a format such that each field was terminated (normally by a '[' character), so the input "0123456789[Susan Jones[21 Somewhere St.[Somewhere[2160" punched onto a card or paper tape would be parsed by the input program into the correct fields consisting of 0123456789 for the customer number, Susan Jones padded out to 20 characters, Somewhere St and Somewhere as the first and second address line fields both padded out to 20 characters and finally the 2160 right justified in a numeric amount field.
The first computer program that I ever wrote in COBOL demanded that columns 1-6 contained a line number, column 7 mas reserved for an asterisk that meant the rest of the line was a comment, column 8-12 was for the start of a label (although the label could spill over the column 12 position), column 13 - 71 was for the COBOL statement and column 72-80 was reserved for comments (usually date and initials) for an amendment.
Today we as programmers have a much easier life in that our input to the compiler is fundamentally free format providing we adhere to certain syntactic rules. In C++, it does not matter whether there is zero, one or one hundred spaces between the start of line and say the C++ reserved word break.
Writing a Parser
Overview
Before I explain to you anything about the example we are going to look at, it is crutial that you understand a very important point. In order to implement any parser, no matter how simple, you need to understand the input that you are required to write the parser for. This may sound rather condescending to some of you ... after all if you are reading this tutorial I must give your programming ability some credit! Let me say this. Of the six compilers and interpreters that I have written and sucessfully handed over to customers, once and only once have I fallen into the trap of not understanding the input. That cost me six months of not only my and the customers time, but also a lot of very sleepless nights correcting my mistake.
My choice of input parser for this tutorial was not easy, but I decided that as this tutorial was being placed in the C++ section, that a C++ preprocessor token parser was meaty enough to discuss some of the more interesting features that a parser must possess, without being a gigantuous task taking me months to write (I don't get paid for this ... just Kudos points!). Don't think for one minute, this is going to be way to advanced to understand as it is not.
The preprocessor parser is not sophisticated, and lacks many of the finer nuances necessary for a proper implementation. It is not intended to be directly incorporated into your own C++ compiler, although it does provide you with a start :o)
C++ Preprocessor Tokens
We need to know exactly what tokens our parser is going to produce from the code we parse. A token represents a piece of the code and consists normally of two pieces of information, namely the token type and the token value. I say normally, because some tokens can be described simply by their type (e.g. EOL).
The ISO C++ standards clearly define the tokens we need to process in a C++ preprocessor, and consist of symbol, integer, literal, punctuation and whitespace. In the example I have included, I have extended that to include invalid, const_literal, EOL, and EOF giving a grand total of nine tokens. (These additional tokens have been added for clarity, and to reduce the overall complexity of the code without detracting from the primary purpose, namely that of providing a tutorial).
Look-ahead Mechanism
Parsers will generally need to provide a look-ahead mechanism which permits then to check to see if the token is complete. An example of this would be the sequence 0x12AB. This as we all know would need to be parsed as an integer:0x12AB token, and not four separate tokens (i.e integer:0, symbol:x, integer:12symbol:AB). To do this, if we read a zero as the lead character for an integer sequence, then we need to peek one character forwards to determine if the next character is a 'X' or a 'x'. If the peeked character is a 'x' or 'X' then we can continue processing the sequence as hexadecimal otherwise we simply have either a zero or the start of an octal declaration. The preprocessor does not need to convert such declarations to binary. We are just dealing in strings.
C++ Preprocessor Header File
The header file for my C++ preprocessor parser is listed below.
#ifndef __TOKEN_DEFINED__
#define __TOKEN_DEFINED__
#pragma once
// All tokens must derive from this token type
class base_token
{
public:
typedef enum { t_invalid_token=0, t_symbol,
t_integer, t_literal,
t_const_literal, t_punctuation,
t_whitespace, t_eol, t_eof
} type_of_token;
private:
type_of_token token_type;
public:
base_token(type_of_token token) : token_type(token) { };
virtual int parse_token(fstream& stream, int input_char) = 0;
virtual void print_token() = 0;
};
// A token that may contain a symbol
class symbol_token : public base_token
{
private:
string symbol;
public:
symbol_token() : base_token(t_symbol) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents an integer
class integer_token : public base_token
{
private:
string integer_string;
public:
integer_token() : base_token(t_integer) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents a literal
class literal_token : public base_token
{
private:
string literal_string;
public:
literal_token() : base_token(t_literal) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents a constant literal (e.g. 'A')
class const_literal_token : public base_token
{
private:
string const_literal_string;
public:
const_literal_token() : base_token(t_const_literal) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents a punctuation or separator
class punctuation_token : public base_token
{
private:
string punctuation_string;
public:
punctuation_token() : base_token(t_punctuation) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents whitespace
class whitespace_token : public base_token
{
public:
whitespace_token() : base_token(t_whitespace) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents an eol
class eol_token : public base_token
{
public:
eol_token() : base_token(t_eol) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents an eof
class eof_token : public base_token
{
public:
eof_token() : base_token(t_eof) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// A token that represents an illegal character
class invalid_token : public base_token
{
private:
int invalid_character;
public:
invalid_token() : base_token(t_invalid_token) { };
int parse_token(fstream& stream, int input_char);
void print_token();
};
// The C++ token parser
class token_parser
{
private:
fstream& source_stream;
list<base_token *> token_list;
public:
token_parser(fstream& stream) : source_stream(stream) { };
bool parse_tokens();
void print_tokens();
};
#endif
You will notice that all of my tokens are derived from base_token which simply declares the various token types as an enumerator, the token type that this token represents, the constructor and two pure virtual methods. In the spirit of OOP, it is the token itself that performs the parse of the source once the token has been identified (more about that later), and the token that provides a print so we can check it's doing what it's supposed to. This latter method would not appear in a release version, but it is very useful to incorporate such a routine whilst coding, debugging and testing. Notice also that the whitespace_token, eol_token and eof_token derived classes do not carry any additional information other that the token type held by the base_token class.
The class token_parser declared at the end of the header file is simply the driving mechanism for the creation of the token classes. It's task is to indentify tokens based upon their leading character, create the correct token and then call that tokens parse_token method.
C++ Preprocessor Implemetation File
The following code represents the implementation file for our parser. Move to the end of this code block and we will look at it in more detail.
/*
The following code parses C/C++ source and prints out a list of tokens.
The code is intended to be used as an example of parsing, and nothing
else. Having said that, it must be pointed out that whilst this parser
does correctly identify and process C/C++ tokens, there are tokens
parsed that would not exist when a true C++ compiler tokenised it's
input. Preprocessor tokenization is implemented in phase III of the
compiler front-end and such tokens as #, ## would not be seen (i.e. they
would have been processed in phase III). Furthermore, tokens enclosed in
single quotes would be converted directly into their integer equivalents.
Also, unicode characters and unicode sequences are not processed by this
example.
I have endeavoured to put a few checks in the parser to ensure that
double quoted literals and block comments are terminated before the end
of file, but this code assumes that the source provided is actually
compilable!!
The code has been compiled and tested under the Microsoft VC++ version
10, B2 Release and no guarantee can be made as to it's compatibility with
either other versions of VC++, or any other C++ compiler.
If you have any problems, with this code please do not hesitate to ask.
*/
#include <iostream>
#include <fstream>
#include <string>
#include <list>
#include <cctype>
using namespace std;
#include "C++ Parser.h"
// parse the rest of a symbol
int symbol_token::parse_token(fstream& stream, int input_char) {
symbol = input_char;
while ( true ) {
input_char = stream.get();
if ( isalpha(input_char) || isdigit(input_char) || input_char == '_' ) {
symbol += input_char;
continue;
}
return input_char;
}
}
// print the token to cout
void symbol_token::print_token() {
cout << "TOKEN[\"symbol\" , \"" << symbol << "\"]" << endl;
}
// parse the rest of an integer
int integer_token::parse_token(fstream& stream, int input_char) {
integer_string = input_char;
if ( input_char == '0' )
{
input_char = stream.peek();
if ( input_char == 'X' || input_char == 'x' ) {
integer_string += input_char;
input_char = stream.get();
while ( true ) {
input_char = stream.get();
if ( isxdigit(input_char) ) {
integer_string += input_char;
continue;
}
return input_char;
}
}
}
while ( true ) {
input_char = stream.get();
if ( isdigit(input_char) ) {
integer_string += input_char;
continue;
}
return input_char;
}
}
// print the token to cout
void integer_token::print_token() {
cout << "TOKEN[\"integer\" , " << integer_string << "]" << endl;
}
// parse the rest of a literal
int literal_token::parse_token(fstream& stream, int input_char) {
literal_string.clear();
while ( true ) {
input_char = stream.get();
if ( input_char == '\\' ) {
input_char = stream.peek();
if ( input_char == '\"' || input_char == '\\' ) {
literal_string += '\\';
input_char = stream.get();
literal_string += input_char;
continue;
}
if ( input_char == 0x0A ) {
cout << "error: EOL encountered before closing literal quotes" << endl;
exit(0);
}
if ( input_char == -1 ) {
cout << "error: EOF encountered before closing literal quotes" << endl;
exit(0);
}
literal_string += input_char;
continue;
}
if ( input_char != '\"' && input_char != -1 ) {
literal_string += input_char;
continue;
}
if ( input_char == -1 ) {
cout << "error: EOF encountered before closing literal quotes" << endl;
exit(0);
}
input_char = stream.get();
return input_char;
}
}
// print the token to cout
void literal_token::print_token() {
cout << "TOKEN[\"literal\" , \"" << literal_string << "\"]" << endl;
}
// parse the rest of a literal
int const_literal_token::parse_token(fstream& stream, int input_char) {
const_literal_string.clear();
while ( true ) {
input_char = stream.get();
if ( input_char == '\\' ) {
input_char = stream.peek();
if ( input_char == '\'' || input_char == '\\' ) {
const_literal_string += '\\';
input_char = stream.get();
const_literal_string += input_char;
continue;
}
const_literal_string += input_char;
continue;
}
if ( input_char != '\'' ) {
const_literal_string += input_char;
continue;
}
input_char = stream.get();
return input_char;
}
}
// print the token to cout
void const_literal_token::print_token() {
cout << "TOKEN[\"constant literal\" , \"" << const_literal_string << "\"]" << endl;
}
// parse the rest of a punctuation sequence - this consists of
// ending up with either one, two or three characters in the
// punctuation string. NB: The sequence .. is accepted as a
// punctuation token, but must be rejected by the compiler at
// some later stage.
int punctuation_token::parse_token(fstream& stream, int input_char) {
punctuation_string = input_char;
switch ( input_char ) {
case '!': // Looking for either ! or !=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // != token
}
break;
case '#': // Looking for either # or ##
input_char = stream.peek();
if ( input_char == '#' ) {
input_char = stream.get();
punctuation_string += input_char; // ## token
}
break;
case '%': // Looking for either % or %=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // %= token
}
break;
case '&': // Looking for either &, && or &=
input_char = stream.peek();
if ( input_char == '&' || input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // && token
}
break;
case '*': // Looking for either * or *=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // *= token
}
break;
case '+': // Looking for either +, ++, or +=
input_char = stream.peek();
if ( input_char == '+' || input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // ++ or += token
}
break;
case '-': // Looking for either -, --, -=, ->, ->*
input_char = stream.peek();
if ( input_char == '-' || input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // -- or -= token
}
if ( input_char == '>' ) {
input_char = stream.get();
punctuation_string += input_char; // -> token
input_char = stream.peek();
if ( input_char == '*' ) {
input_char = stream.get();
punctuation_string += input_char; // ->* token
}
}
break;
case '.': // Looking for either ., .. or ...
input_char = stream.peek();
if ( input_char == '.' ) {
input_char = stream.get();
punctuation_string += input_char; // .. token (illegal!)
input_char = stream.peek();
if ( input_char == '.' ) {
input_char = stream.get();
punctuation_string += input_char; // ... token
}
}
break;
case '/': // Looking for either / or /=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // /= token
}
break;
case ':': // Looking for either : or ::
input_char = stream.peek();
if ( input_char == ':' ) {
input_char = stream.get();
punctuation_string += input_char; // :: token
}
break;
case '<': // Looking for either < or <=, <<, or <<=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // <= token
break;
}
if ( input_char == '<' ) {
input_char = stream.get();
punctuation_string += input_char; // << token
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // <<= token
}
}
break;
case '=': // Looking for either = or ==
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // == token
}
break;
case '>': // Looking for either >, >=, >>, or >>=
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // >= token
break;
}
if ( input_char == '>' ) {
input_char = stream.get();
punctuation_string += input_char; // >> token
input_char = stream.peek();
if ( input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // >>= token
}
}
break;
case '|': // Looking for either |, |=, or ||
input_char = stream.peek();
if ( input_char == '|' || input_char == '=' ) {
input_char = stream.get();
punctuation_string += input_char; // || or |= token
}
break;
}
input_char = stream.get();
return input_char;
}
// print the token to cout
void punctuation_token::print_token() {
cout << "TOKEN[\"punctuation\" , \"" << punctuation_string << "\"]" << endl;
}
// parse the whitespace characters
int whitespace_token::parse_token(fstream& stream, int input_char) {
while ( true ) {
input_char = stream.get();
if ( input_char == ' ' || input_char == 0x09 || input_char == 0x0B || input_char == 0x0D ) {
continue;
}
return input_char;
}
}
// print the token to cout
void whitespace_token::print_token() {
cout << "TOKEN[\"whitespace\" , \" \"]" << endl;
}
// parse the eol character
int eol_token::parse_token(fstream& stream, int input_char) {
while ( true ) {
input_char = stream.get();
return input_char;
}
}
// print the token to cout
void eol_token::print_token() {
cout << "TOKEN[\"EOL\"]" << endl;
}
// parse the eof character
int eof_token::parse_token(fstream& stream, int input_char) {
return 0;
}
// print the token to cout
void eof_token::print_token(void) {
cout << "TOKEN[\"EOF\"]" << endl;
}
// parse the invalid character
int invalid_token::parse_token(fstream& stream, int input_char) {
invalid_character = input_char;
input_char = stream.get();
return input_char;
}
// print the token to cout
void invalid_token::print_token(void) {
cout << "TOKEN[\"INVALID\"" << invalid_character << endl;
}
// parse the input source
bool token_parser::parse_tokens() {
base_token * token;
while ( !source_stream.eof() ) {
int input_char = source_stream.get();
// Determine what the leading character is of the sequence,
// create an appropriate token and get the actual token
// class to parse the rest of it (if any)
while ( !source_stream.eof() ) {
// The following do loop is there only because I hate seeing
// if () ... else if () ... else if () ... code!!!
// Hence it's a do ... while ( false ) - single shot
do
{
// Remove any comments from the source
if ( input_char == '/' ) {
int peek_character = source_stream.peek();
if ( peek_character == '/' ) {
// Remove the line comment
while ( peek_character != 0x0A && !source_stream.eof() ) {
peek_character = source_stream.get();
}
token = new(nothrow) eol_token;
break;
}
if ( peek_character == '*' ) {
// Remove a block comment
while ( true ) {
peek_character = source_stream.get();
if ( peek_character == -1 ) {
cout << "error: block comment not terminated before EOF" << endl;
exit(0);
}
if ( peek_character == 0x0A ) {
token = new(nothrow) eol_token;
// Add the token to the end of the list
token_list.push_back(token);
continue;
}
if ( peek_character == '*' ) {
peek_character = source_stream.get();
if ( peek_character == -1 ) {
cout << "error: block comment not terminated before EOF" << endl;
exit(0);
}
if ( peek_character == '/' ) {
// We need to ensure that a whitespace token
// is created to ensure /* */ in the middle
// of a source line is processed correctly.
input_char = source_stream.get();
input_char = ' ';
token = new(nothrow) whitespace_token;
break;
}
}
}
}
}
if ( isalpha(input_char) || input_char == '_' ) {
// Start of a symbol sequence
token = new(nothrow) symbol_token;
break;
}
if ( input_char == 0x0A ) {
// EOL
token = new(nothrow) eol_token;
break;
}
if ( isspace(input_char) ) {
// Start of whitespace sequence
token = new(nothrow) whitespace_token;
break;
}
if ( input_char == '\"' ) {
// Start of literal sequence
token = new(nothrow) literal_token;
break;
}
if ( input_char == '\'' ) {
// Start of constant literal sequence
token = new(nothrow) const_literal_token;
break;
}
if ( isdigit(input_char) ) {
// Start of number sequence
token = new(nothrow) integer_token;
break;
}
if ( ispunct(input_char) ) {
// Start of punctuation sequence
token = new(nothrow) punctuation_token;
break;
}
}
while ( false );
if ( token == NULL ) return false;
input_char = token->parse_token(source_stream, input_char);
// Add the token to the end of the list
token_list.push_back(token);
continue;
}
}
// Add the EOF token to the end of the list
token = new(nothrow) eof_token;
token_list.push_back(token);
return true;
}
// Simply iterate through the list of tokens and print them to cout
// Of course, get the token object to print itself :o)
void token_parser::print_tokens() {
list<base_token *>::iterator iterator;
iterator = token_list.begin();
while( iterator != token_list.end() ) {
(*iterator)->print_token();
++iterator;
}
}
// main program entry point
int main(int argc, char *argv[])
{
// Check to see that we have at least a filename
if ( argc < 2 ) {
cout << "Invalid command line arguments: need filename" << endl;
_exit(0);
}
string filename = argv[argc-1];
fstream source;
// ope the source file
source.open(filename.c_str(), ios_base::in);
if ( source.fail() ) {
cout << "An error has occurred whilst opening "<< filename << endl;
_exit(0);
}
// Create the token list
token_parser parser(source);
parser.parse_tokens();
parser.print_tokens();
}
Parsing a symbol using the symbol_token class[/b]
A symbol within the context of the C++ preprocessor is any group of characters that begins with a character in the range [A..Z] or [a..z] or underscore (_). The method symbol_token::parse_token expects this first character to be provided in the method call.
// parse the rest of a symbol
int symbol_token::parse_token(fstream& stream, int input_char) {
symbol = input_char;
while ( true ) {
input_char = stream.get();
if ( isalpha(input_char) || isdigit(input_char) || input_char == '_' ) {
symbol += input_char;
continue;
}
return input_char;
}
}
The method reads each character from the input stream and providing it is in the range [A..Z], [a..z], [0..9] or underscore, it is appended to the symbol string. When a character is read that cannot be part of a symbol name, it is simply returned to the caller.
Parsing an integer using the integer_token class
An integer within the context of the C++ preprocessor is any group of characters that begins with a character in the range [0..9]. The method integer_token::parse_token expects this first character to be provided in the method call. The code must use look-ahead in order that a '0x' or '0X' declaration does not end the integer sequence after the '0', but continues to parse the remainder of the hexadecimal declaration.
// parse the rest of an integer
int integer_token::parse_token(fstream& stream, int input_char) {
integer_string = input_char;
if ( input_char == '0' )
{
input_char = stream.peek();
if ( input_char == 'X' || input_char == 'x' ) {
integer_string += input_char;
input_char = stream.get();
while ( true ) {
input_char = stream.get();
if ( isxdigit(input_char) ) {
integer_string += input_char;
continue;
}
return input_char;
}
}
}
while ( true ) {
input_char = stream.get();
if ( isdigit(input_char) ) {
integer_string += input_char;
continue;
}
return input_char;
}
}
The method reads each character from the input stream and providing it is in the range [0..9] or [0..9][A-F][a..f] (if we are parsing a hexadecimal), it is appended to the integer string. When a character is read that cannot be part of a integer, it is simply returned to the caller.
Parsing a literal using the literal_token class
Parsing a literal between quotes may at first seem simple enough. We just simply append each character to the literal string until we reach the closing quotation marks. To do that fine if we did not have escape sequences that can define double quotation marks using the backslash double quotation mark sequence (\"). If that was not enough, we also have to contend with the backslash backslash sequence. Consider the following "\"The cat sat on the mat\\". We have to deal with the '\\' sequence as a complete escape sequence before we terminate on the closing quotes (i.e we cannot mistake the characters as backslash followed by a \").
// parse the rest of a literal
int literal_token::parse_token(fstream& stream, int input_char) {
literal_string.clear();
while ( true ) {
input_char = stream.get();
if ( input_char == '\\' ) {
input_char = stream.peek();
if ( input_char == '\"' || input_char == '\\' ) {
literal_string += '\\';
input_char = stream.get();
literal_string += input_char;
continue;
}
if ( input_char == 0x0A ) {
cout << "error: EOL encountered before closing literal quotes" << endl;
exit(0);
}
if ( input_char == -1 ) {
cout << "error: EOF encountered before closing literal quotes" << endl;
exit(0);
}
literal_string += input_char;
continue;
}
if ( input_char != '\"' && input_char != -1 ) {
literal_string += input_char;
continue;
}
if ( input_char == -1 ) {
cout << "error: EOF encountered before closing literal quotes" << endl;
exit(0);
}
input_char = stream.get();
return input_char;
}
}
Notice also the albeit rather harsh termination of the parser if we encounter either the end of file or a new line in the literal before we encounter the closing double quotes. I have written the example is this way, as I do not wish to cover error recovery and back-tracking (this may well appear as another tutorial at some later point in time).
Parsing a quoted constant using the const_literal_token class
This method follows pretty much the same principles as the literal_token class except that I have not included the error checking in the method. This is specifically, so you can do it when you are messing with the code. It would follow exactly the same lines as the literal_token class.
Parsing punctuation using the punctuation_token class
This is quite a lengthy method but follows the same peek rules as we have already seen. Hopefully, it speaks for itself, so I will not go into much more detail. One thing that does need to be pointed out though is the acceptance of the .. (double period). Did the coder intend a single period or a ... (triple period) known as ellipsis. A true compiler may accept the invalid sequence at this stage, and later complain about it, or produce an error now. I have chosen to ignore the error here.
Parsing whitespace, EOL, EOF and invalid characters
These routines speak for themselves.
Driving the parser using the token_parser class
The token_parser class method parse_tokens is the driving force for the preprocessor token parser. If you follow the code, you will notice that much of the code is straight forward, although I need to explain the removal of block comments. In order for subsequent phases of a C++ compiler to provide error messages including line numbers, we cannot simply ignore end of line's just because they are part of a block comment. Here, therefore I have written the code such that when we are in a block comment, any end of line characters are actually tokenised and appended to the list of tokens representing the source being parsed. In this way, we do not lose the correct number of end of line characters and error reporting for a line will have the correct line number.
Printing tokens
The token_parser method print_tokens, simply iterates through the list of tokens and asks each token in turn to print itself to the console, in the form TOKEN["token_type" , "token_contents"]. For example, here is a small sample output for the code
#include <iostream>
int main((int argc, char *argv[]) {
cout << "Hello world!" << endl;
}
Token output is
TOKEN["punctuation" , "#"]
TOKEN["symbol" , "include"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<"]
TOKEN["symbol" , "iostream"]
TOKEN["punctuation" , ">"]
TOKEN["EOL"]
TOKEN["symbol" , "int"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "main"]
TOKEN["punctuation" , "("]
TOKEN["punctuation" , "("]
TOKEN["symbol" , "int"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "argc"]
TOKEN["punctuation" , ","]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "char"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "*"]
TOKEN["symbol" , "argv"]
TOKEN["punctuation" , "["]
TOKEN["punctuation" , "]"]
TOKEN["punctuation" , ")"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "{"]
TOKEN["EOL"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "cout"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<<"]
TOKEN["whitespace" , " "]
TOKEN["literal" , "Hello world!"]
TOKEN["whitespace" , " "]
TOKEN["punctuation" , "<<"]
TOKEN["whitespace" , " "]
TOKEN["symbol" , "endl"]
TOKEN["punctuation" , ";"]
TOKEN["EOL"]
TOKEN["punctuation" , "}"]
TOKEN["EOL"]
TOKEN["EOF"]
Taking the code further
You are free to copy this code and do with it as you please. You might like to modify the code such that it counts the number of individual token types in a given source file. You might like to have a go at extending the code to identify keywords, create a cross-reference program that tells you on which lines symbols are used. The possibilities are endless.
Writing your own parsers for other languages
If you have understood the fundamentals behind this parser, try writing your own parser for say Visual Basic or Java, or even your own language. I am always here to help should you run into problems.





MultiQuote






|