Page 1 of 1

Writing an Interpreter - Part II BASIC source line parsing Rate Topic: ***** 1 Votes

#1 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 540
  • View blog
  • Posts: 1,406
  • Joined: 22-August 09

Posted 05 February 2010 - 08:53 AM

Writing An Interpreter - Part II

The Basic Language Parser

For those of you that have just jumped in here, please make sure that you have read my tutorial on the Fundamentals of Parsing and Writing An Interpreter - Part I.

The second part of my tutorial covers the Basic language parser. This is a very simplified form of the C++ parser provided in my previous tutorial, as the Basic language is pretty basic :o)

Anybody who has not seen Basic before, I would firstly like to know where you have been all this time? In a nutshell, Basic is a reasonably powerful (not to offend VB users) programming language which provides very simple constructs. In my Basic, you do not need to declare variables before you assign a value to them, and conversion between types is automatic.

All string variables end with a dollar sign ($), so A$, string$ etc are all examples of string variables. Integer variables end with a percent sign (%), so Value%, A% etc are all examples of integer variables.
Finally, real numbers are variables without any special trailing character, so A, FLoat etc are all real.

Basic does not care about case, so Value, VALue, Value, vaLUE are all the same variable.

It should also be noted that A, A% and A$ are three distinct variables, not the same variable defined as real, integer and string within the same program.

Keywords

This Basic will contain the following keywords:
LET, FOR, TO, STEP, NEXT, DIM, IF, THEN, ELSE, ENDIF, READ, DATA, END, FUNCTION, RETURN, PROGRAM,INPUT, PRINT, INT, LEFT$, RIGHT$, MID$, CHR, ASC, PROCEDURE, RANDOMIZE, REM and RND. I will not go into any detail as to what these keywords do, as this tutorial is concerned with parsing the input into tokens.

Modifications to the VETLINE Line Editor

In the previous tutorial, for those of you that studied it, you will have noticed that line input was held in a vector that contained strings.

#ifndef __LINE_EDITOR_INCLUDED__
#define __LINE_EDITOR_INCLUDED__

#pragma once

class line_editor {
        private:
                vector<string> line_table;
                vector<string>::iterator iterator;
                long current_line;
        public:
                line_editor() : current_line(-1) { };
                void command(string command);
};

#endif



This has been extended to cater for strings and tokens as shown below.

#ifndef __LINE_EDITOR_INCLUDED__
#define __LINE_EDITOR_INCLUDED__

#pragma once

class line {
	public:
		string source_line;                // Source line as input by user
		list<base_token *> token_list;     // A list of tokens that are found in the source line
};

class line_editor {
	private:
		vector<line *> line_table;
		vector<line *>::iterator iterator;
		line_parser * basic_parser;
		long current_line;
	public:
		line_editor() : current_line(-1) { basic_parser = new(nothrow) line_parser; };
		bool command(string command);
};

#endif



When the user inserts a line of text and presses enter, the parser parses that line of input into tokens in exactly the same way as the C++ example in my previous tutorial. Notice in the class line
we store not only the source line itself, but also the tokens in a list. When the complete line has been parsed and all tokens extracted, the tokens themselves will be parsed for syntactic correctness, but this will be discussed in Part III of this tutorial.

Basic Tokens

Basic consists of five token types, namely symbol, integer, literal, punctuation and keyword. The first four will not be alien to those of you that have read the Fundamentals of Parsing tutorial. The keyword token is however new to us. When the token parser is parsing for tokens, any symbol token is checked to see if the symbol is a keyword, a keyword token generated, and the symbol token deleted. It is the keyword token that is then added to the keyword list. The code that performs that little magic act is provided below.

	if ( token->get_type() == base_token::t_symbol ) {
		keyword_token * this_token = new(nothrow) keyword_token;
		if ( this_token->parse_token(static_cast<symbol_token *>(token)->get_symbol()) == NULL ) {
			delete token;
			token = this_token;
		}
		else
			delete this_token;
	}



Error Recovery

During the parse, the only error that can be reported is if a literal is not ended with double quotes before the end of the line is reached. An error message is reported and the user must input the line again. (A neat feature of the Windows command prompt is that if you use the up-arrow, the line reappears and can be edited).

The LineEditor.h file has been included above, here is the BasicParser.h file.

#ifndef __TOKEN_DEFINED__
#define __TOKEN_DEFINED__

class line;

// All tokens must derive from this token type

class base_token
{
	public:
		typedef enum { 	t_invalid_token=0, t_symbol, t_integer, t_literal, 
						t_punctuation, t_keyword
		             } type_of_token;
	private:
		type_of_token token_type;
	public:
		base_token(type_of_token token) : token_type(token) { };
		type_of_token get_type() { return token_type; };
		virtual const char *  parse_token(const char * s) = 0;
};

// A token that may contain a symbol

class symbol_token : public base_token
{
	private:
		string symbol;
	public:
		symbol_token() : base_token(t_symbol) { };
		const char * parse_token(const char * s);
		const char * get_symbol() { return symbol.c_str(); };
};

// A token that will contain a keyword

class keyword_token : public base_token
{
	private:
		typedef enum { 	t_let=0, t_for, t_to, t_step, t_next, t_dim,
						t_if, t_then, t_else, t_endif, t_read,
						t_data, t_end, t_function, t_return, t_program,
						t_input, t_print, t_int, t_left_dollar,
						t_right_dollar, t_mid_dollar, t_chr, t_asc, 
						t_procedure, t_end_procedure, t_randomize, t_rem,
						t_rnd 
		     		 }  type_of_keyword;
		static char * keywords[];
		type_of_keyword keyword;
	public:
		keyword_token() : base_token(t_keyword) { };
		const char * parse_token(const char * s);
};

// A token that represents an integer

class integer_token : public base_token
{
	private:
		string integer_string;
	public:
		integer_token() : base_token(t_integer) { };
		const char * parse_token(const char * s);
};

// A token that represents a literal

class literal_token : public base_token
{
	private:
		string literal_string;
	public:
		literal_token() : base_token(t_literal) { };
		const char * parse_token(const char * s);
};

// A token that represents a punctuation or separator

class punctuation_token : public base_token
{
	private:
		string punctuation_string;
	public:
		punctuation_token() : base_token(t_punctuation) { };
		const char * parse_token(const char * s);
};

// The BASIC token line

class line_parser
{
	public:
		bool parse_string(line * source_line);
};

#endif



Here is the LineEditor.cpp file.

// VEry Tiny LINe Editor - version 1.1
#include <fstream>
#include <iostream>
#include <string>
#include <list>
#include <vector>

using namespace std;

#include <cstdio>
#include "BASIC Parser.h"
#include "LineEditor.h"

// VETLINE editor - this is all straight forward code
bool line_editor::command(string command_line) {
	char line_buffer[32];
	unsigned int command_line_length;
	unsigned char command_char;
	int command_count = 0;

	// Get the command from the command line
	command_line_length = command_line.length();
	if ( command_line_length == 0 ) return false;
	
	// Command to upper case character
	command_char = (command_line[0]&0xDF);
	
	// Get the command count if there is one
	if ( command_line_length >= 1 ) {
		unsigned int char_index = 1;
		while ( char_index < command_line_length ) {
			unsigned char this_char = command_line[char_index++];
			if ( this_char >= '0' && this_char <= '9' ) {
				command_count = (command_count*10) + (this_char - '0');
			}
		}
	}
	
	// Default to a command count of one if the user
	// has not given us a count
	if ( command_count == 0 ) command_count = 1;
	
	// If the command is not insert and the list is empty
	// then inform the user of such and return to caller
	if ( current_line == -1 && command_char != 'I' ) {
		cout << "The file is empty" << endl;
		return false;
	}

	// Process the command based upon the letter entered
	// We simply ignore commands that we do not recognise
	// for this version
	switch ( command_char ) {
	case 'Q': // Quit VETLINE
		{
			return true;
		}
	case 'U': // More up one or more lines
		{
			if ( current_line == 0 ) {
				cout << "At top of file" << endl;
				return false;
			}
			current_line -= command_count;
			if ( current_line < 0 ) current_line = 0;
			sprintf_s(line_buffer, "%8d* ", current_line + 1);
			cout << line_buffer << line_table[current_line]->source_line << endl;
			return false;
		}
	case 'D': // More down one or more lines
		{
			if ( (current_line+1) == line_table.size() ) {
				cout << "At end of file" << endl;
				return false;
			}
			current_line += command_count;
			if ( (current_line+1) >= (long)line_table.size() ) current_line = line_table.size()-1;
			sprintf_s(line_buffer, "%8d* ", current_line + 1);
			cout << line_buffer << line_table[current_line]->source_line << endl;
			return false;
		}
		case 'I':  // Insert lines
		{
			string insertion_string;
			line * insertion_line;
			int line_number = current_line+1;
			int start_number = line_number;
			
			while ( true ) {
				current_line++;
				sprintf_s(line_buffer, "%8d  ", line_number + 1);
				cout << line_buffer;
				getline(cin, insertion_string);
				if ( cin.eof() ) break;
				insertion_line = new(nothrow) line;
				insertion_line->source_line = insertion_string;
				if ( basic_parser->parse_string(insertion_line) == true ) {
					line_table.insert(line_table.begin()+current_line, insertion_line);
					line_number++;
				}
				else {
					current_line--;
				}
			}
			cin.clear();
			current_line = start_number;
			return false;
		}
	case 'R': // Remove one line at a time (count is ignored for now)
		{
			if ( (current_line+1) == line_table.size() ) {
				if ( current_line == 0 ) {
					line_table.erase(line_table.begin()+current_line);
					current_line = -1;
					return false;
				}
				cout << "At end of file" << endl;
				return false;
			}
			line_table.erase(line_table.begin()+current_line);
			return false;
		}
	case 'S': // Show lines (count is ignored for now)
		{
			string list_string;
			int line_number = 1;

			iterator = line_table.begin();
    		while( iterator != line_table.end() ) {
				list_string = (*iterator)->source_line;
				if ( iterator == line_table.begin()+current_line )
					sprintf_s(line_buffer, "%8d* ", line_number++);
				else
					sprintf_s(line_buffer, "%8d  ", line_number++);
    	  		cout << line_buffer << list_string << endl;
    	  		++iterator;
			}
			return false;
		}
	case 'C': // Show current line
		{
			sprintf_s(line_buffer, "%8d  ", current_line + 1);
			cout << line_buffer << line_table[current_line]->source_line << endl;
			return false;
		}
	case 'T': // Move to top of source
		{
			current_line = -1;
			return false;
		}
	case 'B': // Move to bottom of source
		{
			current_line = line_table.size();
			return false;
		}
    }
	return false;
}



Here is the Basic Parser.cpp file.

/*
	The following code parses BASIC source and prints out a list of tokens.
	The code is intended to be used as a part of a BASIC interpreter, and
	nothing	else. Having said that, it must be pointed out that whilst this
	parser does correctly identify and store BASIC tokens.

	The code has been compiled and tested under the Microsoft VC++ version
	10, B2 Release and no guarantee can be made as to it's compatibility with
	either other versions of VC++, or any other C++ compiler.

	If you have any problems, with this code please do not hesitate to ask.
*/

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <list>
#include <cctype>

using namespace std;
#include "BASIC Parser.h"
#include "LineEditor.h"

char * keyword_token::keywords[] = { "LET", "FOR", "TO", "STEP", "NEXT", "DIM",
				     "IF", "THEN", "ELSE", "ENDIF", "READ",
				     "DATA", "END", "FUNCTION", "RETURN", "PROGRAM",
				     "INPUT", "PRINT", "INT", "LEFT$", "RIGHT$",
				     "MID$", "CHR", "ASC", "PROCEDURE", "RANDOMIZE",
				     "REM", "RND" };
#define keyword_count sizeof(keywords)/sizeof(char *)

// parse the rest of a symbol
const char * symbol_token::parse_token(const char * s) {
	symbol = *s++;
	while ( true ) {
		unsigned char input_char = *s++;
		if ( isalpha(input_char) || isdigit(input_char) ) {
			symbol += input_char;
			continue;
		}
		if (  input_char == '$' || input_char == '%' ) {
			symbol += input_char;
			return s++;
		}
		return s;
	}
}

// compare the parameter s for a keyword
const char * keyword_token::parse_token(const char * s) {
	int keyword_index;
	for ( keyword_index = 0; keyword_index < keyword_count; keyword_index++ ) {
		if ( strcmp(s, keywords[keyword_index]) == 0 ) {
			keyword = (type_of_keyword)keyword_index;
			return NULL;
		}
	}
	return s;
}

// parse the rest of an integer
const char * integer_token::parse_token(const char * s) {
	unsigned char input_char;

	integer_string = *s++;
	while ( true ) {
		input_char = *s++;
		if ( isdigit(input_char) ) {
			integer_string += input_char;
			continue;
		}
		return s;
	}
}

// parse the rest of a literal
const char * literal_token::parse_token(const char * s) {
	unsigned char input_char;

	literal_string.clear();
	s++;
	while ( true ) {
		input_char = *s++;
		if ( input_char == 0 ) {
			cout << "          error: missing literal quotes" << endl;
			return NULL;
		}
		if ( input_char != '\"' ) {
			literal_string += input_char;
			continue;
		}
		return s;
	}
}

// parse the rest of a punctuation sequence - this consists of
// ending up with either one, two or three characters in the
// punctuation string. NB: The sequence .. is accepted as a 
// punctuation token, but must be rejected by the compiler at
// some later stage.
const char * punctuation_token::parse_token(const char * s) {
	unsigned char input_char= *s++;

	punctuation_string = input_char;
	switch ( input_char ) {
	case '<': // Looking for either <, <> or <=
		input_char = *(s+1);
		if ( input_char == '>'  || input_char == '=' ) {
			input_char = *s++;
			punctuation_string += input_char; // <> or <= token
			break;
		}
		break;
	case '>': // Looking for either > or >=
		input_char = *(s+1);
		if ( input_char == '=' ) {
			input_char = *s++;
			punctuation_string += input_char; // >= token
			break;
		}
		break;
	}
	return s++;
}

// parse the input source
bool line_parser::parse_string(line * source_line) {
	base_token * token;
	
	const char * character_pointer = source_line->source_line.c_str();
	source_line->token_list.clear();
	int line_length = source_line->source_line.length();
	while ( true ) {
		int input_char;

		if ( *character_pointer == 0 ) break;
		while ( *character_pointer == ' ' || *character_pointer == 0x09 ) character_pointer++;
		input_char = *character_pointer;
		// Determine what the leading character is of the sequence,
		// create an appropriate token and get the actual token
		// class to parse the rest of it (if any)

		while ( true ) {
			if ( isalpha(input_char) ) {
				// Start of a symbol sequence
				token = new(nothrow) symbol_token;
				break;
			}
			if ( input_char == '\"' ) {
				// Start of literal sequence
				token = new(nothrow) literal_token;
				break;
			}
			if ( isdigit(input_char) ) {
				// Start of number sequence
				token = new(nothrow) integer_token;
				break;
			}
			if ( ispunct(input_char) ) {
				// Start of punctuation sequence
				token = new(nothrow) punctuation_token;
				break;
			}
			break;
		}
		if ( token == NULL ) return false;
		character_pointer = token->parse_token(character_pointer);
		if ( character_pointer == NULL ) return false;
		if ( token->get_type() == base_token::t_symbol ) {
			keyword_token * this_token = new(nothrow) keyword_token;
			if ( this_token->parse_token(static_cast<symbol_token *>(token)->get_symbol()) == NULL ) {
				delete token;
				token = this_token;
			}
			else
				delete this_token;
		}
		// Add the token to the end of the list
		source_line->token_list.push_back(token);
	}
	return true;
}



And filally, the main.cpp file.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <list>

using namespace std;

#include <cstdio>
#include "BASIC Parser.h"
#include "LineEditor.h"

// Main program entry point
int main(int argc, char *argv[]) {
	string command_line;
	line_editor editor;
	cout << "VEtLINe BASIC line editor - version 1.1" << endl << endl;
	while ( true ) {
		// Output the command prompt
		cout << "[Basic]:  ";
		getline(cin, command_line);
		if ( editor.command(command_line) == true ) break;
	}
}



Part III of this tutorial will concern itself with checking the syntactic correctness of the code, building an Abstract Syntax Tree. Hold on to your seats even tighter. This is where the ride get's really interesting.

Is This A Good Question/Topic? 3
  • +

Replies To: Writing an Interpreter - Part II

#2 eker676  Icon User is offline

  • Software Engineer
  • member icon

Reputation: 378
  • View blog
  • Posts: 1,833
  • Joined: 18-April 09

Posted 10 February 2010 - 11:18 AM

Wow, nice job. These are actually understandable unlike some of the reference out there.
Was This Post Helpful? 0
  • +
  • -

#3 Gebriel  Icon User is offline

  • New D.I.C Head

Reputation: 3
  • View blog
  • Posts: 23
  • Joined: 26-January 09

Posted 13 March 2010 - 01:34 AM

Yeah this is a very nice tutorial.
Was This Post Helpful? 0
  • +
  • -

#4 xrick  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 1
  • Joined: 27-September 09

Posted 23 October 2010 - 09:27 PM

That's really a great tutorial of introducing how to get in this topic. Thanks a lot.
Was This Post Helpful? 0
  • +
  • -

#5 vividexstance  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 659
  • View blog
  • Posts: 2,270
  • Joined: 31-December 10

Posted 04 February 2011 - 01:08 PM

First off, I would like to thank you for all your contributions here, they've been really helpful to me. I have a question though, I see many new statements, so we're obviously using dynamic memory allocation, however, I don't see where any of this dynamic memory is deallocated. Maybe I'm wrong, but I thought when you use dynamic memory allocation that you need to delete them when you're done with them or else it will cause memory leaks. I have also seen this in other tutorials that have thousands of views, so I would think that you don't want to teach people how to use dynamic memory allocation incorrectly.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1