The Basic Language Parser
For those of you that have just jumped in here, please make sure that you have read my tutorial on the Fundamentals of Parsing and Writing An Interpreter - Part I.
The second part of my tutorial covers the Basic language parser. This is a very simplified form of the C++ parser provided in my previous tutorial, as the Basic language is pretty basic :o)
Anybody who has not seen Basic before, I would firstly like to know where you have been all this time? In a nutshell, Basic is a reasonably powerful (not to offend VB users) programming language which provides very simple constructs. In my Basic, you do not need to declare variables before you assign a value to them, and conversion between types is automatic.
All string variables end with a dollar sign ($), so A$, string$ etc are all examples of string variables. Integer variables end with a percent sign (%), so Value%, A% etc are all examples of integer variables.
Finally, real numbers are variables without any special trailing character, so A, FLoat etc are all real.
Basic does not care about case, so Value, VALue, Value, vaLUE are all the same variable.
It should also be noted that A, A% and A$ are three distinct variables, not the same variable defined as real, integer and string within the same program.
Keywords
This Basic will contain the following keywords:
LET, FOR, TO, STEP, NEXT, DIM, IF, THEN, ELSE, ENDIF, READ, DATA, END, FUNCTION, RETURN, PROGRAM,INPUT, PRINT, INT, LEFT$, RIGHT$, MID$, CHR, ASC, PROCEDURE, RANDOMIZE, REM and RND. I will not go into any detail as to what these keywords do, as this tutorial is concerned with parsing the input into tokens.
Modifications to the VETLINE Line Editor
In the previous tutorial, for those of you that studied it, you will have noticed that line input was held in a vector that contained strings.
#ifndef __LINE_EDITOR_INCLUDED__
#define __LINE_EDITOR_INCLUDED__
#pragma once
class line_editor {
private:
vector<string> line_table;
vector<string>::iterator iterator;
long current_line;
public:
line_editor() : current_line(-1) { };
void command(string command);
};
#endif
This has been extended to cater for strings and tokens as shown below.
#ifndef __LINE_EDITOR_INCLUDED__
#define __LINE_EDITOR_INCLUDED__
#pragma once
class line {
public:
string source_line; // Source line as input by user
list<base_token *> token_list; // A list of tokens that are found in the source line
};
class line_editor {
private:
vector<line *> line_table;
vector<line *>::iterator iterator;
line_parser * basic_parser;
long current_line;
public:
line_editor() : current_line(-1) { basic_parser = new(nothrow) line_parser; };
bool command(string command);
};
#endif
When the user inserts a line of text and presses enter, the parser parses that line of input into tokens in exactly the same way as the C++ example in my previous tutorial. Notice in the class line
we store not only the source line itself, but also the tokens in a list. When the complete line has been parsed and all tokens extracted, the tokens themselves will be parsed for syntactic correctness, but this will be discussed in Part III of this tutorial.
Basic Tokens
Basic consists of five token types, namely symbol, integer, literal, punctuation and keyword. The first four will not be alien to those of you that have read the Fundamentals of Parsing tutorial. The keyword token is however new to us. When the token parser is parsing for tokens, any symbol token is checked to see if the symbol is a keyword, a keyword token generated, and the symbol token deleted. It is the keyword token that is then added to the keyword list. The code that performs that little magic act is provided below.
if ( token->get_type() == base_token::t_symbol ) {
keyword_token * this_token = new(nothrow) keyword_token;
if ( this_token->parse_token(static_cast<symbol_token *>(token)->get_symbol()) == NULL ) {
delete token;
token = this_token;
}
else
delete this_token;
}
Error Recovery
During the parse, the only error that can be reported is if a literal is not ended with double quotes before the end of the line is reached. An error message is reported and the user must input the line again. (A neat feature of the Windows command prompt is that if you use the up-arrow, the line reappears and can be edited).
The LineEditor.h file has been included above, here is the BasicParser.h file.
#ifndef __TOKEN_DEFINED__
#define __TOKEN_DEFINED__
class line;
// All tokens must derive from this token type
class base_token
{
public:
typedef enum { t_invalid_token=0, t_symbol, t_integer, t_literal,
t_punctuation, t_keyword
} type_of_token;
private:
type_of_token token_type;
public:
base_token(type_of_token token) : token_type(token) { };
type_of_token get_type() { return token_type; };
virtual const char * parse_token(const char * s) = 0;
};
// A token that may contain a symbol
class symbol_token : public base_token
{
private:
string symbol;
public:
symbol_token() : base_token(t_symbol) { };
const char * parse_token(const char * s);
const char * get_symbol() { return symbol.c_str(); };
};
// A token that will contain a keyword
class keyword_token : public base_token
{
private:
typedef enum { t_let=0, t_for, t_to, t_step, t_next, t_dim,
t_if, t_then, t_else, t_endif, t_read,
t_data, t_end, t_function, t_return, t_program,
t_input, t_print, t_int, t_left_dollar,
t_right_dollar, t_mid_dollar, t_chr, t_asc,
t_procedure, t_end_procedure, t_randomize, t_rem,
t_rnd
} type_of_keyword;
static char * keywords[];
type_of_keyword keyword;
public:
keyword_token() : base_token(t_keyword) { };
const char * parse_token(const char * s);
};
// A token that represents an integer
class integer_token : public base_token
{
private:
string integer_string;
public:
integer_token() : base_token(t_integer) { };
const char * parse_token(const char * s);
};
// A token that represents a literal
class literal_token : public base_token
{
private:
string literal_string;
public:
literal_token() : base_token(t_literal) { };
const char * parse_token(const char * s);
};
// A token that represents a punctuation or separator
class punctuation_token : public base_token
{
private:
string punctuation_string;
public:
punctuation_token() : base_token(t_punctuation) { };
const char * parse_token(const char * s);
};
// The BASIC token line
class line_parser
{
public:
bool parse_string(line * source_line);
};
#endif
Here is the LineEditor.cpp file.
// VEry Tiny LINe Editor - version 1.1
#include <fstream>
#include <iostream>
#include <string>
#include <list>
#include <vector>
using namespace std;
#include <cstdio>
#include "BASIC Parser.h"
#include "LineEditor.h"
// VETLINE editor - this is all straight forward code
bool line_editor::command(string command_line) {
char line_buffer[32];
unsigned int command_line_length;
unsigned char command_char;
int command_count = 0;
// Get the command from the command line
command_line_length = command_line.length();
if ( command_line_length == 0 ) return false;
// Command to upper case character
command_char = (command_line[0]&0xDF);
// Get the command count if there is one
if ( command_line_length >= 1 ) {
unsigned int char_index = 1;
while ( char_index < command_line_length ) {
unsigned char this_char = command_line[char_index++];
if ( this_char >= '0' && this_char <= '9' ) {
command_count = (command_count*10) + (this_char - '0');
}
}
}
// Default to a command count of one if the user
// has not given us a count
if ( command_count == 0 ) command_count = 1;
// If the command is not insert and the list is empty
// then inform the user of such and return to caller
if ( current_line == -1 && command_char != 'I' ) {
cout << "The file is empty" << endl;
return false;
}
// Process the command based upon the letter entered
// We simply ignore commands that we do not recognise
// for this version
switch ( command_char ) {
case 'Q': // Quit VETLINE
{
return true;
}
case 'U': // More up one or more lines
{
if ( current_line == 0 ) {
cout << "At top of file" << endl;
return false;
}
current_line -= command_count;
if ( current_line < 0 ) current_line = 0;
sprintf_s(line_buffer, "%8d* ", current_line + 1);
cout << line_buffer << line_table[current_line]->source_line << endl;
return false;
}
case 'D': // More down one or more lines
{
if ( (current_line+1) == line_table.size() ) {
cout << "At end of file" << endl;
return false;
}
current_line += command_count;
if ( (current_line+1) >= (long)line_table.size() ) current_line = line_table.size()-1;
sprintf_s(line_buffer, "%8d* ", current_line + 1);
cout << line_buffer << line_table[current_line]->source_line << endl;
return false;
}
case 'I': // Insert lines
{
string insertion_string;
line * insertion_line;
int line_number = current_line+1;
int start_number = line_number;
while ( true ) {
current_line++;
sprintf_s(line_buffer, "%8d ", line_number + 1);
cout << line_buffer;
getline(cin, insertion_string);
if ( cin.eof() ) break;
insertion_line = new(nothrow) line;
insertion_line->source_line = insertion_string;
if ( basic_parser->parse_string(insertion_line) == true ) {
line_table.insert(line_table.begin()+current_line, insertion_line);
line_number++;
}
else {
current_line--;
}
}
cin.clear();
current_line = start_number;
return false;
}
case 'R': // Remove one line at a time (count is ignored for now)
{
if ( (current_line+1) == line_table.size() ) {
if ( current_line == 0 ) {
line_table.erase(line_table.begin()+current_line);
current_line = -1;
return false;
}
cout << "At end of file" << endl;
return false;
}
line_table.erase(line_table.begin()+current_line);
return false;
}
case 'S': // Show lines (count is ignored for now)
{
string list_string;
int line_number = 1;
iterator = line_table.begin();
while( iterator != line_table.end() ) {
list_string = (*iterator)->source_line;
if ( iterator == line_table.begin()+current_line )
sprintf_s(line_buffer, "%8d* ", line_number++);
else
sprintf_s(line_buffer, "%8d ", line_number++);
cout << line_buffer << list_string << endl;
++iterator;
}
return false;
}
case 'C': // Show current line
{
sprintf_s(line_buffer, "%8d ", current_line + 1);
cout << line_buffer << line_table[current_line]->source_line << endl;
return false;
}
case 'T': // Move to top of source
{
current_line = -1;
return false;
}
case 'B': // Move to bottom of source
{
current_line = line_table.size();
return false;
}
}
return false;
}
Here is the Basic Parser.cpp file.
/*
The following code parses BASIC source and prints out a list of tokens.
The code is intended to be used as a part of a BASIC interpreter, and
nothing else. Having said that, it must be pointed out that whilst this
parser does correctly identify and store BASIC tokens.
The code has been compiled and tested under the Microsoft VC++ version
10, B2 Release and no guarantee can be made as to it's compatibility with
either other versions of VC++, or any other C++ compiler.
If you have any problems, with this code please do not hesitate to ask.
*/
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <list>
#include <cctype>
using namespace std;
#include "BASIC Parser.h"
#include "LineEditor.h"
char * keyword_token::keywords[] = { "LET", "FOR", "TO", "STEP", "NEXT", "DIM",
"IF", "THEN", "ELSE", "ENDIF", "READ",
"DATA", "END", "FUNCTION", "RETURN", "PROGRAM",
"INPUT", "PRINT", "INT", "LEFT$", "RIGHT$",
"MID$", "CHR", "ASC", "PROCEDURE", "RANDOMIZE",
"REM", "RND" };
#define keyword_count sizeof(keywords)/sizeof(char *)
// parse the rest of a symbol
const char * symbol_token::parse_token(const char * s) {
symbol = *s++;
while ( true ) {
unsigned char input_char = *s++;
if ( isalpha(input_char) || isdigit(input_char) ) {
symbol += input_char;
continue;
}
if ( input_char == '$' || input_char == '%' ) {
symbol += input_char;
return s++;
}
return s;
}
}
// compare the parameter s for a keyword
const char * keyword_token::parse_token(const char * s) {
int keyword_index;
for ( keyword_index = 0; keyword_index < keyword_count; keyword_index++ ) {
if ( strcmp(s, keywords[keyword_index]) == 0 ) {
keyword = (type_of_keyword)keyword_index;
return NULL;
}
}
return s;
}
// parse the rest of an integer
const char * integer_token::parse_token(const char * s) {
unsigned char input_char;
integer_string = *s++;
while ( true ) {
input_char = *s++;
if ( isdigit(input_char) ) {
integer_string += input_char;
continue;
}
return s;
}
}
// parse the rest of a literal
const char * literal_token::parse_token(const char * s) {
unsigned char input_char;
literal_string.clear();
s++;
while ( true ) {
input_char = *s++;
if ( input_char == 0 ) {
cout << " error: missing literal quotes" << endl;
return NULL;
}
if ( input_char != '\"' ) {
literal_string += input_char;
continue;
}
return s;
}
}
// parse the rest of a punctuation sequence - this consists of
// ending up with either one, two or three characters in the
// punctuation string. NB: The sequence .. is accepted as a
// punctuation token, but must be rejected by the compiler at
// some later stage.
const char * punctuation_token::parse_token(const char * s) {
unsigned char input_char= *s++;
punctuation_string = input_char;
switch ( input_char ) {
case '<': // Looking for either <, <> or <=
input_char = *(s+1);
if ( input_char == '>' || input_char == '=' ) {
input_char = *s++;
punctuation_string += input_char; // <> or <= token
break;
}
break;
case '>': // Looking for either > or >=
input_char = *(s+1);
if ( input_char == '=' ) {
input_char = *s++;
punctuation_string += input_char; // >= token
break;
}
break;
}
return s++;
}
// parse the input source
bool line_parser::parse_string(line * source_line) {
base_token * token;
const char * character_pointer = source_line->source_line.c_str();
source_line->token_list.clear();
int line_length = source_line->source_line.length();
while ( true ) {
int input_char;
if ( *character_pointer == 0 ) break;
while ( *character_pointer == ' ' || *character_pointer == 0x09 ) character_pointer++;
input_char = *character_pointer;
// Determine what the leading character is of the sequence,
// create an appropriate token and get the actual token
// class to parse the rest of it (if any)
while ( true ) {
if ( isalpha(input_char) ) {
// Start of a symbol sequence
token = new(nothrow) symbol_token;
break;
}
if ( input_char == '\"' ) {
// Start of literal sequence
token = new(nothrow) literal_token;
break;
}
if ( isdigit(input_char) ) {
// Start of number sequence
token = new(nothrow) integer_token;
break;
}
if ( ispunct(input_char) ) {
// Start of punctuation sequence
token = new(nothrow) punctuation_token;
break;
}
break;
}
if ( token == NULL ) return false;
character_pointer = token->parse_token(character_pointer);
if ( character_pointer == NULL ) return false;
if ( token->get_type() == base_token::t_symbol ) {
keyword_token * this_token = new(nothrow) keyword_token;
if ( this_token->parse_token(static_cast<symbol_token *>(token)->get_symbol()) == NULL ) {
delete token;
token = this_token;
}
else
delete this_token;
}
// Add the token to the end of the list
source_line->token_list.push_back(token);
}
return true;
}
And filally, the main.cpp file.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <list>
using namespace std;
#include <cstdio>
#include "BASIC Parser.h"
#include "LineEditor.h"
// Main program entry point
int main(int argc, char *argv[]) {
string command_line;
line_editor editor;
cout << "VEtLINe BASIC line editor - version 1.1" << endl << endl;
while ( true ) {
// Output the command prompt
cout << "[Basic]: ";
getline(cin, command_line);
if ( editor.command(command_line) == true ) break;
}
}
Part III of this tutorial will concern itself with checking the syntactic correctness of the code, building an Abstract Syntax Tree. Hold on to your seats even tighter. This is where the ride get's really interesting.





MultiQuote






|