
Writing An Interpreter - Part IIIB Syntax Trees

#1 Martyn.Rae

  • The programming dinosaur

Reputation: 540
  • Posts: 1,406
  • Joined: 22-August 09

Posted 06 March 2010 - 02:35 PM

Writing An Interpreter - Part IIIB

The Syntax Tree

This section of the tutorial is concerned with implementing the grammar and producing a syntax tree from the code. The reason for doing this is to ensure that the code presented to the interpreter actually follows the grammar, which in turn simplifies the stages that follow.

This phase is concerned purely with syntax; the semantics of the language are checked by subsequent phases. Error reporting becomes a significant issue when building the syntax tree. I have been careful to ensure that the Basic grammar presented is not ambiguous.

Grammar Conflicts, Error Reporting and Back-tracking

If we take a look at C++, we find that the language is full of grammar conflicts when seen from a parsing perspective. Take, for example, the following two lines of code.

    static unsigned int name = 0;
    static unsigned int name(p1) {
    }



We as programmers can see in an instant that the first line is the declaration of a static unsigned integer with the symbolic name 'name', initialized to zero. The second is a static function declaration called 'name' which takes a parameter p1 and returns an unsigned integer. Now let's look at it from a compiler or interpreter's point of view.

We start with the keyword token static. There are at least two possible routes we can take: is this going to be a static variable declaration, or is it going to be a static function? We have to follow both possible paths before we know whether we are dealing with a variable declaration or a function declaration.

Reading the second, third and fourth tokens does not shed any light on the problem. It is only when we get to the fifth token that we can see what we are dealing with. If we have an open parenthesis, then we are dealing with a function; otherwise we are dealing with a variable declaration.

The compiler/interpreter implements this simply by assuming one route will be the correct one, marking the start position just before the static keyword, and then rewinding back to that position if the route chosen does not work out. This is known as backtrack parsing. In languages such as C++, backtracking can account for a significant amount of the time taken to compile a source file.
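
For illustration only, here is a minimal sketch of the mark-and-rewind idea. The token handling is deliberately simplified to plain strings, and the two try_... routines only check the fifth token as described above; the tutorial's real parser works on base_token objects instead.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Simplified sketch of mark-and-rewind (backtrack) parsing. Tokens are plain
// strings here; the tutorial's real parser works on a list of base_token pointers.
static bool try_variable_declaration(const std::vector<std::string>& t, std::size_t& pos) {
    // static unsigned int name = ...  -- the fifth token must be '='
    if (pos + 4 < t.size() && t[pos] == "static" && t[pos + 4] == "=") { pos += 5; return true; }
    return false;
}

static bool try_function_declaration(const std::vector<std::string>& t, std::size_t& pos) {
    // static unsigned int name ( ...  -- the fifth token must be '('
    if (pos + 4 < t.size() && t[pos] == "static" && t[pos + 4] == "(") { pos += 5; return true; }
    return false;
}

static bool parse_static_declaration(const std::vector<std::string>& t, std::size_t& pos) {
    std::size_t mark = pos;                       // remember where this route started
    if (try_variable_declaration(t, pos)) return true;
    pos = mark;                                   // rewind: the first route did not work out
    if (try_function_declaration(t, pos)) return true;
    pos = mark;                                   // neither route fits: report a syntax error
    return false;
}

int main() {
    std::vector<std::string> decl = {"static", "unsigned", "int", "name", "=", "0", ";"};
    std::vector<std::string> func = {"static", "unsigned", "int", "name", "(", "p1", ")"};
    std::size_t pos = 0;
    std::cout << parse_static_declaration(decl, pos) << "\n";   // 1: the variable route succeeds
    pos = 0;
    std::cout << parse_static_declaration(func, pos) << "\n";   // 1: only the function route succeeds
}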

Now let's confuse things even more. Take these two lines of code.

    static unsigned int name 0;
    static unsigned int name p1);



The compiler/interpreter does not have the ability to deduce that an equals sign is missing from the first statement, and will backtrack to try the alternative (function declaration) route before finally discovering that this did not help either, at which point a syntax error must be reported. Similarly, for the second line an open parenthesis '(' is missing, and it again has no choice but to report a syntax error (i.e. was the second line supposed to read 'static unsigned int name(p1);', or 'static unsigned int name = p1);' where the spurious close parenthesis would be reported anyway?).

What about these two lines, which are clearly incorrect?

    static unsigned int int name;
    static unsigned int name(p1;



In the first line, the compiler can warn the user that 'int' has been used more than once, discard the second 'int' and still produce a valid compilation. For the second line, again, the compiler can deduce that there is a missing ')' at the end of the line and still produce a valid compilation.

So you can see that where an error occurs determines whether the compiler/interpreter can continue with errors or warnings. Obviously, once it has found an error, it cannot just stop the compilation. It simply pretends that the offending line did not exist and tries to do its best with the remainder of the source. This is why one simple error can result in many more errors being produced further down the compilation.
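
One common way of carrying on after an error is so-called panic-mode recovery: report the error, then skip tokens until something that looks like the start of the next statement (for a line-oriented Basic, the next end-of-line token) and resume from there. The following is only a rough sketch with made-up string tokens and an invented "EOL" marker, not the tutorial's actual error handling.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Sketch of panic-mode error recovery: report the error, throw away the rest of
// the offending line, and resume parsing at the next one. "EOL" here is a
// stand-in for whatever token the tokenizer uses to mark the end of a line.
static void report_error(std::size_t line, const std::string& message) {
    std::cerr << "line " << line << ": " << message << "\n";
}

static void skip_to_end_of_line(const std::vector<std::string>& tokens, std::size_t& pos) {
    while (pos < tokens.size() && tokens[pos] != "EOL")
        ++pos;                                  // discard the remainder of the bad line
    if (pos < tokens.size())
        ++pos;                                  // step over the EOL itself
}

int main() {
    std::vector<std::string> tokens = {"LET", "X", "0", "EOL", "PRINT", "X", "EOL"};
    std::size_t pos = 0;
    report_error(1, "expected '=' in LET statement");
    skip_to_end_of_line(tokens, pos);           // pretend line 1 did not exist
    std::cout << "resuming at token '" << tokens[pos] << "'\n";   // prints: PRINT
}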

The Dictionary

Although we are not going to be using a dictionary for our Basic interpreter (at least not during the production of the syntax tree), it is worth discussing the dictionary and how it can be used to assist in grammar conflict resolution. The purpose of the dictionary is to provide a central holding point for the information that we have collected about individual symbols up to this point in the parse. Let us take a C++ example by considering the following statement

    class foo



The syntax parser will have read the token class, so it will have started down a route dealing with classes. If we look at the ISO C++ grammar for classes, we have the following:

    nested-name-specifier: class-or-namespace-name :: nested-name-specifier_opt

    class-name:            identifier
                           template-id

    class-specifier:       class-head { member-specification_opt }

    class-head:            class-key identifier_opt base-clause_opt
                           class-key nested-name-specifier identifier base-clause_opt

    class-key:             class
                           struct
                           union



Once the syntax parser has read class, we have the problem of dealing with the symbol foo. Foo, according to the grammar, can be either the start of a nested-name-specifier or an identifier. This is not possible to determine unless we have retained some information about previous encounters with the symbol foo. This is where the dictionary comes into play. If we look up foo in the dictionary and find we have not seen this symbol before, then foo is an identifier and the parse can progress. If the dictionary lookup reveals that we have encountered foo as either a forward class declaration in the form class foo; or as a namespace declaration in the form namespace foo {, then we know that this is not an identifier, and we can backtrack to the start of the next route, where we will find that foo is the start of a nested-name-specifier.
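
A dictionary of this kind is essentially a map from a symbol's spelling to whatever has been learned about it so far. Here is a minimal sketch; the names symbol_kind and dictionary below are mine, not part of the tutorial's classes.

#include <iostream>
#include <map>
#include <string>

// Minimal sketch of a parser dictionary: it records what kind of entity each
// symbol was previously declared as, so the parser can choose the right route.
enum class symbol_kind { unknown, class_name, namespace_name };

class dictionary {
    std::map<std::string, symbol_kind> entries;
public:
    void declare(const std::string& name, symbol_kind kind) { entries[name] = kind; }
    symbol_kind lookup(const std::string& name) const {
        auto it = entries.find(name);
        return it == entries.end() ? symbol_kind::unknown : it->second;
    }
};

int main() {
    dictionary dict;
    dict.declare("foo", symbol_kind::class_name);    // previously seen as: class foo;
    // Later, while parsing "class foo ...", the lookup tells us that foo already
    // names a class, so it cannot be a plain identifier introducing a new name.
    if (dict.lookup("foo") == symbol_kind::class_name)
        std::cout << "foo was previously declared as a class\n";
}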

Our version of Basic does not need a dictionary during the syntax parse, as the grammar has been tailored so that ambiguities like the C++ example above do not exist.

The Syntax Tree Structure

If we take a look at the if statement class as an example, you will be able to get some idea of how statements are defined as classes.

class if_statement : public statement {
    private:
        expr * condition;                        // the IF condition expression
        list<statement *> true_statement_list;   // statements run when the condition is true
        list<statement *> false_statement_list;  // statements in the optional ELSE block
    public:
        if_statement(unsigned long line_number) : statement(t_if_statement, line_number) {}
        // Walks the token list from program_iterator, building the tree for this
        // IF statement, and returns the updated iterator.
        list<base_token *>::iterator&
            parse_statement(list<base_token *>& program,
                            list<base_token *>::iterator& program_iterator,
                            list<funcproc *>& call_list);
};




You can see that the class has three data members: one for the condition, one for the statements in the true branch and one for the statements in the false branch. This corresponds to the if statement grammar rule

if_stmt: ‘IF’ condition ‘THEN’ stmt_block_opt else_block_opt ‘END’ ‘IF’
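
To make the correspondence concrete, here is a heavily simplified sketch of the order in which an IF statement is consumed. The real parse_statement works on the base_token and statement classes built up in the earlier parts; in this sketch the tokens are plain strings, the condition is a single token, and the body statements are merely skipped rather than parsed.

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Heavily simplified sketch of the parse order the if_stmt rule implies.
static void expect(const std::vector<std::string>& t, std::size_t& pos, const std::string& want) {
    if (pos >= t.size() || t[pos] != want)
        throw std::runtime_error("expected " + want);
    ++pos;
}

static void parse_if_statement(const std::vector<std::string>& t, std::size_t& pos) {
    expect(t, pos, "IF");
    ++pos;                                           // condition (a single token in this sketch)
    expect(t, pos, "THEN");
    while (pos < t.size() && t[pos] != "ELSE" && t[pos] != "END")
        ++pos;                                       // true_statement_list (optional stmt_block)
    if (pos < t.size() && t[pos] == "ELSE") {
        ++pos;
        while (pos < t.size() && t[pos] != "END")
            ++pos;                                   // false_statement_list (optional else_block)
    }
    expect(t, pos, "END");
    expect(t, pos, "IF");
}

int main() {
    std::vector<std::string> tokens = {"IF", "x", "THEN", "PRINT_X", "ELSE", "PRINT_Y", "END", "IF"};
    std::size_t pos = 0;
    parse_if_statement(tokens, pos);                 // throws std::runtime_error on a syntax error
}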

The next step

The next sub-section of Part III will look at why the syntax tree is not particularly helpful in our quest to interpret our Basic source code (most compilers go through the motions of building a syntax tree, but convert it 'on the fly' into abstract syntax tree constructs without formally deriving a complete syntax tree), and what we have to do to produce something from which we can generate pseudo-code.

Writing an Interpreter - Part IIIC


Replies To: Writing An Interpreter - Part IIIB

#2 Aphex19

  • Born again Pastafarian.

Reputation: 614
  • Posts: 1,873
  • Joined: 02-August 09

Posted 07 March 2010 - 04:29 PM

This is some heavy reading :wacko:

I'll have to get my thinking cap on for this one, thanks loads :)

This post has been edited by Aphex19: 07 March 2010 - 04:30 PM

