4 Replies - 48110 Views - Last Post: 02 March 2012 - 04:08 PM Rate Topic: ***** 1 Votes

#1 blackcompe

An Introduction to Compiler Design - Part II - Parsing

Post icon  Posted 01 March 2012 - 03:56 AM




A. Introduction

Welcome to part II of the compiler design tutorial! Parsing is a fairly complex subject, so the examples I present are basic, but they're enough for you to see how to extend them to support a full programming language. We'll parse a conditional statement (if/then), arithmetic statements, and a few others.

After the source code has been tokenized, the parsing phase commences. At the end of this stage, if the source code is syntactically valid, the compiler has generated either (1) an abstract syntax tree (AST) or (2) a syntax-directed translation (SDT) of the source code. These are known as intermediate representations (IR): forms that can be optimized easily and later translated into assembler. An AST is a simplified form of the source code that represents its syntactic structure, while SDT produces an assembler-like form that closely mirrors the actual layout of runtime instructions; SDT essentially skips the syntax tree altogether. I'll cover SDT in part III.

B. Grammar and Language

Grammar provides a precise way to specify the syntax (structure or arrangement of composing units) of a language. In grade school we take grammar lessons that teach us to speak and write proper English. They teach us the correct way to form sentences with subjects, predicates, noun phrases, verb phrases, etc.

Subjects, predicates, and phrases are some of the composing units of a sentence in English; similarly, if/else statements, assignment statements, and function definitions are some of the composing units of source code, which itself is a single sentence of a particular programming language. There are a very large number of valid English sentences one could compose; likewise, there are a large (probably infinite) number of valid source code programs one could create.

If someone says "on the computer she is," we immediately recognize that the sentence is ill-formed. Its structure is invalid, because the noun phrase should precede the verb phrase. It should be: "She is on the computer." Diagramming is used to validate the sentence, or rather to specify its syntax. If you take a look at a diagramming article, you'll see that the model is exactly like an AST. So it goes without saying that parsing, or more formally "syntactic analysis," has its roots in linguistics.

Moreover, just as in English, programming languages need to be specified in a way that allows us to verify whether a sentence of the language is valid. That's where context-free grammars (CFG) come into play; they allow us to specify the syntax of a programming language's source code.


C. Context-Free Grammars

A context-free grammar is a set of rules that specify how sentences can be structured; this set of rules can be defined recursively. All CFGs have (1) a start symbol, (2) a set of non-terminal symbols, (3) a set of terminal symbols, and (4) a set of productions or "re-write rules." The following is a CFG that describes simple mathematical expressions that aren't aware of precedence or associativity rules.


expr → expr op expr

expr → (expr)

expr → -expr

expr → id

op → * | / | + | -

id → a | b | c


Note: the operator symbols *, /, +, and - on the right sides are terminals (they would normally be set in bold).

Each line is a production (or "rule"). expr, op, id, *, /, +, -, a, b, and c are all symbols. The pipe (|) indicates alternatives: any one of the listed symbols can serve as the replacement for the left side of the production, but you can only choose one. When I say "left side of the production," I mean everything to the left of the arrow. You can see that some of the productions are recursively defined.

The start symbol is the symbol from which all sentences in the language can be derived. It's also where we begin. Usually, the symbol on the left side of the first production is the designated start symbol, but sometimes it's specified explicitly if it's not in the first production. So expr is the start symbol.

Non-terminals are symbols that can be replaced by the right side of their productions. They are symbols that represent a set of strings that may contain other non-terminals and terminals. They are a syntactic "variable" or "category" because there are many strings they can be replaced with. Each symbol on the left side is a non-terminal. Terminals are the symbols that can't be rewritten. They are the tokens, or basic units, of which strings in the language are composed. The terminals correspond to the set of tokens returned by the lexer. In the grammar above, *, /, +, -, a, b, and c are the terminals. Here's another simple CFG that describes a few common programming constructs.


stmt → if equality_expr then stmt else stmt

stmt → do stmtList while equality_expr

stmtList → stmt | stmt; stmtList


Here's a set of productions that describe the Integer class in part I. This CFG partially describes our notion of a Java class, not just a class named Integer.

Spoiler


D. Derivations

When we apply the productions starting from the start symbol we try to derive a sentence in the language, that is, a string of terminals. a + b * c is a sentence of the language defined by our first CFG above. Similarly, the source code we write is a sentence of some language defined by some CFG. Let's walk through a derivation of the mathematical expression a + c using the first CFG above.

  • Expand the start symbol, expr. This means replacing the non-terminal in question with the appropriate string from the right side of its production. There are several to choose from: expr op expr, (expr), -expr, and id. Pick the one you think will derive a + c in the fewest steps. expr op expr seems like the right one, but why? If we look at the expression, we can see that it matches up (or will match up in a few steps) with that production exactly: the terminal a is an id, which is an expr; + is an op; and c's case is the same as a's. As you can see, a little intuition has to be exercised here. Note that we just did a recursive expansion. The current derivation is as follows:

    expr → expr op expr

    I will use the term "expansion" many times throughout this part of the tutorial. Strictly speaking, an expansion is one derivation step; the reverse operation, replacing the right side of a production with the non-terminal on its left, is called a "reduction" and shows up later in bottom-up parsing.

  • We are now working on expr op expr. Expand the first non-terminal from the left. Replace expr with the most appropriate production. Like we said above, since a is an id, we ought to use the production: expr → id. The current derivation becomes:

    expr → expr op expr → id op expr


  • We are now working on id op expr. Since id is a non-terminal we expand that instead of moving on to op. id only has one production: id → a | b | c. We can choose between the three. We choose a. The current derivation becomes:

    expr → expr op expr → id op expr → a op expr

    Because we keep expanding the leftmost non-terminal first, we call this a leftmost derivation.

  • op is now the leftmost non-terminal, so we expand it to *, /, +, or -. We choose op → + . The current derivation becomes:

    expr → expr op expr → id op expr → a op expr → a + expr

    By now, you probably understand what's happening.

  • expr is now the leftmost non-terminal, so expand it to id.

    expr → expr op expr → id op expr → a op expr → a + expr → a + id

  • And lastly, we expand id to c. Our full derivation becomes:

    expr → expr op expr → id op expr → a op expr → a + expr → a + id → a + c


And that's it folks! We've ensured the expression is syntactically valid and is therefore a sentence of the language described by our CFG. The final string is all terminals. This must be true for all derivations of any string in the language. If you had any trouble following along, this example should help to clarify the procedure.

Suppose the expression was a + d. At step 6, when we try to expand id to d, we fail, because there's no production for that. If the expression was d + a, we would have failed at step 3. If the expression was b % a, we would have failed at step 4.

Here's the derivation for ( ( ( -a * b ) - ( c + b ) ) / c ) using the CFG above:


expr → ( expr ) → ( expr / expr ) → ( ( expr ) / expr )

→ ( ( expr - expr ) / expr ) → ( ( ( expr ) - ( expr ) ) / expr )

→ ( ( ( expr * expr ) - ( expr ) ) / expr )

→ ( ( ( expr * expr ) - ( expr + expr ) ) / expr )

→ ( ( ( -expr * expr ) - ( expr + expr ) ) / expr )

→ ( ( ( -id * expr ) - ( expr + expr ) ) / expr )

→ . . .

→ ( ( ( -id * id ) - ( id + id ) ) / id ) → ( ( ( -a * b ) - ( c + b ) ) / c )


And, here's the derivation for our Integer class using the CFG defined in the spoiler above:


program → PUBLIC CLASS id LCURLY classBody RCURLY

→ PUBLIC CLASS ID LCURLY classBody RCURLY

→ PUBLIC CLASS ID LCURLY declaration RCURLY

→ PUBLIC CLASS ID LCURLY fieldDeclaration RCURLY

→ PUBLIC CLASS ID LCURLY type id SEMICOLON RCURLY

→ PUBLIC CLASS ID LCURLY INT id SEMICOLON RCURLY

→ PUBLIC CLASS ID LCURLY INT ID SEMICOLON RCURLY


The first ID terminal would be associated with the string Integer, and the second with the string value; the lexer returns these strings along with their tokens.

Let's extend upon our source code CFG (see spoiler) to support simple methods with variable initialization statements. We won't support the notion of an instance variable.


program → PUBLIC CLASS id LCURLY classBody RCURLY

classBody → declarationList

declarationList → declaration declarationList | epsilon

declaration → methodDeclaration

methodDeclaration → PUBLIC STATIC VOID id LPAREN RPAREN LCURLY methodBody RCURLY

methodBody → variableInitializationList

variableInitializationList → variableInitializationStatement variableInitializationList | epsilon

variableInitializationStatement → type id ASSIGN literal SEMICOLON

literal → intLiteral | strLiteral | true | false

type → INT | BOOLEAN | STRING

intLiteral → INTLITERAL

strLiteral → STRLITERAL

id → ID


Note that epsilon means "nothing," i.e., there's nothing left to match. So a variableInitializationList can be reduced to nothing, meaning either the method didn't have any initialization statements or we've matched several initializations and there aren't any left. INTLITERAL is a terminal that corresponds to a lexer token returned when an integer literal is matched. Examples of integer literals are -123, 0, 432434, etc. The regular expression is:

INTLITERAL = -?[0-9]+
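As a quick sanity check, the integer-literal pattern can be exercised with java.util.regex (written here as -?[0-9]+ so it requires at least one digit and can't match a lone minus sign; the class name is just for illustration):

```java
import java.util.regex.Pattern;

// Quick check of an integer-literal pattern. -?[0-9]+ requires at least
// one digit, so a lone "-" is rejected.
public class IntLiteralCheck {
    static final Pattern INTLITERAL = Pattern.compile("-?[0-9]+");

    static boolean isIntLiteral(String s) {
        return INTLITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIntLiteral("-123")); // true
        System.out.println(isIntLiteral("0"));    // true
        System.out.println(isIntLiteral("-"));    // false
    }
}
```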



STRLITERAL is a terminal that corresponds to a lexer token returned when a string literal is matched. String literals are strings in double quotes: "one", "_two?", "\n\t three". Its regular expression is similar to the one for ID, which I talked about earlier.

Now, let's create our new source code:


public class IntegerTest {
    public static void main() {
        int num = 20;
        boolean isNegative = false;
    }

    public static void toString() {
        String formattedString = "%1$-7d";
    }
}




Let's derive it! If you follow the non-terminals through the derivation, you can capture its essence quickly.


Spoiler


My point in doing this incredibly long leftmost derivation is to show you that the derivation proceeds in the same order as the execution of the program instructions. This is the first important point! We derived the first integer assignment first, not second or third. Then, we derived the boolean assignment second, and so on.

As you can see, CFGs are very powerful: they can describe sentences of unbounded size (as long as your system's memory can accommodate the parse). I could have added 100,000 methods to IntegerTest and still validated its syntax.

E. Abstract Syntax Trees

An AST is a type of parse tree that's used in source code compilation. It's abstract because minor details in the code like semicolons and braces are omitted. Parse trees that include those details are called concrete syntax trees. Enough information is stored to preserve the meaning of the program.

Each tree node represents a grammar symbol. Every node that represents a non-terminal in the tree has a collection of child nodes that are either non-terminals or terminals. The child nodes represent the symbols that were produced from expanding the non-terminal. The leaves in the tree are always terminals and the interior nodes are always non-terminals. Terminal nodes are usually variable identifiers or literals.

During a derivation, for each non-terminal expansion that we do, if we create a tree node to represent the non-terminal and add it as a child to the node representing the non-terminal we expanded beforehand, we'll build up a tree of nodes representing the sentence structure. This is the second important point.

Here's a partial AST for IntegerTest:

Posted Image

It just so happens that a preorder tree traversal on our AST visits nodes in the same order as the derivation, so logically it follows that the traversal follows the order in which instructions are executed. This is the third and final important point. And that my friends is the way we generate assembler that preserves the exact order in which high-level instructions are written! We recurse through the AST, generating the corresponding assembler for the node type. Beautiful isn't it?
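The point about preorder traversal is easy to see in a few lines. Here's a sketch with a generic node type (names are illustrative), building the AST shape for a + c from earlier and visiting it in preorder:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: a generic AST node whose preorder traversal visits
// nodes in derivation (and hence execution) order.
public class PreorderDemo {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) children.add(k);
        }
    }

    // Visit the node itself first, then each child left to right.
    static void preorder(Node n, List<String> out) {
        out.add(n.label);
        for (Node c : n.children) preorder(c, out);
    }

    public static void main(String[] args) {
        // The tree for "a + c" via expr → expr op expr
        Node root = new Node("expr",
                new Node("expr", new Node("a")),
                new Node("op", new Node("+")),
                new Node("expr", new Node("c")));
        List<String> visited = new ArrayList<>();
        preorder(root, visited);
        System.out.println(visited); // [expr, expr, a, op, +, expr, c]
    }
}
```

Note how the visit order matches the leftmost derivation we did by hand.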

Translating assembler instructions line for line isn't difficult. No parsing is needed. This is essentially what an assembler does. It translates assembly statements to their binary counterparts. The nice thing is that the instructions execute in sequence.


0x1     li $v0, 1   
0x2     syscall




It's guaranteed that the system call instruction will be executed directly after the load immediate instruction for this program. So an assembler doesn't have to worry about maintaining the order of executing instructions. An assembler literally goes through each line and translates it to machine code.
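To make the contrast concrete, a line-for-line assembler really can be little more than a table lookup. This sketch uses made-up opcode bit patterns purely for illustration; a real assembler would use the encodings its ISA manual specifies:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy assembler sketch: each mnemonic maps straight to an opcode string.
// The bit patterns below are invented for illustration only.
public class ToyAssembler {
    static final Map<String, String> OPCODES = new LinkedHashMap<>();
    static {
        OPCODES.put("li", "001101");      // hypothetical encoding
        OPCODES.put("syscall", "000000"); // hypothetical encoding
    }

    // Translate one line; no parsing beyond grabbing the mnemonic.
    static String translate(String line) {
        String mnemonic = line.trim().split("\\s+")[0];
        String bits = OPCODES.get(mnemonic);
        if (bits == null) throw new IllegalArgumentException("unknown: " + mnemonic);
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(translate("li $v0, 1"));
        System.out.println(translate("syscall"));
    }
}
```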

What makes compilation more difficult is that you have to translate a program where high-level instructions jump from one method to another.


Line 0: void bar() {
Line 1:      System.out.println(1);
Line 2: } 
Line 3: void foo() {
Line 4:      System.out.println(2);
Line 5: }




I can't just go from line 0 to 5 and translate each statement into a runtime instruction. foo() may never even be called in the program. In short, there isn't a direct translation like with assembler. In assembler, when you see a load immediate instruction you just translate it to 010101011010101010 or whatever the ISA calls for.

When you generate assembler from a node in the AST for a method call, there's a lot more to do. You have to save the state of your registers (save the local variables), plop arguments onto the stack (send the method parameters), set stack and frame pointers (don't destroy another method's state!), check the symbol table (how much stack space do I need?), etc.

The other nice thing about ASTs is that they can be represented in computer programs. It's just a regular ol' node-based tree. As the parser runs through its parse algorithm, it builds a tree with nodes that are specialized. The visitor pattern is very useful when working with ASTs because it allows the compiler to be extensible (supporting more CPU architectures). An example of AST nodes:


import java.io.FileWriter;
import java.io.IOException;

class ProgramNode {
    ClassBodyNode classBody;

    ProgramNode(ClassBodyNode classBody) {
        this.classBody = classBody;
    }

    // FileWriter.append can fail, so the generator methods declare IOException.
    void generateMIPSAssembler(FileWriter wtr) throws IOException {
        wtr.append(".text\n");
        classBody.generateMIPSAssembler(wtr);
    }
}

class ClassBodyNode {
    DeclarationListNode declList;

    ClassBodyNode(DeclarationListNode declList) {
        this.declList = declList;
    }

    void generateMIPSAssembler(FileWriter wtr) throws IOException {
        declList.generateMIPSAssembler(wtr);
    }
}
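To illustrate the visitor pattern mentioned above, here's one way such nodes might be reworked so code generation lives in a visitor instead of in the nodes. This is only a sketch: the class names mirror the example but are illustrative, and output goes to a StringBuilder to keep it self-contained. Adding a CPU architecture then means adding a visitor, not editing every node class:

```java
// Sketch of the visitor pattern for AST nodes. Each backend (MIPS, x86, ...)
// is one visitor class; nodes only know how to accept a visitor.
interface NodeVisitor {
    void visit(ProgramNode n);
    void visit(ClassBodyNode n);
}

class ProgramNode {
    ClassBodyNode classBody;
    ProgramNode(ClassBodyNode classBody) { this.classBody = classBody; }
    void accept(NodeVisitor v) { v.visit(this); }
}

class ClassBodyNode {
    void accept(NodeVisitor v) { v.visit(this); }
}

class MIPSGenerator implements NodeVisitor {
    final StringBuilder out = new StringBuilder();
    public void visit(ProgramNode n) {
        out.append(".text\n");       // emit the section header
        n.classBody.accept(this);    // then recurse into the class body
    }
    public void visit(ClassBodyNode n) {
        out.append("# class body\n");
    }
}

public class VisitorDemo {
    public static void main(String[] args) {
        ProgramNode program = new ProgramNode(new ClassBodyNode());
        MIPSGenerator gen = new MIPSGenerator();
        program.accept(gen);
        System.out.print(gen.out);
    }
}
```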




F. Regular Expressions vs. Context-Free Grammars

CFG can describe all regular sets, but it's overkill for regular expressions. Regex is a concise and powerful notation for describing patterns. Here's an example of Regex vs. CFG.

Our regular set:

S = {"ab-", "ab-dd", "ac-", "ac-ddd", . . .}

Regex pattern:

a(b|c)-d*

CFG


S → aA

A → b-B | c-B

B → dB | epsilon


Here epsilon denotes the empty string.

Do the derivation if you want, it works.
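You can also check membership in the set with the regex directly. A quick sketch (class and method names are illustrative):

```java
import java.util.regex.Pattern;

// Checking membership in the regular set S with the pattern a(b|c)-d*.
public class RegexVsCfg {
    static final Pattern P = Pattern.compile("a(b|c)-d*");

    static boolean inSet(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inSet("ab-"));    // true
        System.out.println(inSet("ac-ddd")); // true
        System.out.println(inSet("ad-"));    // false
    }
}
```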

The process involved in creating a parse table is quite complex; it's inefficient to use a CFG when an easier array-based regex solution exists. Regex implementations are much, much easier to construct than parsers, as we're about to see. I'm totally disregarding advanced regex features like lookahead, reluctant quantifiers, capturing groups, etc.; but even with those added, parsers are still more complex. Letting regex handle lexing also keeps the compiler modular, with interchangeable front-ends.

G. Parsing Methods and Implementations

We saw how lexers were built programmatically in the last section, and in the future parts of this tutorial we'll see how to programmatically generate assembler. In this part we saw how to define a CFG for the beginnings of a programming language and how to derive a given sentence while building an AST for it. However, all of what we've learned about parsing is useless if we can't do it programmatically. There are many parsing methods, each of which requires a fair bit of detailed explanation, so we're only going to hit the tip of the iceberg, mainly focusing on top-down parsing.

There are two types of parsing methods: top-down and bottom-up. Everything I've shown you up to this point is an example of top-down parsing. "Top-down" is pretty much self-explanatory: we drill down through each non-terminal until we get to a terminal, and we build our tree from the root node down to the leaves. It's important to note that we drill down from left to right, replacing the leftmost non-terminal first; the definitive meaning of top-down parsing is "an attempt to find a leftmost derivation." In bottom-up parsing we instead construct a rightmost derivation in reverse, working from the terminals back up toward the start symbol.

Ambiguity

Ambiguous grammars are those in which a string of the language has more than one parse tree. This is problematic because it may be hard to interpret the intended meaning of the string. Here's an example from Wikipedia's entry on ambiguous grammars.


x * y;


That C statement can be interpreted as the multiplication of two variables, x and y, or as the declaration of a variable y whose type is a pointer to x. To resolve the conflict the compiler must locate y's type information in the symbol table. If it's a numerical type the statement is interpreted as an expression.

Generally speaking, ambiguity is an unwanted feature of any grammar and may pose a threat to the correctness of both top-down and bottom-up parsers. Different parsers handle it with varying efficacy. In spite of all this, ambiguity isn't always a problem. It's possible to generate a non-ambiguous language from an ambiguous grammar. Even if there are two parse trees that generate a string, as long as it has one intended meaning there's no problem. Some parser generators allow specifying precedence and associativity rules to remove any ambiguity.

Bottom-Up Parsing

In bottom-up parsing the derivation starts from the string of terminals (our sentence), and we try to work back to the start symbol of our CFG. It's essentially a top-down derivation backwards. Instead of replacing a non-terminal with a string of symbols (drilling down), we replace a string of symbols with a non-terminal (drilling up); at certain points we may even replace several symbols with a single non-terminal. Read in reverse, this sequence of reductions forms a rightmost derivation. When we make a replacement we create a node that becomes the parent of some other nodes instead of their child.

Top-Down Parsing

There are several problems with top-down parsing. (1) Left recursion can lead to infinite parsing loops, so it must be eliminated. Left recursion in a CFG production occurs when the non-terminal on the left side also appears first on the right side of the arrow. For some bad input, we might find ourselves continuously expanding the same non-terminal. At the beginning of this tutorial we had


expr → expr op expr

expr → (expr)


That's left recursion. There are simple algorithms to remove it, but the CFG often ends up twice as long. (2) Top-down parsing may involve backtracking. Backtracking is the act of climbing back up the derivation (the parse), reversing everything you've done to try another derivation path. We end up re-scanning the input as well. If you're inserting information into a symbol table (explained later) as the parse proceeds, everything has to be removed. That's pretty costly in a 10-million-line application. The need for backtracking can be eliminated by parsing with lookahead. Note that backtracking isn't restricted to top-down parsers; there are backtracking LR parsers as well. Finally, (3) the order in which we choose non-terminal expansions can cause valid inputs to be rejected without any information as to why.
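As a sketch of one standard removal algorithm: a production of the form A → Aα | β is replaced by A → βA' and A' → αA' | epsilon, where A' is a new non-terminal. Applied to the expression grammar above (with a new non-terminal expr'), this gives:


expr → ( expr ) expr' | -expr expr' | id expr'

expr' → op expr expr' | epsilon


The transformed grammar derives the same strings, but no production's right side begins with its own left side, so a top-down parser can no longer loop forever.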

Types of Top-Down Parsers

There are two types of top-down parsers: recursive-descent parsers and predictive parsers.

A recursive-descent parser consists of a set of functions that construct a leftmost derivation of the input. The functions don't actually have to be recursive, but they can be. Each function implements a grammar rule. RD parsers may need to implement backtracking.

Below is an example of a grammar that requires backtracking.


S → cAd

A → ab | a


The input is cad.

After matching c, we go to expand A, with a as the next character of input. We can't tell whether to choose the alternative ab or a without looking at the character after a, which is d; that is, we'd need two characters of lookahead, but we only have access to a. Suppose we choose the first alternative, ab. We match a and advance to the next input token, d. We then try to match b, but the next token is d, so we have to backtrack and choose the alternative a instead. To implement backtracking in code, we have to save a pointer to our place in the input before trying an expansion, and possibly throw away information we've recorded. That's costly when you're deep into a parse of many lines of source code.
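A minimal sketch of a backtracking recursive-descent parser for this grammar (method names are illustrative):

```java
// Backtracking recursive-descent parser for:
//   S → cAd
//   A → ab | a
// Before trying an alternative we save the input position and restore it
// if that alternative fails.
public class BacktrackParser {
    private final String input;
    private int pos;

    BacktrackParser(String input) { this.input = input; }

    private boolean match(char c) {
        if (pos < input.length() && input.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    // A → ab | a
    private boolean parseA() {
        int save = pos;                 // remember where we were
        if (match('a') && match('b')) return true;
        pos = save;                     // backtrack and try the next alternative
        return match('a');
    }

    // S → cAd, and the whole input must be consumed
    boolean parseS() {
        return match('c') && parseA() && match('d') && pos == input.length();
    }

    public static void main(String[] args) {
        System.out.println(new BacktrackParser("cad").parseS());  // true
        System.out.println(new BacktrackParser("cabd").parseS()); // true
        System.out.println(new BacktrackParser("cbd").parseS());  // false
    }
}
```

For the input cad, parseA first tries ab, fails at d, restores the saved position, and succeeds with a — exactly the backtracking step described above.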

A predictive parser is a recursive-descent parser that needs no backtracking: it can predict which derivation path to take using a fixed amount of lookahead, i.e., by examining the next few tokens of input before deciding which non-terminal expansion to take. This type of parser only works for LL(k) grammars, where k is the number of lookahead tokens the parser needs to make its decisions. An LL(k) grammar must not be ambiguous or contain left recursion. This is a major restriction for compiler designers, which is why LR parsers are often preferred.

An LL parser is a predictive parser that reads the input from left to right and constructs a leftmost derivation on LL(k) grammars.

Let's go back to one of the CFGs from before.


program → PUBLIC CLASS id LCURLY classBody RCURLY

classBody → declaration

declaration → fieldDeclaration | methodDeclaration

fieldDeclaration → type id SEMICOLON

methodDeclaration → PUBLIC VOID id LPAREN RPAREN SEMICOLON

type → INT | BOOLEAN | STRING

id → ID


This grammar only allows for one class declaration, so you can either have a member field or a member function signature, but not both. Below is a code snippet for a recursive-descent LL parser for the above grammar. It directly mimics the CFG. The code is an LL(1) parser since we only use the next token in the input to make our expansion decisions. As you can see we don't need to save a pointer to a place in the input or logic to backtrack and try another path through the derivation.

I've left out code to build the tree, but I've commented where you'd insert the tree-building statements.
The input is formatted for easy tokenization to keep things simple.

Spoiler
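As a rough idea of what that code looks like, here's a condensed LL(1) recursive-descent sketch for this grammar. Token and method names are illustrative, and tree-building is omitted here too; each non-terminal gets one method, and a single token of lookahead picks the alternative:

```java
import java.util.List;

// Condensed LL(1) recursive-descent sketch: one method per non-terminal,
// one token of lookahead, no backtracking.
public class RDParser {
    private final List<String> tokens;
    private int pos;

    RDParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "EOF"; }

    private void expect(String t) {
        if (!peek().equals(t))
            throw new RuntimeException("expected " + t + ", got " + peek());
        pos++;
    }

    // program → PUBLIC CLASS id LCURLY classBody RCURLY
    void program() {
        expect("PUBLIC"); expect("CLASS"); expect("ID"); expect("LCURLY");
        classBody();
        expect("RCURLY");
    }

    // classBody → declaration
    void classBody() { declaration(); }

    // declaration → fieldDeclaration | methodDeclaration
    // The next token alone decides: PUBLIC starts a method, a type keyword a field.
    void declaration() {
        if (peek().equals("PUBLIC")) methodDeclaration();
        else fieldDeclaration();
    }

    // fieldDeclaration → type id SEMICOLON
    void fieldDeclaration() { type(); expect("ID"); expect("SEMICOLON"); }

    // methodDeclaration → PUBLIC VOID id LPAREN RPAREN SEMICOLON
    void methodDeclaration() {
        expect("PUBLIC"); expect("VOID"); expect("ID");
        expect("LPAREN"); expect("RPAREN"); expect("SEMICOLON");
    }

    // type → INT | BOOLEAN | STRING
    void type() {
        String t = peek();
        if (t.equals("INT") || t.equals("BOOLEAN") || t.equals("STRING")) pos++;
        else throw new RuntimeException("expected a type, got " + t);
    }

    public static void main(String[] args) {
        new RDParser(List.of("PUBLIC", "CLASS", "ID", "LCURLY",
                "INT", "ID", "SEMICOLON", "RCURLY")).program();
        System.out.println("parsed OK");
    }
}
```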


Easy enough. In parts IV and VI I will add in the tree building statements, so that we have an AST to experiment with. You'll be able to implement your own semantic analysis and code generation from the tree.

It's impractical to think this is an effective way to parse sentences for all grammars. The parsing logic is hard coded. We need something dynamic that we can load with parsing logic. We need a table-based parser.

Table-Based Parsing

A more flexible way of implementing a parser is to make it table-based, where we introduce a parsing table (a 2d array) and a stack. Each entry in the parse table holds a production (the expansion to apply), and each entry on the stack is a grammar symbol.

Here's a simple run-down of a table-based LL(1) parsing algorithm:

The bottom of the stack starts with $, with the grammar's start symbol on top of it. Then repeat the following:

  • Let X be the symbol on top of the stack and a the current token of the input.
  • If X is a terminal: if it equals a, pop X off the stack and consume a; if it doesn't, an error has occurred.
  • If X is a non-terminal: look up Table(X, a), the right side that X expands to when a is the next token in the input, pop X, and push that string of symbols onto the stack in reverse order, so the leftmost symbol ends up on top (we're doing a leftmost derivation). If there's no entry in the table for X and a, an error has occurred.
  • Stop when X equals $, i.e., there are no more stack symbols to process.

That's literally the entire algorithm.
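The run-down above can be sketched in code. To keep the table tiny, this uses the toy grammar S → ( S ) S | epsilon (balanced parentheses) rather than one of our source-code grammars, and the table lookup is inlined as a method; a generated parser would index a 2-D array built from the FIRST and FOLLOW sets:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Table-driven LL(1) sketch for the grammar  S → ( S ) S | epsilon.
public class TableDrivenLL1 {
    // Table(X, a): the right side to push for non-terminal X on lookahead a.
    static String[] lookup(char nonTerminal, char lookahead) {
        if (nonTerminal == 'S') {
            if (lookahead == '(') return new String[]{"(", "S", ")", "S"};
            if (lookahead == ')' || lookahead == '$') return new String[]{}; // epsilon
        }
        return null; // empty table entry: syntax error
    }

    static boolean parse(String input) {
        input = input + "$";                     // end-of-input marker
        Deque<Character> stack = new ArrayDeque<>();
        stack.push('$');
        stack.push('S');                         // start symbol on top
        int pos = 0;
        while (!stack.isEmpty()) {
            char x = stack.pop();
            char a = input.charAt(pos);
            if (x == '(' || x == ')' || x == '$') {      // terminal or end marker
                if (x != a) return false;
                pos++;
            } else {                                     // non-terminal: consult table
                String[] rhs = lookup(x, a);
                if (rhs == null) return false;
                for (int i = rhs.length - 1; i >= 0; i--) // push right side in reverse
                    stack.push(rhs[i].charAt(0));
            }
        }
        return pos == input.length();
    }

    public static void main(String[] args) {
        System.out.println(parse("(())()")); // true
        System.out.println(parse("(()"));    // false
    }
}
```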

So how do we compute the parse table? You have to construct the FIRST and FOLLOW sets. The input to these algorithms is a CFG, and they aren't too complicated.

The important thing to know is that the parsing algorithm utilizes a table that's dynamically generated from the output of the first and follow sets, which can be programmatically generated from a grammar specification. No more hard-coding your grammar rules! And no more having to write code to support more language syntax! This gives rise to what we call the parser generator. This is the sane way to create a complex parser.

Types of Bottom-Up Parsers

One form of bottom-up parsing is called shift-reduce parsing. This method of parsing uses a table of actions and a stack of symbols. The Wikipedia entry has an easy-to-understand example. It's a relatively simple algorithm that only supports a small class of grammars. The major disadvantage of basic shift-reduce parsers is that they don't support associativity or precedence rules, nor do they handle ambiguous grammars well. They are used with operator grammars and are commonly used to parse mathematical expressions. Operator-precedence parsing is a bottom-up shift-reduce method that does support defining precedence rules for grammars; the Perl 6 and GCC parsers use some form of operator-precedence parsing.

LR Parsing

LR parsers are a class of parsers that scan the input from left to right but construct a rightmost derivation (in reverse). They are bottom-up parsers because they construct a rightmost derivation instead of a leftmost one like the top-down parsers. LR(k) parsers require k tokens of lookahead. They are deterministic.

LR parsers have some advantages over their counterparts:

  • They can handle virtually all context-free grammars (including left-recursive ones), giving them the ability to handle many more languages than LL parsers.
  • They have better error reporting than backtracking parsers.


LR parsers consist of a stack of symbols, an input buffer, and a parser driver, but they also have a few more components: the parse table for an LR parser is filled with states, and there are separate action and goto tables. I said I wouldn't go too deeply into the details of bottom-up parsers, because it would be very time consuming, so I'll just give a high-level overview.

There are three general types of LR parsers: SLR parsers, Canonical LR parsers, and LALR parsers. SLR parsers are the simplest to implement, but fail to produce tables for certain grammars. Canonical LR parsers are the most powerful of the three but are also the hardest to implement. They cover the widest range of grammars. Lastly, LALR parsers cover the middle ground: they are between SLR and Canonical parsers in complexity and coverage but can handle grammars for most programming languages.

GLR parsers are an extension of LR parsers that use a form of backtracking to handle ambiguous grammars. They work by forking sub-parsers at any point in the parse where they can't decide which transition to take. If any match in the sub-path fails, that sub-parse is thrown away and the next sub-parse is tried. If several sub-parses pass, the parser accepts both. In this sense, backtracking takes on a more general meaning: building up candidates as possible solutions and discarding them as soon as they fail.

It's worth mentioning that there are backtracking LR parsers as well.

Backtracking parsers go through greater trouble to give the exact location of a syntax error, because they have to try many candidates before they're absolutely sure the string isn't valid. The parser may have to record each possible syntax error and include logic to figure out the correct one upon failure. When a deterministic parser fails to match a token, the parse as a whole fails right there, so the location of the failure is known immediately.

Unlike the table-based top-down parser, an LR parser uses a table of states rather than productions. The LR parsing algorithm jumps from state to state. It's essentially a deterministic finite automaton. The algorithm is fairly simple, but constructing the parse tables is not.

To construct an LR parse table you have to turn the grammar into a DFA; from the DFA, you construct the parse table. The DFA recognizes viable prefixes of the grammar, which allows the parser to determine what actions it needs to take without having to scan down the stack to make decisions. This is the "power" of LR parsers. To create the DFA you perform the canonical sets-of-items construction, which depends on three other algorithms: augmenting the grammar, closure, and goto. The input to these algorithms is a CFG. The algorithms are somewhat complicated; they are certainly harder to understand than those of a table-based predictive parser. Here's an overview of LR parse tables.

LR parsers are hard to implement by hand because the sets-of-items construction will result in LR tables with several thousand states. The parse table for a typical programming language will have 20,000 entries; that's 20 KB! It gets messy quickly, even for very small grammars, so parser generators are normally used to create the parse tables. Generators are implementations of the aforementioned table-construction algorithms, which of course can do things much more efficiently than humans.

Saving space is particularly important in constructing parse tables because many table entries (array elements) will be empty. One might use linked lists instead. There are also many duplicate rows that can be eliminated with the use of pointers, where you could compress many rows into one long row whose elements contain the index of the duplicate row.

Parser Generators

Parser generators are fantastic, and they're easy to use! You just feed them a specification file with a bunch of CFG rules, and it spits out a parser in your language of choice. Usually you can embed code into the specification to specify how you want the parser to build your AST. There are also ways to specify the precedence and associativity rules. And furthermore, they can usually handle ambiguous grammars.

Yacc was one of the earliest parser generators and was the default generator on UNIX systems. Two popular generators based on Yacc are GNU Bison and CUP; both produce LALR parsers. GNU Bison is probably the most widely used Yacc descendant today. It can produce parsers in C, C++, and Java, and it has been used to create YARV (the Ruby interpreter), the Zend parser (the PHP scripting engine), the Bash shell, and (initially) GCC. ANTLR is another generator, possibly the most popular of them all; unlike the others, it produces top-down (LL) parsers. The ANTLR package has some really nice tools that make easing into compiler construction a lot more comfortable.

CUP, formerly known as JCUP, is a parser generator for Java. I found it particularly easy to use, and I'd recommend it to anyone looking to build a parser or compiler in Java. I had planned on showing a few example CFGs for a simple language, but I ended up writing a post that goes somewhat in depth on how to do so using CUP. All the CFGs I wrote above can be translated into a CUP spec with minimal effort.
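For a flavor of what a CUP specification looks like, here's a minimal sketch of a grammar for arithmetic expressions. The terminal names are illustrative, and the matching JFlex lexer is omitted:

```
import java_cup.runtime.*;

terminal PLUS, TIMES, LPAREN, RPAREN;
terminal Integer NUMBER;
non terminal Integer expr;

precedence left PLUS;
precedence left TIMES;   /* declared later, so TIMES binds tighter */

expr ::= expr:l PLUS expr:r    {: RESULT = l + r; :}
       | expr:l TIMES expr:r   {: RESULT = l * r; :}
       | LPAREN expr:e RPAREN  {: RESULT = e; :}
       | NUMBER:n              {: RESULT = n; :}
       ;
```

The embedded {: ... :} actions are plain Java, and the precedence declarations follow the Yacc convention (later declarations bind tighter), which is what resolves this grammar's ambiguity.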

Bison and CUP require a lexer to supply tokens to the generated parser, usually Flex and JFlex, respectively. I covered how to use lexer generators in part I. The JFlex distribution ships with a few really good examples of generating lexers and parsers.

H. Conclusion

Luckily, we don't have to concern ourselves at all with how parsers are built. We focus only on specifying the CFG for our language and using a generator to create our parser. Creating an AST is an elegant way to represent the structure of source code that leaves the compiler open to optimization and code generation. To recap, we've covered the following important points:

  • A derivation proceeds in the same order as the execution of the program instructions.
  • During a derivation, each time we expand a non-terminal, we can create a tree node for it and add it as a child of the node for the previously expanded non-terminal; doing so builds up a tree of nodes representing the sentence structure.
  • A preorder tree traversal on our AST visits nodes in the same order as the derivation, so logically it follows that the traversal follows the order in which instructions are executed.
  • A table-based parsing algorithm uses a table generated from the first and follow sets, which can themselves be computed programmatically from a grammar specification.
  • Parser generators are implementations of the table-construction algorithms.


If you're interested, I found this awesome set of lecture notes for the dragon book. In part III, we will take a short look at the wonderful world of syntax-directed translation.

Sources

Alfred V. Aho, Jeffrey D. Ullman. Principles of Compiler Design. Reading: Addison-Wesley, 1977.

Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 10 Aug. 2004.


This post has been edited by blackcompe: 20 March 2012 - 03:41 PM



Replies To: An Introduction to Compiler Design - Part II - Parsing

#2 ishkabible

  • spelling expret
  • member icon




Reputation: 1622
  • View blog
  • Posts: 5,709
  • Joined: 03-August 09

Re: An Introduction to Compiler Design - Part II - Parsing

Posted 01 March 2012 - 04:18 PM

Overall it was really good! I do have a few criticisms:

  • there are many more things a parser can produce than just ASTs or translations; they are, however, by far the most common.
  • ASTs are not IRs. IRs are languages unto themselves. Frequently ASTs are used to get to an IR, but sometimes you go directly to an IR.
  • in your issues with top-down parsing, it seems worth mentioning that ambiguity causes all of those except for left-recursion (which can be eliminated via various methods while still keeping a top-down parser, without modifying the grammar). Ambiguity affects bottom-up just as much as top-down; the only real difference between top-down and bottom-up is the way the tree (or other output medium) is constructed. Top-down does imply a certain difficulty with left-recursion, however.
  • where does the idea that backtracking is unique to top-down parsing come from? Backtracking is a method of dealing with grammars that require more lookahead than is being used, and both kinds of parser can use it or not. For instance, check out backtracking LR parsers; I'm pretty sure they can handle all CFGs, unlike vanilla LR parsers (i.e., they're equivalent to GLR parsers).
  • you say "They can give information as to where syntax errors occur." in reference to bottom-up parsers. Top-down parsers can do that just fine.
  • "There are three types of LR parsers: SLR parsers, Canonical LR parsers, and LALR parsers" is incomplete; there are more than that. GLR and backtracking LR, just off the top of my head.

This post has been edited by ishkabible: 01 March 2012 - 04:25 PM


#3 blackcompe

Re: An Introduction to Compiler Design - Part II - Parsing

Posted 01 March 2012 - 07:37 PM

Thanks for reading it. I value your input.

Quote

ASTs are not IRs. IRs are languages unto themselves. frequently ASTs are used to get to an IR but sometimes you go directly to an IR


By definition an IR is not necessarily a language. This and this (pg. 2) seem to agree.

Your other points are valid, and I'll be sure to integrate them into the tutorial.

This post has been edited by blackcompe: 02 March 2012 - 04:07 PM


#4 ishkabible

Re: An Introduction to Compiler Design - Part II - Parsing

Posted 01 March 2012 - 08:27 PM

Thinking about it a bit further, I would agree. Cmm, Haskell Core, and STG-code are used in GHC but are never handled as actual text; instead, ASTs of those languages are used. So I'd say you're right about that; I underthought that one.

#5 blackcompe

Re: An Introduction to Compiler Design - Part II - Parsing

Posted 02 March 2012 - 04:08 PM

Updated
