6 Replies - 1004 Views - Last Post: 02 June 2016 - 04:45 PM Rate Topic: -----

#1 nullCompiler  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 2
  • Joined: 31-May 16

Let's Build A Compiler C++ Question

Posted 01 June 2016 - 02:01 PM

Hi,

I'm trying to do Jack Crenshaw's Let's Build A Compiler tutorial and am stuck on the design part. In the first part:Introduction there are multiple functions that call getChar() for example:
{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;


And
{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;


And
{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;


And
{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;


Later on he actually starts rewriting it so it's OOP:
{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char;              	{ Lookahead character }
procedure GetChar;            { Read new character  }

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
	Read(Look);
end;

{--------------------------------------------------------------}
{ Unit Initialization }
begin
	GetChar;
end.
{--------------------------------------------------------------}


However It later uses the class in a global way:
{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
	if Look = x then GetChar
	else Expected('''' + x + '''');
end;


And
{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
	if not IsAlpha(Look) then Expected('Name');
	GetName := UpCase(Look);
	GetChar;
end;


And
{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;


Anyways here is what i have so far: Link. Any assistance in this matter would be greatly appreciated.

Sincerely,

NullCompiler

Is This A Good Question/Topic? 0
  • +

Replies To: Let's Build A Compiler C++ Question

#2 macosxnerd101  Icon User is offline

  • Games, Graphs, and Auctions
  • member icon




Reputation: 12149
  • View blog
  • Posts: 45,169
  • Joined: 27-December 08

Re: Let's Build A Compiler C++ Question

Posted 01 June 2016 - 02:52 PM

Welcome to Dream.in.Code! :)

I'm not sure what your question is. We are happy to help with specific questions!

Also, please post your code here using code tags: :code:, rather than on Dropbox. Most folks here don't download ZIP files, particularly those hosted elsewhere.
Was This Post Helpful? 0
  • +
  • -

#3 nullCompiler  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 2
  • Joined: 31-May 16

Re: Let's Build A Compiler C++ Question

Posted 01 June 2016 - 03:54 PM

View Postmacosxnerd101, on 01 June 2016 - 02:52 PM, said:

Welcome to Dream.in.Code! :)/>

I'm not sure what your question is. We are happy to help with specific questions!

Also, please post your code here using code tags: :code:/>, rather than on Dropbox. Most folks here don't download ZIP files, particularly those hosted elsewhere.

Thanks for the welcome, sorry for posting the link to dropbox i assumed you didn't want a million
 tags in a single post so i was trying to be respective to everyone and not spam.

CError.h
#ifndef CERROR_H
#define CERROR_H

// include required libraries
#include <iostream>
#include <string>

class CError
{
private:

public:
	CError();
	void Error(std::string s);
	void Expected(std::string s);
	void Abort(std::string s);
};
#endif



CError.cpp
#include "CError.h"

/*
* class initalizer
*/
CError::CError() {}

/*
 * Report an Error
 */
void CError::Error(std::string s)
{
	std::cout << "Expected: " << s << ".";
}

/*
* Report Error and Halt
*/
void CError::Abort(std::string s)
{
	Error(s);
	exit(-1);
}

/*
 * Report What Was Expected
 */
void CError::Expected(std::string s)
{
	Abort(s + "Expected");
}



CInput.h
#ifndef CINPUT_H
#define CINPUT_H

// include required libraries
#include <iostream>

class CInput
{
private:
	char m_LookAhead;

public:
	CInput();
	void getCharacter();
};
#endif



CInput.cpp
#include "CInput.h"
/*
 * class initalizer
 */
CInput::CInput()
{
   // initalize class members

}

/*
 * Read New Character From Input Stream
 */
void CInput::getCharacter()
{
	// read a character
	std::cin >> m_LookAhead;
}



COutput.h
#ifndef COUTPUT_H
#define COUTPUT_H

// include required libraries
#include <iostream>
#include <string>

class COutput
{
private:
	const std::string TAB = "   ";

public:
	COutput();
	void emit(std::string s);
	void emitLine(std::string s);
};
#endif



COutput.cpp
#include "COutput.h"

/*
 * class initalizer
 */
COutput::COutput()
{
	// initalize class members

}

/*
 *  Emit an instruction
 */
void COutput::emit(std::string s)
{
	std::cout << TAB << s;
}
/*
 * Emit an instruction line
 */
void COutput::emitLine(std::string s)
{
	this->emit(s);
	std::cout << "\n";
}



CScanner.h
#ifndef CSCANNER_H
#define CSCANNER_H

// include required libraries
#include "CError.h"

class CScanner
{
private:

public:
	CScanner();
	bool isAlpha(char c);
	bool isDigit(char c);

	void Match(char l, char c);
	char getName(char c);
	char getNumber(char c);
};
#endif



CScanner.cpp
#include "CScanner.h"

/*
* class initalizer
*/
CScanner::CScanner() {}

/*
* returns if character is alpha
*/
bool CScanner::isAlpha(char c)
{
	return (c > 64 && c < 91);
}

/*
 * returns if character is numeric
 */
bool CScanner::isDigit(char c)
{
	return (c > 47 && c < 58);
}

/*
 * Match One Character
 */
void CScanner::Match(char l, char c)
{
	if (l == c)
	{
		// not reachable
		GetChar();
	}
	else
	{
		// local instance of error class
		CError g_error;

		// output what was repected
		g_error.Expected("" + c);
	}
}

/*
 * Get an Identifier
 */
char CScanner::getName(char c)
{
	if (!isAlpha(c))
	{
		// local instance of error class
		CError g_error;

		// output what is expected
		g_error.Expected("Name");
	}
	// not reachable
	GetChar()

	return toupper(c);
}

/*
* Get a Number
*/
char CScanner::getNumber(char c)
{
	if (!isDigit(c))
	{
		// local instance of error class
		CError g_error;

		// output what was expected
		g_error.Expected("Integer");
	}
	// not reachable
	GetChar()

	return c;
}



Main.h
#ifndef MAIN_H
#define MAIN_H

#endif



Main.cpp
int main(int argc, const char* argv[])
{
	// return success
	return 0;
}


As you can see if i translate the code from pascal to C++ exactly there are 3 functions in Scanner(match(), getName() & getNumber()) that need access to CInput->getChar(). I could try to make an instance of CInput but i believe main also needs access to it so that would essentially 2 different instances to CInput->m_LookAhead. I've been trying to figure out the best way to do this design wise but have yet to figure it out. I've only read the first part so i imagine there will be other functions that need access to CInput->getChar() as well. Any assistance in this matter would be greatly appreciated.

Sincerely,

nullCompiler
Was This Post Helpful? 0
  • +
  • -

#4 horace  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 768
  • View blog
  • Posts: 3,832
  • Joined: 25-October 06

Re: Let's Build A Compiler C++ Question

Posted 01 June 2016 - 10:15 PM

you require a system specification so that you can determine in detail what each method does - I assume can you obtain this from the documentation of the Build A Compiler tutorial
I would recommend concentrating on one class at at time implemening the code and testing it with a main(), e.g. start with CInput(), them move on to other classes
Was This Post Helpful? 0
  • +
  • -

#5 sepp2k  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 2511
  • View blog
  • Posts: 3,988
  • Joined: 21-June 11

Re: Let's Build A Compiler C++ Question

Posted 02 June 2016 - 04:28 AM

View PostnullCompiler, on 01 June 2016 - 11:01 PM, said:

Later on he actually starts rewriting it so it's OOP:


Pascal's interface and implementation have nothing to do with OOP. It's Pascal's equivalent of separating your code into header- and code-files (except it's in the same file).
Was This Post Helpful? 0
  • +
  • -

#6 #define  Icon User is offline

  • Duke of Err
  • member icon

Reputation: 1852
  • View blog
  • Posts: 6,661
  • Joined: 19-February 09

Re: Let's Build A Compiler C++ Question

Posted 02 June 2016 - 02:45 PM

Hi, you could have a reference variable in the scanner which would refer to an object in main, or a global instance. Once you have completed the pascal design translation you could look at improving or tweaking the C++ code, if necessary.
Was This Post Helpful? 0
  • +
  • -

#7 dday9  Icon User is offline

  • D.I.C Regular

Reputation: 94
  • View blog
  • Posts: 495
  • Joined: 17-April 13

Re: Let's Build A Compiler C++ Question

Posted 02 June 2016 - 04:45 PM

Hi nullCompiler,

I will say this upfront that I didn't read your code and to be honest, my C++ skills are subpar(if I'm exaggerating). However, I have been studying compiler creation for some time now and I'd be happy to break down what I know in my best pseudo-code.

You can break a compiler into two phases:
  • front-end
  • back-end

The front-end is typically consisted of three "mini" phases:
  • Lexical analysis, aka scanning
  • Syntax analysis, aka parsing
  • Semantic analysis


During the lexical analysis phase, an object oriented approach would be to create a "token" class that would contain two properties:
  • Category
  • Value


The "category" property could be an enumeration if you don't have too many different token categories and if your language will remain largely static(JSON comes to mind), however in most instances you should probably make the data type a String(or char collection in C++). The "value" property is a String and will hold the literal value, aka lexeme, of the token. The easiest way to describe a token would be to give you an example, assume that I have the following source code:
int foo = 2;

Then the tokens that make up the source code would be:
Category   | Lexeme
-------------------
keyword    | int
identifier | foo
assignment | =
number     | 2


So in order to create a lexical analyzer, what I typically do is create a function to match each category, loop through the entire source code, and during the loop check if one of the match functions returns a value. If so then I create a new token object and add it to a collection and skip the iteration by the length of the returned lexeme, otherwise if nothing was returned then an exception occurred. The following is a pseudo-code representation:
Token[] Scan(string source) {
    token[] tokens;
    string lexeme = "";

    for(x = 0;x < source.length;x = x) {
        lexeme = (From category As MatchFunction In categories Where category.Match(source.substring(x)) Select First category);
        if (lexeme != "") {
            tokens.Add(lexeme, MatchedFunction);
            x += lexeme.length;
        } else {
            throw exception("invalid character at position: " x.tostring());
        }
    }

    return tokens;
}


OK so terrible pseudo-code, but again my C++ ain't to good so my VB.Net/JQuery background is coming through. As far as the functions that match the lexemes, they themselves are relatively simple. You can either use RegEx to match more complex patterns or if you need to match literal values(like True, False, Null) then you could use simple comparison. Here is a pseudo-code example of both:
//literal
string MatchBoolean(string source) {
    string lexeme = "";

    if (source.length >= "false".length) {
        if (source.substring(0, "false".length) == "false") {
            lexeme = "false";
        }
    } elseif (source.length >= "true".length) {
        if (source.substring(0, "true".length) == "true") {
            lexeme = "true";
        } else if (source.substring(0, "null".length) == "null") {
            lexeme = "null";
        }
    }

    return lexeme;
}

//Regex
string MatchNumber(string source) {
    string lexeme = "";
    match numberMatch = Regex.Match("-?\d+(\.\d+)?([eE][+-]?\d+)?", source)

    if (numberMatch.success() && numberMatch.position == 0) {
        lexeme = numberMatch.Value;
    }

    return lexeme;
}


That basically sums up lexical analysis. The syntax analysis phase is a little more difficult and there are plenty of ways to skin the cat. I'm going to show you how to create a recursive descent parser. The basic idea of a recursive descent parser is that you have a set of functions that can(and often do) recursively call themselves. The important note is to keep in mind the end objective of a parser, and that is to transform the source code(or collection of tokens) into something meaningful all while verifying that the code is syntactically correct(eg no grammar errors). Typically this "something meaningful" is a parse tree or abstract syntax tree, but really it's up for you to decide how you want to represent the data.

A parse tree will be a build up of your tokens in a manner that reflects some type of predefined grammar. I highly suggest that you start off with a grammar set in place. It doesn't need to be perfect because you can fix it as you go, but you should always use something! The grammar notation most often used in parsers is EBNF. Basically you have a set of rules(eg nonterminals) and values(eg terminals). I can't speak for your syntax but JSON's EBNF looks like this:
value ::= object | array | string | number | Boolean
object ::= '{', [string, ':', value], '}'
array ::= '[', [value], ']'
string ::= '"([^"\\]|\\["\\\/bfnrt])*"'(* regex *)
number ::= '-?\d+(\.\d+)?([eE][+-]?\d+)?' (* regex *)
Boolean ::= 'true' | 'false' | 'null' (* case insensitive *)


Using this grammar you can construct functions to match:
  • Boolean terminals
  • Number terminals
  • String terminals
  • Array nonterminals
  • Object nonterminals
  • Value nonterminals


Since we've used Regex in the lexical analysis phase, we really don't need to have a function for Boolean, number, and string... however you could.

Basically the syntax follows, if there is a sequence(like in the array nonterminal) you would create a chain of conditional statements working your way down:
if (currentToken.value != "[") {
    throw exception("expected a [ but instead saw a " + currentToken.value);
} else {
    //consume token, build parse tree
}

if (parseValue(remainingTokens)) {
    //consume token(s), build parse tree
}

if (currentToken.value != "]") {
    throw exception("expected a ] but instead saw a " + currentToken.value);
} else {
    //consume token, build parse tree
}


Notice how I didn't throw an exception if parseValue returned a false value? This is because according to our grammar, a value is optional.

If there is alternation(like there is in value nonterminal) then use a nested conditional statement:
if (currentToken.category != "string") {
    if (currentToken.category != "number") {
        if (currentToken.category != "string") {
            if (currentToken.category != "boolean") {
                if !(parseObject(remainingTokens)) {
                    if !(parseArray(remainingTokens)) {
                        throw exception("expected a value but saw a " + currentToken.value);
                    else {
                        //consume token, build parse tree
                    }
                else {
                    //consume token, build parse tree
                }
            else {
                //consume token, build parse tree
            }
        } else {
            //consume token, build parse tree
        }
    } else {
        //consume token, build parse tree
    }
} else {
    //consume token, build parse tree
}


What you'll notice is that when you step through your code, the functions start calling themselves recursively. Sometimes they'll return a value... other times they won't. It is up to you to decide how to handle them.

In regards to the semantic analysis... I'm not sure. I haven't gotten that far yet.

However, what my suggestion would be to do is to try parsing a smaller language that you know. For me it was JSON, the syntax is incredibly small and it is also static(I doubt it will ever change!). For some, it could be brainfuck, I know I don't understand the language ;) but regardless... start off with a simple language you know before trying to create your own language.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1