Subscribe to MentalFloss Minutes        RSS Feed
-----

Tokenizer : Polynomials

Icon Leave Comment
Recall last post: [link]

poly -> term | term + poly | term - poly
term -> coef | coef x | coef x ^ DIGIT DIGIT*
coef -> INT | REAL
INT -> - DIGIT DIGIT* | DIGIT DIGIT*
REAL -> - DIGIT DIGIT* . DIGIT DIGIT* | DIGIT DIGIT* . DIGIT DIGIT*
DIGIT-> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9



Let's think about tokens.

+ = PLUS
- = MINUS
^ = POWER
. = DOT
x = X
INT
REAL
EOF = end of file



The distinction is important on INT and REAL because we want integer exponents.

Let's consider some examples:

3x^4 = INT X POWER INT EOF
2.5x^12 = REAL X POWER INT EOF
8x + 5 = INT X PLUS INT EOF
-2x^5 - 14 = MINUS INT X POWER INT MINUS INT EOF



You get the idea.

We'll have an object which can return the next token in the stream provided.

An example run might be:

$ poly.exe 
9x^4 + 14.5 



Given input is "9x^4 + 14.5"

get_token() reads 9 and token is INT. Then reads x, and there is no matching token, so put x back, and return INT.
get_token() reads x and token is X. Then reads ^, and there is no matching token, so put ^ back, and return X.
get_token() reads ^ and token is POWER. Then reads 4, and there is no matching token, so put 4 back, and return POWER.
get_token() reads 4 and token is INT. Then reads +, and there is no matching token, so put + back and return INT.
get_token() reads + and token is PLUS. Then reads 1, no match, put back, return PLUS.
get_token() reads 1 and token is INT. Then reads 4 and token is INT. Then reads DOT and token is REAL. Then reads 5 and token is REAL. Then reads EOF, and no match so put EOF back and return REAL with value (lexeme) 14.5.
get_token() reads EOF, and returns EOF.
get_token() reads nothing. returns EOF.

Final stream: INT X POWER INT PLUS REAL EOF



Seems to work fine.

So, the idea at play here is finding the longest matching substring. Our concern is with this portion of the grammar:

INT -> - DIGIT DIGIT* | DIGIT DIGIT*
REAL -> - DIGIT DIGIT* . DIGIT DIGIT* | DIGIT DIGIT* . DIGIT DIGIT*
DIGIT-> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9



I may not have mentioned this before, but the * is kleene-star, which is matching zero or more occurrences. For more information: https://en.wikipedia...iki/Kleene_star

We get append next character to answer until next is not DIGIT or DOT. If DOT was encountered, token is REAL, and append next character until next is not DIGIT. return (token_type, lexeme).

Alright. It's time to move into some code:

class Input
{
	private: 
	string buffer;
	
	public:
	Input(string data)
	{
		cout << "data=" << data << endl;
		for(char c : data)
		{
			if(c != ' ')
				buffer += c;
		}
	}
	
	char pop()
	{
		if(buffer.length() == 0)
			return '\0';
	
		char c = buffer[0];
		buffer = buffer.substr(1);
		return c;
	}
	
	void push(char c)
	{
		buffer = c + buffer;
	}
	
	string get_buffer()
	{
		return buffer;
	}
};

int main()
{
	string line;
	getline(cin, line);
	Input input(line);
	cout << "input=" << input.get_buffer() << endl;
	while(input.get_buffer() != "")
	{
		cout << input.pop() << endl;
	}

	return 0;
}



Here's a sample run:

17x^2 + 12x - 17
data=17x^2 + 12x - 17
input=17x^2+12x-17
1
7
x
^
2
+
1
2
x
-
1
7



So, now that input is handled, we can start to tokenize. We'll apply the logic to a Lexer class now:

#include <iostream>
#include <cctype>

using namespace std;

enum TokenType { NONE, PLUS, MINUS, POWER, X, INT, REAL, END_OF_FILE };

struct Token
{
	TokenType type;
	string lexeme;
};

class Input
{
	private: 
	string buffer;
	
	public:
	Input(string data)
	{
		cout << "data=" << data << endl;
		for(char c : data)
		{
			if(c != ' ')
				buffer += c;
		}
	}
	
	char pop()
	{
		if(buffer.length() == 0)
			return '\0';
	
		char c = buffer[0];
		buffer = buffer.substr(1);
		return c;
	}
	
	void push(char c)
	{
		buffer = c + buffer;
	}
	
	string get_buffer()
	{
		return buffer;
	}
};

class Lexer
{
	private:
	Input *input;
	
	public:
	Lexer(string data)
	{
		input = new Input(data);
	}
	
	Token get_token()
	{
		Token token;
		token.type = NONE;
		token.lexeme = "";
		
		char c = input->pop();
		switch(c)
		{
			case '+':
				token.type = PLUS;
				token.lexeme = "+";
				break;
			case '-':
				token.type = MINUS;
				token.lexeme = '-';
				break;
			case '^':
				token.type = POWER;
				token.lexeme = '^';
				break;
			case 'x':
				token.type = X;
				token.lexeme = "x";
				break;
			case '\0':
				token.type = END_OF_FILE;
				token.lexeme = "";
				break;
			default:
				if(isdigit(c) || c == '.')
				{
					token.type = INT;
					token.lexeme += c;
					bool has_dot = (c == '.');
					token.type = has_dot ? REAL : INT;
					
					while(isdigit(c) || c == '.')
					{
						c = input->pop();
						if(c == '.' && !has_dot)
						{
							token.type = REAL;
							token.lexeme += c;
							has_dot = true;

						}
						else if(c == '.')
						{
							cout << "ERROR" << endl;
							exit(1);
						}
						else
						{
							if(isdigit(c))
								token.lexeme += c;
						}
					}
					
					input->push(c);
				}
				else
				{
					cout << "ERROR" << endl;
					exit(1);
				}
				break;
		}
		return token;
	}
	
};

int main()
{
	string names[] = { "NONE", "PLUS", "MINUS", "POWER", "X", "INT", "REAL", "EOF" };
	string line;
	getline(cin, line);
	Lexer *lex = new Lexer(line);
	Token t;
	while((t = lex->get_token()).type != END_OF_FILE)
		cout << names[t.type] << ", " << t.lexeme << endl;


	return 0;
}



Here are some sample outputs:

data=13x^5 + 17x^2 - 0.5
INT, 13
X, x
POWER, ^
INT, 5
PLUS, +
INT, 17
X, x
POWER, ^
INT, 2
MINUS, -
REAL, 0.5



data=-7x ^ 3+ -4x^2 + 2
MINUS, -
INT, 7
X, x
POWER, ^
INT, 3
PLUS, +
MINUS, -
INT, 4
X, x
POWER, ^
INT, 2
PLUS, +
INT, 2



data=3.14.15
ERROR



data=12y + 3
INT, 12
ERROR



So, that seems to work. We'll move onto parsing next.

0 Comments On This Entry

 

December 2018

S M T W T F S
      1
2345678
9101112131415
1617 18 19202122
23242526272829
3031     

Tags

    Recent Entries

    Recent Comments

    Search My Blog

    0 user(s) viewing

    0 Guests
    0 member(s)
    0 anonymous member(s)