## Tokenizer: Polynomials

```
poly  -> term | term + poly | term - poly
term  -> coef | coef x | coef x ^ DIGIT DIGIT*
coef  -> INT | REAL
INT   -> - DIGIT DIGIT* | DIGIT DIGIT*
REAL  -> - DIGIT DIGIT* . DIGIT DIGIT* | DIGIT DIGIT* . DIGIT DIGIT*
DIGIT -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

```
+ = PLUS
- = MINUS
^ = POWER
. = DOT
x = X
INT
REAL
EOF = end of file
```

The distinction between INT and REAL matters because we want exponents to be integers.

Let's consider some examples:

```
3x^4       = INT X POWER INT EOF
2.5x^12    = REAL X POWER INT EOF
8x + 5     = INT X PLUS INT EOF
-2x^5 - 14 = MINUS INT X POWER INT MINUS INT EOF
```

You get the idea.

We'll have an object which can return the next token in the stream provided.

An example run might be:

```
$ poly.exe
9x^4 + 14.5
```

Given the input "9x^4 + 14.5":

```
get_token() reads 9 and token is INT. Then reads x, and there is no matching token, so put x back, and return INT.
get_token() reads x and token is X. Then reads ^, and there is no matching token, so put ^ back, and return X.
get_token() reads ^ and token is POWER. Then reads 4, and there is no matching token, so put 4 back, and return POWER.
get_token() reads 4 and token is INT. Then reads +, and there is no matching token, so put + back, and return INT.
get_token() reads + and token is PLUS. Then reads 1, no match, put back, return PLUS.
get_token() reads 1 and token is INT. Then reads 4 and token is still INT. Then reads . and token becomes REAL. Then reads 5 and token is still REAL. Then reads EOF, no match, so put EOF back and return REAL with value (lexeme) 14.5.
get_token() reads EOF, and returns EOF.

Final stream: INT X POWER INT PLUS REAL EOF
```

Seems to work fine.

So, the idea at play here is finding the longest matching substring. Our concern is with this portion of the grammar:

```
INT   -> - DIGIT DIGIT* | DIGIT DIGIT*
REAL  -> - DIGIT DIGIT* . DIGIT DIGIT* | DIGIT DIGIT* . DIGIT DIGIT*
DIGIT -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

I may not have mentioned this before, but the * is the Kleene star, which matches zero or more occurrences. For more information: https://en.wikipedia...iki/Kleene_star

We append the next character to the lexeme until the next character is neither a DIGIT nor a DOT. If a DOT was encountered along the way, the token is REAL, and we keep appending until the next character is not a DIGIT. Then we return (token_type, lexeme).

Alright. It's time to move into some code:

```
#include <iostream>
#include <string>

using namespace std;

class Input
{
private:
    string buffer;

public:
    Input(string data)
    {
        cout << "data=" << data << endl;
        for(char c : data)
        {
            if(c != ' ')   // strip spaces up front so the lexer never sees them
                buffer += c;
        }
    }

    char pop()
    {
        if(buffer.length() == 0)
            return '\0';

        char c = buffer[0];
        buffer = buffer.substr(1);
        return c;
    }

    void push(char c)
    {
        buffer = c + buffer;
    }

    string get_buffer()
    {
        return buffer;
    }
};

int main()
{
    string line;
    getline(cin, line);
    Input input(line);
    cout << "input=" << input.get_buffer() << endl;
    while(input.get_buffer() != "")
    {
        cout << input.pop() << endl;
    }

    return 0;
}
```

Here's a sample run:

```
17x^2 + 12x - 17
data=17x^2 + 12x - 17
input=17x^2+12x-17
1
7
x
^
2
+
1
2
x
-
1
7
```

So, now that input is handled, we can start to tokenize. We'll apply the logic to a Lexer class now:

```
#include <iostream>
#include <cctype>
#include <cstdlib>
#include <string>

using namespace std;

enum TokenType { NONE, PLUS, MINUS, POWER, X, INT, REAL, END_OF_FILE };

struct Token
{
    TokenType type;
    string lexeme;
};

class Input
{
private:
    string buffer;

public:
    Input(string data)
    {
        cout << "data=" << data << endl;
        for(char c : data)
        {
            if(c != ' ')
                buffer += c;
        }
    }

    char pop()
    {
        if(buffer.length() == 0)
            return '\0';

        char c = buffer[0];
        buffer = buffer.substr(1);
        return c;
    }

    void push(char c)
    {
        buffer = c + buffer;
    }

    string get_buffer()
    {
        return buffer;
    }
};

class Lexer
{
private:
    Input *input;

public:
    Lexer(string data)
    {
        input = new Input(data);
    }

    Token get_token()
    {
        Token token;
        token.type = NONE;
        token.lexeme = "";

        char c = input->pop();
        switch(c)
        {
        case '+':
            token.type = PLUS;
            token.lexeme = "+";
            break;
        case '-':
            token.type = MINUS;
            token.lexeme = "-";
            break;
        case '^':
            token.type = POWER;
            token.lexeme = "^";
            break;
        case 'x':
            token.type = X;
            token.lexeme = "x";
            break;
        case '\0':
            token.type = END_OF_FILE;
            token.lexeme = "";
            break;
        default:
            if(isdigit(c) || c == '.')
            {
                token.lexeme += c;
                bool has_dot = (c == '.');
                token.type = has_dot ? REAL : INT;

                while(isdigit(c) || c == '.')
                {
                    c = input->pop();
                    if(c == '.' && !has_dot)
                    {
                        token.type = REAL;
                        token.lexeme += c;
                        has_dot = true;
                    }
                    else if(c == '.')
                    {
                        cout << "ERROR" << endl;
                        exit(1);
                    }
                    else if(isdigit(c))
                    {
                        token.lexeme += c;
                    }
                }

                if(c != '\0')      // put the one-character lookahead back
                    input->push(c);
            }
            else
            {
                cout << "ERROR" << endl;
                exit(1);
            }
            break;
        }

        return token;
    }
};

int main()
{
    string names[] = { "NONE", "PLUS", "MINUS", "POWER", "X", "INT", "REAL", "EOF" };
    string line;
    getline(cin, line);
    Lexer *lex = new Lexer(line);
    Token t;
    while((t = lex->get_token()).type != END_OF_FILE)
        cout << names[t.type] << ", " << t.lexeme << endl;

    return 0;
}
```

Here are some sample outputs:

```
data=13x^5 + 17x^2 - 0.5
INT, 13
X, x
POWER, ^
INT, 5
PLUS, +
INT, 17
X, x
POWER, ^
INT, 2
MINUS, -
REAL, 0.5
```

```
data=-7x ^ 3+ -4x^2 + 2
MINUS, -
INT, 7
X, x
POWER, ^
INT, 3
PLUS, +
MINUS, -
INT, 4
X, x
POWER, ^
INT, 2
PLUS, +
INT, 2
```

```
data=3.14.15
ERROR
```

```
data=12y + 3
INT, 12
ERROR
```

So, that seems to work. We'll move onto parsing next.
