Page 1 of 1

String Tokenizing: how to tokenize a given string based on given delimiters using a class

#1 Anarion

Posted 11 December 2009 - 01:00 PM

Have you ever wanted to read data from a file and use it in your application? Surely the answer is yes: every programmer, at some point, needs to read data from files, whether they were created by the user or by the programmer, such as a configuration file.

Sometimes the data in a file has a special meaning, rather than being plain lines of text to display. Let's say we want to read a simple configuration file like this:
number    =    4;

As the programmer, I have a special meaning in mind for this line: there is a variable, and its value is going to be 4. How do we read this line and do the correct thing? The answer is that we have to split the line on certain characters, such as the space, = and ;.

In this tutorial, I'm going to walk you through creating a class which does the job for you. You just have to create an instance of this class in your program and give it the correct values, which I'm going to explain in full detail.

Things You Have To Know Already
How to use strings, and a basic idea of what a deque is, or at least a vector (their syntax is very similar). Of course, OOP is needed too.

The Theory Behind this Class
It's simple: start at the first character and skip characters until you reach a non-special one. From now on, we are going to call the special characters "delimiters". So we have reached the first non-delimiter character; what next? We read the string from this point until we reach the next delimiter. Let's bring back the example: number = 4;

Here the first character is not a delimiter, so no skipping is needed. We then read until we reach the space, which is one of the delimiters. Everything from the start of the string up to this point is a part we need to store somewhere. For simplicity and flexibility, I am going to use a std::deque to store these parts as we read through the string.

Now we continue reading and immediately reach =, which is one of the delimiters we defined. The part between the space and = is a zero-length string. Do we have to add it to the deque? The answer is a loud NOPE, so we have to check for zero-length strings before we add parts to the deque.
Next comes another space, which is also a delimiter; since it gives us another zero-length part, we skip that too.

The next part runs from just after that space up to the last character, ";", which is a delimiter too. The result is the string "4", so we add it to the deque.
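
The skip-then-read procedure described above maps fairly directly onto two std::string member functions, find_first_not_of and find_first_of. What follows is only a minimal sketch of the theory as a free function; the name split_tokens is made up for illustration and is not part of the class in this tutorial.

```cpp
#include <deque>
#include <string>

// Hypothetical helper (name chosen for illustration): splits `text` on any
// character found in `delims` and never stores zero-length parts.
std::deque<std::string> split_tokens(const std::string& text, const std::string& delims) {
    std::deque<std::string> parts;
    // Skip leading delimiters to find where the first part starts.
    std::size_t start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // Read until the next delimiter, or until the end of the string.
        std::size_t stop = text.find_first_of(delims, start);
        parts.push_back(text.substr(start, stop - start));
        if (stop == std::string::npos)
            break; // the last part ran to the end of the string
        // Skip the run of delimiters that follows this part.
        start = text.find_first_not_of(delims, stop);
    }
    return parts;
}
```

For the example line, split_tokens("number    =    4;", " =;") should give the two parts number and 4, because every run of delimiters is skipped in one step and so never produces a zero-length string.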

Implementing the Class
Now bear with me as we walk through writing the class:
#include <deque>
#include <string>

class tokenizer {
        std::string delimiters; //the string which holds our desired delimiters to separate the text based on
        std::deque<std::string> tokens; //the double-ended queue which holds all the parts of the parsed text

        bool isdelimiter(char c) const { //an inline private function which tells us if a given character is one of the delimiters
            return delimiters.find( c ) != std::string::npos; //find returns the position of c in "delimiters", or npos when there is no match,
            //so this is true when c is a delimiter and false otherwise
        }
public: //now we define these member functions as public so that they can be called from outside
        tokenizer(const std::string& is, const std::string& delim) : delimiters(delim) { //initialize the delimiters string with delim
            std::string tmp; //a temporary string used to hold the current part
            std::size_t p, end = is.length(); //p is the current index, end is one past the last index
            for(p=0; p<end; p++) { //loop through all the characters in the string
                if( isdelimiter(is.at(p)) ) { //if this character of string is a delimiter,
                    if(tmp.length() != 0) { //if the current part is not zero-length
                        tokens.push_back(tmp); //add this part to the deque
                        tmp = ""; //empty the string
                    }
                } else {
                    tmp += is.at(p); //the character was not a delimiter, so add it to the current part
                }
            }
            if(tmp.length() != 0) //the string may not end with a delimiter,
                tokens.push_back(tmp); //so the last part has to be added here
        }

        bool has_next() const { //used to check whether there are parts left in the deque
            return !tokens.empty(); //will return true if deque is NOT empty
        }

        std::string next_token() { //call has_next() first: the deque must not be empty here
            std::string token( tokens.front() ); //copy the first string located in deque to the token string
            tokens.pop_front(); //remove the first item in the deque which we just copied to token
            return token;
        }
};


Easy, huh? Let's put this definition in a header file (tokenizer.h) so that we can easily use it in other applications. Here's a sample application which uses the above class:
#include <iostream>
#include <string>
#include "tokenizer.h" //this is our tokenizer class header
int main() {
	std::string in("name    = Kian;");
	tokenizer words(in, " =;"); //the delimiters are space and = and ;
	while(words.has_next()) //loop till we run out of items in deque
		std::cout<<words.next_token()<<std::endl; //print the next item in deque
	return 0;
}
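
Going back to the configuration line from the introduction, the second part ("4") is still a string, so it needs converting before it can be used as a number. Here is a minimal sketch of one way to do that, assuming a C++11 compiler for std::stoi; the helper name token_to_int is hypothetical and not part of the tokenizer class.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a token such as "4" to an int.
// Returns false and leaves `out` untouched if the token is not entirely a number.
bool token_to_int(const std::string& token, int& out) {
    try {
        std::size_t used = 0; // how many characters std::stoi consumed
        int value = std::stoi(token, &used);
        if (used != token.size())
            return false; // trailing junk, e.g. "4x"
        out = value;
        return true;
    } catch (const std::invalid_argument&) {
        return false; // no digits at all, e.g. "Kian"
    } catch (const std::out_of_range&) {
        return false; // the number does not fit in an int
    }
}
```

In a program like the sample above, the first next_token() call would give the name of the setting, and the second could be passed to token_to_int when a numeric value is expected.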


There are still things you can do to extend this class as you need, like adding support for quotes (the spaces inside the quotes must be kept), or maybe support for "(" and ")". These are up to you; there are many things you can do to make it more flexible.

Hope this tutorial was of use to you.
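
As a starting point for the quote support mentioned above, here is one possible sketch, written as a free function so it stands on its own. The name split_quoted and the exact rules (double quotes only, the quote characters are dropped, no escape sequences) are assumptions made for illustration, not the only way to do it.

```cpp
#include <deque>
#include <string>

// Hypothetical helper: like the tokenizer's loop, but delimiters found
// between a pair of double quotes are kept as part of the token.
std::deque<std::string> split_quoted(const std::string& text, const std::string& delims) {
    std::deque<std::string> parts;
    std::string current;
    bool in_quotes = false; // true while between an opening and a closing "
    bool has_token = false; // lets an empty quoted string "" count as a part
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c == '"') {
            in_quotes = !in_quotes; // toggle quote mode, drop the quote itself
            has_token = true;
        } else if (!in_quotes && delims.find(c) != std::string::npos) {
            if (has_token) { // a delimiter outside quotes ends the current part
                parts.push_back(current);
                current.clear();
                has_token = false;
            }
        } else {
            current += c; // ordinary character, or a delimiter inside quotes
            has_token = true;
        }
    }
    if (has_token) // the text may end without a trailing delimiter
        parts.push_back(current);
    return parts;
}
```

With this sketch, the line name = "Kian the coder"; split on " =;" should give name and Kian the coder. A quote that is never closed simply swallows the rest of the line, which may or may not be what you want.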
