13 Replies - 2522 Views - Last Post: 19 July 2011 - 02:07 PM Rate Topic: -----

#1 doothedew  Icon User is offline

  • New D.I.C Head

Reputation: 1
  • View blog
  • Posts: 38
  • Joined: 09-July 11

Parsing Strings

Posted 18 July 2011 - 08:49 PM

I am parsing through HTML to build a web crawler and I have a tokenizer that returns a token that when I call the getValue method it gives me all th content of the tag in one string. I need to parse through this string and pull out any words that don't start with numbers (words that have "-" or "_" count as words). This is the code that I used but is not working. Is there a better easier way to do this? Someone suggested I use regex but I looked and the few examples that I looked at weren't explained clearly and seemed pretty difficult. Example of my output below my source code.
cout<<"//////////////////////////WORDSStart/////////////////////////////////"<<endl;
	string parserString = token.GetValue();
	ToLower(parserString);
	const char* pString =  parserString.c_str();
	char* copyString = new char[strlen(parserString.c_str()) + 1];
	strcpy(copyString, pString);
	cout<<"The original string: "<<copyString<<endl;
	strtok(copyString, " ! @ # $ % ^ & * ( ) : < > \\ / \? . ; | + = { } [ ] ` ~ \t \n \" \' ");
	while (copyString != NULL)
	{
		if(isalpha(copyString[0]))
		{
			cout<<" "<<copyString<<endl;
			words[wordCount]=copyString;
			wordCount++;
		}
		copyString = strtok (NULL, " ,.-");
	}
	cout<<"//////////////////////////WORDSEnd/////////////////////////////////"<<endl;



Program output:

The original string: how will the new salary cap shake out? can the lions' front four mesh? answers are in the mail. 
 how
 will
 the
 new
 salary
 cap
 shake
 out?
 can
 the
 lions'
 front
 four
 mesh?
 answers
 are
 in
 the
 mail

This post has been edited by doothedew: 18 July 2011 - 08:56 PM


Is This A Good Question/Topic? 0
  • +

Replies To: Parsing Strings

#2 ishkabible  Icon User is offline

  • spelling expret
  • member icon




Reputation: 1623
  • View blog
  • Posts: 5,714
  • Joined: 03-August 09

Re: Parsing Strings

Posted 18 July 2011 - 09:02 PM

regex isn't the key to parsing, you need a proper parsing method for this to work. the best option is recursive decent parsers, becuase the alternatives (LL and LR) are too unruly to implement to be practical for hand written code(although that isn't to say parser generators are not good options here). also recursive decent parsers let you bend the rules of the grammar a bit to handle ambiguous items.

look at this grammar i just made a simple markup language.

<- can be read, "has the form"
* is the kleen start and means something occurs zero or more times.
| means "or"

opentag <- '<' id attri* '>'
closetag <- '<' '/' id '>'
block <- opentag html closetag
html <- text | block
attri <- id '=' literal


this is a subset of html, that would allow you to easily parse things like XML and HTML(infact it think it may cover XML, not 100% sure though)

you can check out my tutorial on recursive decent parsing,here

this tutorial dose not include a lexer which would be benifical to your project however the ideas are the same. i also used expression parsing rather than a markup language but again the ideas are the same.

hand written parsers like the parsers for Lua are recursive decent parsers. IMO when it comes to string parsing recursive decent parsers are the most powerful method of hand written code. there are other options that are more powerful such as parser generators like Bison and Antler. there are also tools for lexical analysis to aide these parsers. Antler has it's own built in lexical analysis tools. for Bison it is common to use Felx but not completely necessary.

although i would like to see you implement this on your own there are full HTML parsers already out there as easy to use libraries.

This post has been edited by ishkabible: 18 July 2011 - 09:14 PM

Was This Post Helpful? 2
  • +
  • -

#3 Adak  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 331
  • View blog
  • Posts: 1,168
  • Joined: 01-April 11

Re: Parsing Strings

Posted 18 July 2011 - 09:04 PM

You said it in your post - if you want every word that begins with a non digit, then use !isdigit() to test that char.

This post has been edited by Adak: 18 July 2011 - 09:05 PM

Was This Post Helpful? 0
  • +
  • -

#4 doothedew  Icon User is offline

  • New D.I.C Head

Reputation: 1
  • View blog
  • Posts: 38
  • Joined: 09-July 11

Re: Parsing Strings

Posted 18 July 2011 - 09:52 PM

@ishkabible I don't want to touch my parser because I have built all my test cases around my html parser working the way it does. What I really want to do is figure out how to parse strings. I already get a line of text that is stored in a string and I don't understand how a recursive descent parse helps me to solve the string parsing issue.

@adak If you look at the output that I posted I am able to get words just fine, with the check to see if the first digit isalpha I do the same thing as the !isdigit. The problem, as you can see from the output, is that when I use the strtok method and try to deliminate by every symbol except the underscore and dash, I still get some of the symbols I don't want.

Maybe I should just replace those symbols with blank characters and then deliminate the string with strtok. Would that work?

This post has been edited by doothedew: 18 July 2011 - 09:53 PM

Was This Post Helpful? 0
  • +
  • -

#5 ishkabible  Icon User is offline

  • spelling expret
  • member icon




Reputation: 1623
  • View blog
  • Posts: 5,714
  • Joined: 03-August 09

Re: Parsing Strings

Posted 18 July 2011 - 09:56 PM

View Postdoothedew, on 19 July 2011 - 05:52 AM, said:

...I don't want to touch my parser...What I really want to do is figure out how to parse strings....


explain to my how that one works. your trying to parse HTML but you don't want to touch your parser?

This post has been edited by ishkabible: 18 July 2011 - 09:56 PM

Was This Post Helpful? 1
  • +
  • -

#6 doothedew  Icon User is offline

  • New D.I.C Head

Reputation: 1
  • View blog
  • Posts: 38
  • Joined: 09-July 11

Re: Parsing Strings

Posted 18 July 2011 - 10:04 PM

I am parsing HTML but what I am trying to learn is how to parse strings. My HTML parse gives me whether it is a start,stop,comment,text, or end token and then I check each token to see if there are any attributes or if it is a text token I get a gaint string. The HTML parser works perfectly, all I want to do is be able to parse through the long string I get for text token. I don't need to grab text from any of the other tokens.
Was This Post Helpful? 0
  • +
  • -

#7 ishkabible  Icon User is offline

  • spelling expret
  • member icon




Reputation: 1623
  • View blog
  • Posts: 5,714
  • Joined: 03-August 09

Re: Parsing Strings

Posted 18 July 2011 - 10:20 PM

ok so how are you trying to parse the strings? are you familiar with lexical analysis? are you familiar with any kind of parser? recursive decent? LL? LR?

what output are you trying to get? a tree perhaps? just a series of tokens(that wouldn't be parsing but rather lexical analysis)?

This post has been edited by ishkabible: 18 July 2011 - 10:21 PM

Was This Post Helpful? 0
  • +
  • -

#8 doothedew  Icon User is offline

  • New D.I.C Head

Reputation: 1
  • View blog
  • Posts: 38
  • Joined: 09-July 11

Re: Parsing Strings

Posted 19 July 2011 - 07:28 AM

Since I am building a webcrawler I am trying to get an index of words on the webpage (so just text from the title and body). So yes, lexical analysis
Was This Post Helpful? 0
  • +
  • -

#9 NickDMax  Icon User is offline

  • Can grep dead trees!
  • member icon

Reputation: 2250
  • View blog
  • Posts: 9,245
  • Joined: 18-February 07

Re: Parsing Strings

Posted 19 July 2011 - 09:43 AM

So using regex is not terribly hard however it can have its oddities. To really ensure that the regex is working (capturing exactly what you intended) you want a robust test set.

this:

Quote

...I have built all my test cases around my html parser working the way it does.
concerns me a little. Test cases SHOULD NOT be dependent upon the code. You should not have to put too much thought into how the code works to generate test cases, if you ARE then chances are your code is not being tested properly!

The reason is that you will unconsciously tend to make examples that you feel your code "should parse" even if you are consciously trying to come up with examples to "test the limits" of your parser.

test cases should be (as much as possible) independent from the implementation of the code. Ideally they should be completely independent (sometimes we have certain artificial restrictions though, such as size of the data, or we can handle ASCII and not unicode etc. -- things that were left "undefined" in the requirements and so the implementation defines them).

SO! Back to a regex example:
#include <iostream>
#include <regex> //requires a modern compiler!
#include <string>
#include <vector>
using namespace std;
using namespace std::tr1;

const regex rxUpperCase("[A-Z][-A-Za-z0-9_]*"); //uppercase words
const regex rxDigitWords("[0-9][-A-Za-z0-9_]*"); //words beginning with a digit
const regex rxNonDigitWords("[A-Za-z_][-A-Za-z0-9_]*"); //words not-beginning with a digit

void findAll(const string& text, const regex& pattern, vector<string>& output);


int main() {
    string text;
    vector<string> ucWords;
    vector<string> digitWords;
    vector<string> nonDigitWords;
    
    cout << "Please Enter some text: ";
    //getline(cin, text);
    text = "I had 1tiny turtle, his name was Tiny Tim.";
    
    findAll(text, rxUpperCase, ucWords);
    findAll(text, rxDigitWords, digitWords);
    findAll(text, rxNonDigitWords, nonDigitWords);

    cout << "\nUpper Case words:" << endl;
    for(int i = 0; i < ucWords.size(); ++i) {
        cout << ucWords[i] << endl;
    }
    
    cout << "\nWords begining with digits: " << endl;
    for(int i = 0; i < digitWords.size(); ++i) {
        cout << digitWords[i] << endl;
    }
    
    cout << "\nWords NOT begining with digits: " << endl;
    for(int i = 0; i < nonDigitWords.size(); ++i) {
        cout << nonDigitWords[i] << endl;
    }    
    return 0;
}


void findAll(const string& text, const regex& pattern, vector<string>& list) {
    //since we will be using an iterator to find all the matches we need and ending marker
    const sregex_token_iterator endMatches; //empty iterator marks the end of matches.
    
    //define an interator (this could go inside the for-statement but is just messy to do so)
    sregex_token_iterator mIter(text.begin(), text.end(), pattern); 
    for(; mIter != endMatches; ++mIter) { //iterate though and push_back results.
        list.push_back(*mIter);
    }
}


So I used this tutorial and MSDN to work out the "findAll" function.

As for the regex -- well Visual C++'s TR1::RegEx implementation defaults to ECMAScript (javascript) syntax for its regular expressions -- SO you can use web tools to work out the exact regex you need (like say http://regexpal.com/ or maybe a little better http://www.regextester.com/)

There are tons of online tutorials for using regular expressions. I admit sometimes they are a little scary to look at, but like any language start small and learn bits and pieces and eventually they begin to make sense.

This post has been edited by NickDMax: 19 July 2011 - 11:57 AM
Reason for edit:: Fixed 0-0 typo.

Was This Post Helpful? 0
  • +
  • -

#10 ishkabible  Icon User is offline

  • spelling expret
  • member icon




Reputation: 1623
  • View blog
  • Posts: 5,714
  • Joined: 03-August 09

Re: Parsing Strings

Posted 19 July 2011 - 11:58 AM

for more complex string parsing and lexical analysis(mainly were the order of tokens matter) regexs don't cut it. you have to get your hands dirty and make an state machine or hack your way though it. you should look at data flow analysis for creating NFA's for your lexical analyzers. if you were actually going to create a tree(e.g. parse the HTML) then you would need to use something like a recursive decent parser.
Was This Post Helpful? 0
  • +
  • -

#11 NickDMax  Icon User is offline

  • Can grep dead trees!
  • member icon

Reputation: 2250
  • View blog
  • Posts: 9,245
  • Joined: 18-February 07

Re: Parsing Strings

Posted 19 July 2011 - 12:33 PM

You know RegEx gets a bad name when it comes to "parsers" but really they are very powerful and certainly capable of assisting in parsing. However. What you are doing when you are "extracting words" like is is NOT parsing. it is "extracting words" -- parsing implies more structure and syntax. Parsing is about extracting "meaning" based upon a grammar - it is *understanding* a grammar.

RegEx is about pattern matching.

So really they are two completely different fields of thought.

However, RegEx's ability to match patterns can certainly be of assistance in parsing!

Take for example something like a telephone number. Now we can build a parser capable of extracting meaning from (xxx) yyy-zzzz and properly parsing a phone number which will probably take us quite a few lines of code... or we could use RegEx and have it done in 2-3 lines of code.

Take a simple task of syntax hilighting. Technically I should write a C/C++ parser describing the full language out... but I am not really interested in 90% of the knowledge captured in an ADT... So a RegEx based tokenizer can drastically reduce my parser.

point is -- layoff bad mouthing regex dude! :P
Was This Post Helpful? 0
  • +
  • -

#12 Adak  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 331
  • View blog
  • Posts: 1,168
  • Joined: 01-April 11

Re: Parsing Strings

Posted 19 July 2011 - 12:36 PM

You mean you don't want char's like: ! @ # $ % ^ & * ( ) : < > \\ / \? . ; | + = { } [ ] ` ~ \t \n \" \', anywhere in the word strings?

And normally isalpha() would work fine, but you need a couple special char's that isalpha would exclude, right?

Ez-shmeezy, if that's all you want.
Was This Post Helpful? 0
  • +
  • -

#13 ishkabible  Icon User is offline

  • spelling expret
  • member icon




Reputation: 1623
  • View blog
  • Posts: 5,714
  • Joined: 03-August 09

Re: Parsing Strings

Posted 19 July 2011 - 01:56 PM

in this case the OP isn't talking about parsing so yes, regexs can help here and be very useful. however for parsing they are not a good place to look, because the OP said parsing i jumped on the parsing train and gave info regarding parsing.

im not bad mouthing regex, if you just extracting numbers it's fine but parsing has almost no use for regular expressions and infact, has only limited usage in lexical analysis. like you said, they are two different schools of thought. to create a syntax highlight for C++ i would never consider making a full blown parser, that would be ridiculous. however if iwas making a tool for code completion i would make a parser for expressions and regex would not help me at all, it can't even aide in lexical analysis becuase it can't find the order the parts occurred in. e.g. it has a limited use in lexical analysis.
Was This Post Helpful? 0
  • +
  • -

#14 Bench  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 857
  • View blog
  • Posts: 2,343
  • Joined: 20-August 07

Re: Parsing Strings

Posted 19 July 2011 - 02:07 PM

View Postdoothedew, on 19 July 2011 - 06:04 AM, said:

I am parsing HTML but what I am trying to learn is how to parse strings. My HTML parse gives me whether it is a start,stop,comment,text, or end token and then I check each token to see if there are any attributes or if it is a text token I get a gaint string. The HTML parser works perfectly, all I want to do is be able to parse through the long string I get for text token. I don't need to grab text from any of the other tokens.

If you've already extracted your data from the HTML then RegEx is a powerful tool for string parsing. The trick to RegEx is to understand the patterns which you want to match; once you know exactly what you want to look for in your data, you can do all kinds of things with it. The hard part is usually deciding what rules you want to filter on.

e.g. You've specified that you want words which start with an alpha character only. So the match for your first character could look like
[a-zA-Z]
square brackets mean "any of"
a-z denotes a range of characters 'a' to 'z'
A-Z denotes a range of characters 'A' to 'Z'

If you want tokens which contain other alphanumeric characters (or underscores and hyphens) after the first character, then the subsequent part of the pattern could look like
[\\w-]*
Again, square brackets mean "any of"
\\w denotes alphanumeric characters and underscores
the hyphen includes hyphen characters inside the "any of"
the * means "zero or more"

Giving you a simple RegEx pattern of
[a-zA-Z][\\w-]*


RegEx looks ugly at first, but its worth getting comfortable with it. Most programming languages have a RegEx library attached. even Office applications such as MS word have search functionality which uses RegEx. There are loads of tutorials and references out there to help you get used to it.


I'm assuming you're familiar with C++ STL iterators (regex iterators are forward iterators, which work similar to std::list::iterator). You can easily parse a string using the above regex pattern
#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string str = "the\t    quick9  br0wn\n\n fox 8jumped..ov-er#the,lazy,.dog";
    std::tr1::regex pattern("[a-zA-Z][\\w-]*");
 
    //start/end points of tokens in str
    std::tr1::sregex_token_iterator
        iter(str.begin(), str.end(), pattern),
        end;

    while (iter != end)
    {
        std::cout << *iter++ << std::endl;
    }
}


output
the
quick9
br0wn
fox
jumped
ov-er
the
lazy
dog

Was This Post Helpful? 1
  • +
  • -

Page 1 of 1