PERL pattern matching help

I need advice on correct pattern matching code

9 Replies - 2606 Views - Last Post: 05 May 2008 - 05:29 PM

#1 keefer19

Posted 18 April 2008 - 01:45 PM

Hello. If you had a paragraph of text and wanted to count its letters, words, and sentences, what would be the best way to do that? Could you read the file in line by line, split each line into single characters, and count the letters by matching them? Could you also count the sentences by checking whether a character matches !, ?, or . ? How would you count the words? Would it be by finding a letter followed by either a space or a non-letter character? I was wondering whether these pattern-matching lines would work:


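 while ($line = <INFILE>)
 {
	@char = split //, $line;          # break the line into single characters

	foreach $char (@char)
	{
		if (Schar =~ /[A-Za-z]/)      # a letter
		{
			$letterCount++;
		}
		if ($char =~ /[^A-za-z] /)    # a non-letter followed by a space
		{
			$wordCount++;
		}
		if ($char =~ /[!?.]/)         # a sentence terminator
		{
			$sentenceCount++;
		}
	}
 }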




Replies To: PERL pattern matching help

#2 GravityGuy

Posted 18 April 2008 - 03:24 PM

Assuming the simplest case of ASCII text, your code would probably be sufficient, since you are relying on the facts that sentences start with words, words start with letters, and sentences end with one of the punctuation marks listed above.

Exceptions are always the killer, and we could identify a few of them. For instance, a comma-space combination usually marks a phrase within a sentence, so multiple characters must be treated as a single delimiter. Not everything that is not an upper- or lower-case letter is necessarily a word break either, since more punctuation may be involved, such as so-called "quoted parts". Notice that quotation marks occur in pairs.

I think what you have is a parsing problem in which you look at the incoming text as a string of characters. You enter and exit a state depending on where the cursor is within the string and how the last character read was categorized. For example, say you have two states called SentenceState and WordState. As you read the first character, SentenceState and WordState both flip to true. If the next character read is a whitespace character, WordState flips to false and you increment the wordCounter. If the next character is a sentence terminator, SentenceState flips to false and the SentenceCounter is incremented. If multiple whitespace characters occur in sequence, they are skipped. As each character is read, the CharacterCounter is incremented.
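A rough sketch of that two-state idea in Perl, in case it helps to see it written out. The sub name count_with_states and the flag names are made up for illustration, and it assumes that a sentence terminator also ends the current word and that text ending without a terminator still counts as a sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters, words, and sentences in one pass,
# tracking a "in a word" flag and an "in a sentence" flag.
sub count_with_states {
    my ($text) = @_;
    my ($chars, $words, $sentences) = (0, 0, 0);
    my ($in_word, $in_sentence) = (0, 0);

    for my $ch (split //, $text) {
        $chars++;                          # every character read is counted
        if ($ch =~ /\s/) {
            $words++ if $in_word;          # leaving a word
            $in_word = 0;                  # repeated whitespace changes nothing
        }
        elsif ($ch =~ /[.!?]/) {
            $words++ if $in_word;          # assumption: a terminator ends the word too
            $in_word = 0;
            $sentences++ if $in_sentence;  # leaving a sentence
            $in_sentence = 0;
        }
        else {
            $in_word = $in_sentence = 1;   # any other character opens both states
        }
    }
    $words++     if $in_word;              # text ended in the middle of a word
    $sentences++ if $in_sentence;          # assumption: unterminated text is a sentence
    return ($chars, $words, $sentences);
}

my ($c, $w, $s) = count_with_states("Hi there.  How are you?");
print "$c characters, $w words, $s sentences\n";
```

Because the sentence flag is cleared at the first terminator, a run such as "Wait..." is counted as one sentence, not three.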

Think of it as writing an XML parser. If the structure is valid, you must have <tag> </tag> pairs. You enter and leave each set of pairs, however complicated the nesting gets, just like {} pairs in programming. All you have to do is keep track of which pairs you have on the stack (a data structure term) and whether they get popped off in the correct order. The text processing algorithm can work the same way using white space and punctuation. You can probably safely ignore capital letters.
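To make the stack idea concrete, here is a small Perl sketch that only checks whether bracket pairs are closed in the right order. The sub name pairs_balanced is hypothetical, and restricting it to (), [], and {} is an assumption made to keep it short; quotation marks would need extra care because the opening and closing marks look the same.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every opening bracket in $text is closed
# in the right order, 0 otherwise.
sub pairs_balanced {
    my ($text) = @_;
    my %closer_for = ('(' => ')', '[' => ']', '{' => '}');
    my %is_closer  = map { $_ => 1 } values %closer_for;
    my @stack;

    for my $ch (split //, $text) {
        if (exists $closer_for{$ch}) {
            push @stack, $closer_for{$ch};   # remember which closer we expect next
        }
        elsif ($is_closer{$ch}) {
            my $expected = pop @stack;       # undef if nothing was open
            return 0 unless defined $expected && $expected eq $ch;
        }
    }
    return @stack ? 0 : 1;                   # leftovers mean an unclosed opener
}

for my $sample ('He said (quietly [very]) hello.', 'Oops (not closed]') {
    my $verdict = pairs_balanced($sample) ? 'balanced' : 'unbalanced';
    print "$verdict: $sample\n";
}
```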

Over a large enough sample of text, the simple rules of counting spaces for the number of words and counting periods, question marks, and exclamation points for the number of sentences are probably not bad ones. It just depends on the level of accuracy you need. How's that for confusing the situation?

#3 keefer19

Posted 19 April 2008 - 05:39 AM

Quoting GravityGuy's post #2 of 18 Apr 2008 - 03:24 PM in full, with no added reply.


#4 keefer19

Posted 19 April 2008 - 05:48 AM

Thank you for replying. I'm kind of new at PERL programming, so how would you suggest I modify the pattern matching to take care of the exceptions you note? I understand WHAT you are saying; I'm just not sure HOW to do it :-) Thank you for your help, I really appreciate it.

#5 KevinADC

Posted 19 April 2008 - 12:28 PM

Is this school work?

#6 keefer19

Posted 19 April 2008 - 03:16 PM

Yes. The only part I have a question about is how to get the pattern matching to work. Our book has information on it, but the way it describes matching led me to think the examples I showed would cover what I was trying to do. If the example code will not work, could you please show me an example of what would work and explain the differences, so I can learn why and do better next time? Thank you for any help.

#7 KevinADC

Posted 19 April 2008 - 10:10 PM

You have to narrow the scope of the problem down to some well-defined parameters:

What constitutes a word?
What constitutes word boundaries?
What constitutes a sentence?

The simplest set of parameters is that spaces divide words, a word is any run of non-whitespace characters, and sentences end with one of a few punctuation marks: .?!

Context is ignored. All you will do is find a few patterns and increment your counters based on those few patterns.
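As a minimal sketch of those simplest rules in Perl: the sub name simple_counts is made up for illustration, and it assumes the text is already in a string.

```perl
use strict;
use warnings;

# Hypothetical helper applying the simplest rules: whitespace separates
# words, and each . ? or ! ends a sentence. Context is ignored.
sub simple_counts {
    my ($text) = @_;
    my @words     = split ' ', $text;       # split on runs of whitespace, ignoring leading space
    my $sentences = ($text =~ tr/.?!//);    # tr counts the listed characters without changing $text
    return (scalar @words, $sentences);
}

my ($words, $sentences) = simple_counts("It works. Does it? Yes!");
print "$words words, $sentences sentences\n";
```

With rules this simple, every terminator counts, so "Wait..." adds three sentences, and punctuation attached to a word stays part of that word.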

You actually have the right idea, although there is an error in your code in this line (Schar should be $char):

if(Schar=~/[A-Za-z]/)


You could go about this in one of several ways but I think your code is probably OK for a student and especially for a beginner. Showing you how to do this a better way would not be helpful since it will more than likely leap-frog ahead of your current lesson.
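For reference, one way the loop from the first post might look once corrected. This is only a sketch: the sample lines stand in for the INFILE handle, and the word rule (a letter that is not followed by another letter) is tested against the whole line, since a single character can never match a two-character pattern.

```perl
use strict;
use warnings;

my ($letterCount, $wordCount, $sentenceCount) = (0, 0, 0);

# Assumption: sample lines stand in for the INFILE handle used in the question.
my @lines = ("Hello there, world!\n", "Is this right? Yes.\n");

for my $line (@lines) {
    for my $char (split //, $line) {                 # split needs a comma after //
        $letterCount++   if $char =~ /[A-Za-z]/;     # $char, not Schar
        $sentenceCount++ if $char =~ /[!?.]/;
    }
    # A word ends wherever a letter is not followed by another letter.
    $wordCount++ while $line =~ /[A-Za-z](?![A-Za-z])/g;
}

print "$letterCount letters, $wordCount words, $sentenceCount sentences\n";
```

Under this rule a contraction such as "don't" counts as two words, which is one of the exceptions mentioned earlier in the thread.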

#8 keefer19

Posted 21 April 2008 - 06:07 PM

Thanks for the reply. Our textbook describes that combination ( [A-Za-z] ) as the same as the \w metacharacter without the _ . Is that wrong? Our professor said she would cover pattern matching in the last week of class, but that we could learn about it and use it on our own if we wanted to. She said we would receive extra credit if we used it in a project (I already turned in my project without using matching). I don't really care about getting extra credit; I want to learn how to use it so I can write better, more powerful code. When she was asked if we could ask about it on programming forums, she said that "the Internet is there to use, find all you can on it." She told us that this is one of the features that makes PERL a very powerful language.
We only have one project left, which is working with CGI and making an interactive program for our web page. If you still want to wait until our class is over, may I ask for more information about this when our class has ended? School is over the first week in May and I don't mind waiting until then to learn more. Either way, I still want to learn how to use this correctly. Thanks for your help so far, and I am looking forward to learning what I can from anyone who is willing to help me.

#9 KevinADC

Posted 21 April 2008 - 06:17 PM

Quote

[A-Za-z] ) as the same as the \w metacharacter without the _ . Is that wrong?


Technically it is wrong. For ASCII text, \w is the same as [A-Za-z0-9_]: the digits are included.
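A quick way to see the difference for yourself; the sub name classify is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a character matches \w and [A-Za-z].
sub classify {
    my ($ch) = @_;
    my $in_w       = $ch =~ /\w/       ? 'yes' : 'no';
    my $in_letters = $ch =~ /[A-Za-z]/ ? 'yes' : 'no';
    return ($in_w, $in_letters);
}

for my $ch ('a', 'Z', '7', '_', '-') {
    my ($in_w, $in_letters) = classify($ch);
    print "'$ch'  \\w: $in_w  [A-Za-z]: $in_letters\n";
}
```

The digit and the underscore match \w but not [A-Za-z], while the hyphen matches neither.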

#10 Petebardo

Posted 05 May 2008 - 05:29 PM

Here's another way to get this done. It lets the regular expression engine do all the work. Because it reads the whole file into memory, it might not work for you if the file is large.

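my $text = '';
while (<FH>) {
	$text .= $_;    # append each line to build up the whole file
}
# Replacing each match with itself ($1) leaves $text unchanged.
my $charcount = $text =~ s/(\w)/$1/g;           # letters, digits, and underscores
my $wordcount = $text =~ s/(\w+?\W)/$1/g;       # word characters followed by one non-word character
my $sentencecount = $text =~ s/([!?.])/$1/g;    # sentence terminators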



s///g returns the number of substitutions it made, and that count is what gets assigned to each variable. There is probably a better way to read the entire file into one variable.
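One common alternative, sketched here under a few assumptions: the file is read in one go by locally undefining $/, and the matches are counted directly, without substitution. An in-memory filehandle stands in for the real file so the example runs as-is, and \w+ is swapped in for \w+?\W so that a last word with nothing after it is still counted.

```perl
use strict;
use warnings;

# Assumption: an in-memory filehandle stands in for the real file.
my $sample = "One two three. Four five!\nSix?\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# With $/ (the input record separator) undefined, <$fh> returns the whole file.
my $text = do { local $/; <$fh> };
close $fh;

# A /g match in list context returns every match; assigning that list to ()
# and then to a scalar yields the number of matches.
my $charcount     = () = $text =~ /\w/g;
my $wordcount     = () = $text =~ /\w+/g;
my $sentencecount = () = $text =~ /[!?.]/g;

print "$charcount word characters, $wordcount words, $sentencecount sentences\n";
```

This still holds the whole file in memory, so the caveat about large files applies here as well.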

Just a thought...