PERL pattern matching help

 

PERL pattern matching help, I need advice on correct pattern matching code

keefer19
post 18 Apr, 2008 - 01:45 PM
Post #1


New D.I.C Head

Joined: 8 Dec, 2006
Posts: 16




Hello. If you had a paragraph of text and wanted to count the number of letters, words, and sentences in it, what would be the best way to do that? Could you read the file in line by line, split each line into characters, then count the letters by matching them? Could you also count the sentences by checking whether a character matches one of [ ! ? . ]? How would you count the words? Would it be by finding a letter character followed by either a space or a non-letter character? I was wondering whether the pattern-matching lines below would work:

CODE


while($line=<INFILE>)

{
     @char=split // $line;

     foreach $char(@char)
      {
        if(Schar=~/[A-Za-z]/)
            {
                $letterCount++;
             }
        if($char=~/[^A-za-z] /)
            {
               $wordCount++;
            }
        if($char=~/[!?.]/
            {
              $sentenceCount++;
             }
}


GravityGuy
post 18 Apr, 2008 - 03:24 PM
Post #2


New D.I.C Head

Joined: 21 Jan, 2008
Posts: 25

Assuming the simplest case of ASCII text, your code would probably be sufficient, since you are relying on the fact that sentences start with words, words start with letters, and sentences end with one of the punctuation marks listed above.

Exceptions are always the killer, and we could identify a few of them. For instance, a comma-space combo usually indicates a phrase within a sentence, so multiple characters must be treated as a single delimiter as well. Not everything that isn't an upper- or lower-case letter is necessarily a word break either, since more punctuation may be involved, such as so-called "quoted parts". Notice that quotation marks occur in pairs.

I think what you have is a parsing problem where you have to look at the incoming text as a string of characters. You enter and exit a state depending on where the cursor is located within the string and how the last character read was categorized. For example, you have two states called SentenceState and WordState. As you read the first character, both SentenceState and WordState flip to true. If the next character read is a whitespace character, WordState flips to false and you increment the wordCounter. If the next character is a sentence terminator character, SentenceState flips to false and the SentenceCounter is incremented. If multiple whitespace characters occur in sequence, they are skipped. As each character is read, the CharacterCounter is incremented.
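As a rough illustration of the state idea above, here is a minimal sketch in Perl. The sub name count_text and the flags $in_word and $in_sentence are made up for this example; they stand in for WordState and SentenceState. It assumes plain ASCII text, treats any non-letter as the end of a word, and counts letters only rather than every character read.

```perl
use strict;
use warnings;

# Hypothetical helper: counts letters, words, and sentences in a string
# by tracking whether we are currently inside a word or a sentence.
sub count_text {
    my ($text) = @_;
    my ($letters, $words, $sentences) = (0, 0, 0);
    my ($in_word, $in_sentence) = (0, 0);

    for my $ch (split //, $text) {
        if ($ch =~ /[A-Za-z]/) {
            $letters++;
            $in_word     = 1;    # entering (or staying in) a word
            $in_sentence = 1;    # a letter also opens a sentence
        }
        else {
            # Any non-letter ends the current word, if there is one.
            if ($in_word) {
                $words++;
                $in_word = 0;
            }
            # A terminator ends the current sentence, if there is one.
            if ($in_sentence && $ch =~ /[.!?]/) {
                $sentences++;
                $in_sentence = 0;
            }
        }
    }
    $words++ if $in_word;    # text ended in the middle of a word
    return ($letters, $words, $sentences);
}

my ($l, $w, $s) = count_text("Hello there!  How are you? Fine.");
print "letters=$l words=$w sentences=$s\n";
```

For that sample string this prints letters=23 words=6 sentences=3. Because the sentence flag is cleared after the first terminator, a run such as "..." counts as one sentence end. An apostrophe still splits "don't" into two words, which is one of the exceptions you would have to decide how to handle.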

Think of it as writing an XML parser. If the structure is valid, you must have <tag> </tag> pairs. You enter and leave each set of pairs, however complicated the nesting gets, just like {} pairs in programming. All you have to do is keep track of which pairs you have on the stack (a data structure term) and whether they get popped off in the correct order. The text processing algorithm can work the same way using white space and punctuation. You can probably safely ignore capital letters.
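To make the stack idea concrete, here is a small sketch that checks whether bracket pairs are closed in the right order. The sub name pairs_balanced is hypothetical. It only handles ( ), [ ] and { }, because straight quotation marks use the same character to open and to close and would need separate handling.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every (, [ and { in $text is closed
# by the matching ), ] or } in the correct order, and 0 otherwise.
sub pairs_balanced {
    my ($text) = @_;
    my %closer_for = ('(' => ')', '[' => ']', '{' => '}');
    my %is_closer  = map { $_ => 1 } values %closer_for;
    my @stack;

    for my $ch (split //, $text) {
        if (exists $closer_for{$ch}) {
            push @stack, $closer_for{$ch};    # remember which closer we expect
        }
        elsif ($is_closer{$ch}) {
            my $expected = pop @stack;
            return 0 unless defined $expected && $expected eq $ch;
        }
    }
    return @stack ? 0 : 1;    # leftovers mean something was never closed
}

print pairs_balanced('He said (quietly [very quietly]) "hi".') ? "ok\n" : "unbalanced\n";
print pairs_balanced('An (unclosed [bracket)') ? "ok\n" : "unbalanced\n";
```

The first call prints ok and the second prints unbalanced, because the ) arrives while the stack still expects a ].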

Over a large enough sample of text, the simple rule of counting spaces for the number of words and counting periods, question marks, and exclamation points for the number of sentences is probably not a bad one. It just depends on the level of accuracy you need. How's that for confusing the situation?

keefer19
post 19 Apr, 2008 - 05:39 AM
Post #3


New D.I.C Head

Joined: 8 Dec, 2006
Posts: 16


keefer19
post 19 Apr, 2008 - 05:48 AM
Post #4


New D.I.C Head

Joined: 8 Dec, 2006
Posts: 16


Thank you for replying. I'm kind of new at Perl programming, so how would you suggest I modify the pattern matching to take care of the exceptions you note? I understand WHAT you are saying; I'm just not really sure HOW to do it :-) Thank you for your help, I really appreciate it.

KevinADC
post 19 Apr, 2008 - 12:28 PM
Post #5


D.I.C Head

Joined: 23 Jan, 2007
Posts: 168

Is this school work?

keefer19
post 19 Apr, 2008 - 03:16 PM
Post #6


New D.I.C Head

Joined: 8 Dec, 2006
Posts: 16


Yes, the only part I have a question about is how to get the pattern matching to work. Our book has information on it, but the way it describes the matching led me to think the examples I showed would cover what I was trying to do. If the example code will not work, could you please show me an example of what would work and explain the differences, so I will be able to learn why and do better next time? Thank you for any help.

KevinADC
post 19 Apr, 2008 - 10:10 PM
Post #7


D.I.C Head

Joined: 23 Jan, 2007
Posts: 168

You have to narrow the scope of the problem down to some well-defined parameters:

What constitutes a word?
What constitutes word boundaries?
What constitutes a sentence?

The simplest set of parameters is that spaces divide words, words are any non-white space characters, and sentences end with one of a few punctuation marks: .?!

Context is ignored. All you will do is find a few patterns and increment your counters based on those few patterns.
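A minimal sketch of that simplest set of rules might look like the following. The sub name simple_counts is invented for the example, and it assumes the text is already in a string.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the simplest rules only:
# words are runs of non-whitespace, sentences end in . ? or !
sub simple_counts {
    my ($text) = @_;
    my @words     = split ' ', $text;          # split ' ' skips leading and repeated whitespace
    my $sentences = () = $text =~ /[.?!]/g;    # count every terminator character
    return (scalar @words, $sentences);
}

my ($word_count, $sentence_count) = simple_counts("Is this school work? Yes. Thanks!");
print "words=$word_count sentences=$sentence_count\n";
```

This prints words=6 sentences=3. Since context is ignored, an ellipsis counts as three sentence ends and an abbreviation such as "e.g." counts as two.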

You actually have the right idea although there is an error in your code in this line:

CODE
if(Schar=~/[A-Za-z]/)


You could go about this in one of several ways but I think your code is probably OK for a student and especially for a beginner. Showing you how to do this a better way would not be helpful since it will more than likely leap-frog ahead of your current lesson.
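For reference, the problem in the quoted line is that the variable is typed as Schar instead of $char. The posted loop appears to have a few other slips too: no comma after the split pattern, a word test that requires two characters (a non-letter followed by a space) even though $char holds only one, a missing closing parenthesis on the sentence test, and no closing brace for the while block. The sketch below is one possible repair that keeps the same character-by-character approach. The sub name count_from_handle and the $prev variable are additions for the example, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop so it can be run on any open filehandle.
sub count_from_handle {
    my ($fh) = @_;
    my ($letterCount, $wordCount, $sentenceCount) = (0, 0, 0);
    my $prev = '';                                # previous character seen

    while (my $line = <$fh>) {
        my @chars = split //, $line;              # note the comma after the pattern
        foreach my $char (@chars) {
            if ($char =~ /[A-Za-z]/) {            # $char, not Schar
                $letterCount++;
            }
            elsif ($prev =~ /[A-Za-z]/) {         # a letter followed by a non-letter ends a word
                $wordCount++;
            }
            if ($char =~ /[!?.]/) {               # closing parenthesis restored
                $sentenceCount++;
            }
            $prev = $char;
        }
    }
    $wordCount++ if $prev =~ /[A-Za-z]/;          # last word when the file lacks a final newline
    return ($letterCount, $wordCount, $sentenceCount);
}

# Usage with an in-memory file standing in for INFILE.
my $sample = "Hello there!\nHow are you?\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
my @counts = count_from_handle($in);
close $in;
print "letters=$counts[0] words=$counts[1] sentences=$counts[2]\n";
```

With the two-line sample this prints letters=19 words=5 sentences=2.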






keefer19
post 21 Apr, 2008 - 06:07 PM
Post #8


New D.I.C Head

Joined: 8 Dec, 2006
Posts: 16


Thanks for the reply. Our textbook describes that combination ( [A-Za-z] ) as the same as the \w metacharacter without the _ . Is that wrong? Our professor said she would cover pattern matching in the last week of class but that we could learn about it and use it on our own if we wanted to. She said we would receive extra credit if we used it in a project (I already turned in my project without using matching). I don't really care about getting extra credit; I want to learn how to use it so I can write better, more powerful code. When she was asked if we could ask about it on programming forums, she said that "the Internet is there to use, find all you can on it." She told us that this is one of the features that makes Perl a very powerful language.
We only have one project left, and that is working with CGI and making an interactive program for our web page. If you still want to wait until our class is over, may I ask for more information about this when our class has ended? School is over the first week in May and I don't mind waiting until then to learn more. Either way, I still want to learn how to use this correctly. Thanks for your help so far, and I am looking forward to learning what I can from anyone who is willing to help me.


KevinADC
post 21 Apr, 2008 - 06:17 PM
Post #9


D.I.C Head

Joined: 23 Jan, 2007
Posts: 168

QUOTE
[A-Za-z] ) as the same as the \w metacharacter without the _ . Is that wrong?


Technically it is wrong. For plain ASCII text, \w is the same as [A-Za-z0-9_], so the digits are included as well.
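A quick way to see the difference is to test a few single characters against both classes. The sub name class_report is made up for this sketch, and it only looks at ASCII input.

```perl
use strict;
use warnings;

# Compare the two character classes on a few sample ASCII characters.
# Each entry reads "char:matches \w/matches [A-Za-z]" with 1 for yes, 0 for no.
sub class_report {
    my (@chars) = @_;
    return join ' ', map {
        my $w     = /^\w$/       ? 1 : 0;
        my $alpha = /^[A-Za-z]$/ ? 1 : 0;
        sprintf '%s:%d/%d', $_, $w, $alpha;
    } @chars;
}

print class_report('a', 'Z', '7', '_', '-'), "\n";
```

This prints a:1/1 Z:1/1 7:1/0 _:1/0 -:0/0, so the digit and the underscore match \w but not [A-Za-z].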

Petebardo
post 5 May, 2008 - 05:29 PM
Post #10


New D.I.C Head

Joined: 14 Dec, 2006
Posts: 12

Here's another way to get this done. It lets the regular expression engine do all the work. Because it reads the whole file into memory, this might not work for you if the file is large.

CODE

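my $text = '';
while (<FH>) {       # FH is assumed to be an already-opened input filehandle
    $text .= $_;     # append each line to build up the whole file in memory
}
# s///g returns the number of substitutions made (a false value if none).
# Each match is replaced with itself, so $text is left unchanged.
my $charcount     = $text =~ s/(\w)/$1/g;       # word characters: letters, digits, _
my $wordcount     = $text =~ s/(\w+?\W)/$1/g;   # word characters followed by a non-word character
my $sentencecount = $text =~ s/([!?.])/$1/g;    # sentence terminators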


The syntax assigns the number of substitutions made to a variable. There is probably a better way to read the entire file into one variable.

Just a thought...
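On reading the entire file into one variable: one common idiom is to localize the input record separator $/ so that a single read returns everything. The sketch below also swaps the substitution trick for a global match in list context, which counts matches without touching the text. The sub names slurp and count_matches are invented here, and the word pattern is \w+ rather than the \w+?\W used above, so a final word with nothing after it is still counted.

```perl
use strict;
use warnings;

# Hypothetical helper: reads everything from an open filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undefine the input record separator: <$fh> now returns the whole file
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Count matches without modifying the text: a global match in list
# context returns every match, and assigning that list to () in scalar
# context yields the number of matches.
sub count_matches {
    my ($text, $re) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $sample = "The cat sat. Did it? Yes!\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $text = slurp($fh);
close $fh;

printf "chars=%d words=%d sentences=%d\n",
    count_matches($text, qr/\w/),
    count_matches($text, qr/\w+/),
    count_matches($text, qr/[!?.]/);
```

For the sample text this prints chars=17 words=6 sentences=3. The whole file still has to fit in memory, so the earlier caveat about large files applies here as well.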
