3 Replies - 8715 Views - Last Post: 11 March 2013 - 03:05 AM

#1 UnknownCodester

Word Counter Program

Posted 09 March 2013 - 02:01 PM

Hi,
I have been trying to sort words by word count. The sorting itself works, but I am having a problem.

When I sort them into order, the word count seems to be reset to 1 for every word.
Does anyone know why this is?

(Oh, and the regular expressions are not perfect yet. They are meant to eliminate non-words, but I haven't got round to completing that part.)

use CGI qw(:all);
use LWP::Simple qw(get);

$text = "sign-writer been been 'twas up-to-date 'twas has- been <!-- comment --> <here> Assignment1 £Assignment As£ment ";
print "<xmp>$text</xmp >", br(), "\n";

$text = lc($text);
$text =~ s/<(.*?)>//g;
$text =~ s/-{2,}/ /gi;
$text =~ s/[^a-zA-Z0-9\'\-\_]/ /g;


my @array = split(/\s+/,$text);


%count = ();
foreach $string ( @array ) {
$count { $string }++;
}
foreach $key ( keys %count ) {
$value = $count { $key }, br (), "\n";
}


print "<table border='2'>";
print "<th>String</th><th>Count</th>";


foreach $key ( sort {$count{$b} <=> $count{$a}} keys %count)
{
print "$key $value\n";
}




Replies To: Word Counter Program

#2 dsherohman

Re: Word Counter Program

Posted 10 March 2013 - 01:18 AM

You only have one $value for the entire program, and it is set in the first foreach loop over keys %count (lines 20-22 of your code). When you get to the sorted loop (lines 30-32), $value still holds the count of the last key visited in the earlier loop and, since you don't assign to it again, it doesn't change as you loop over the sorted keys.

Try changing line 21 to $count{$key} .= "<br>\n"; and line 31 to print "$key $count{$key}\n"; and you should see the correct counts. (The sort still works after this change because Perl uses the leading digits of each value in a numeric comparison, although it will complain about non-numeric values if warnings are enabled.)

Or, if you change line 31 to print "$key $count{$key}<br>\n";, you could also eliminate the loop at lines 20-22 entirely.
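As a rough illustration of that second option, here is a minimal sketch that looks the count up inside the sorted loop instead of going through a shared $value. The sample word list and the alphabetical tie-break are assumptions added for the example, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real program builds its list from $text
my @words = qw(been been twas up-to-date twas been);

# Tally each word
my %count;
$count{$_}++ for @words;

# Look the count up per key, so every word prints its own value.
# Ties are broken alphabetically (an addition, to keep the order stable).
for my $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$key $count{$key}<br>\n";
}
# Prints:
# been 3<br>
# twas 2<br>
# up-to-date 1<br>
```

Keeping the counts purely numeric and adding the markup only at print time also avoids the non-numeric warnings mentioned above.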

This post has been edited by dsherohman: 10 March 2013 - 01:20 AM


#3 UnknownCodester

Re: Word Counter Program

Posted 10 March 2013 - 02:38 PM

I have changed my code to print the words out in numerical order.
However, as my program will read files with a lot of words, I was wondering if it is possible to print only the ten most common words instead of all of them. I was thinking of a for loop that runs only 10 times over a sorted list, but I am not sure.

my @array = split(/\s+/,$text);

%count = ();

foreach $string ( @array )
{
    $count{$string}++;
    $totalwords++;
}

print "The document contains $totalwords words\n";
print "Words Occurring Most Often\n";
print "<table border='2'>";
print "<tr><th>Word</th><th>No of Occurrences</th></tr>";

# Note: This one prints the words sorted from most to least frequent
foreach $key ( sort {$count{$b} <=> $count{$a}} keys %count)
{
    print "<tr><td>$key</td><td>$count{$key}</td></tr>\n";
}
print "</table>";

print "Words Occurring Least Often\n";
print "<table border='2'>";
print "<tr><th>Word</th><th>No of Occurrences</th></tr>";

# Note: This one prints the words sorted from least to most frequent
foreach $key ( sort {$count{$a} <=> $count{$b}} keys %count)
{
    print "<tr><td>$key</td><td>$count{$key}</td></tr>\n";
}
print "</table>";


Also, I need to create a regular expression to eliminate words that end in hyphens (as they are not words).
For example, Hello- Hello-- Hello--- should all be removed.
My regular expression works for Hello- but not for Hello-- or Hello---.
I have attempted this with
$text =~ s/\w+-\n//g;


#4 dsherohman

Re: Word Counter Program

Posted 11 March 2013 - 03:05 AM

UnknownCodester, on 10 March 2013 - 10:38 PM, said:

I have changed my code to print the words out in numerical order.
However, as my program will read files with a lot of words, I was wondering if it is possible to print only the ten most common words instead of all of them. I was thinking of a for loop that runs only 10 times over a sorted list, but I am not sure.


The easiest way to do this (and most efficient, if you want to display both the top 10 and bottom 10) would be to use an array to store the sorted keys, then print the first 10 and last 10 items in that array.
# Sort by descending count so the most common words come first
my @sorted = sort {$count{$b} <=> $count{$a}} keys %count;

print "Top 10\n";
for (@sorted[0 .. 9]) {
  print "$_: $count{$_}\n";
}

print "Bottom 10\n";
# Negative array indexes count back from the end of the array
for (@sorted[reverse -10 .. -1]) {
  print "$_: $count{$_}\n";
}
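One caveat with fixed slices such as 0 .. 9 is that they yield undefined entries when the hash holds fewer than ten words. The following is a minimal sketch of one way to guard against that, assuming a descending sort; the sample counts, the $n variable, and the alphabetical tie-break are hypothetical additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical counts; in the thread these come from the %count hash
my %count = (been => 3, twas => 2, 'up-to-date' => 1, has => 1);

# Most frequent first; ties broken alphabetically so the order is stable
my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

# Never ask for more items than the list actually holds
my $n = @sorted < 10 ? scalar @sorted : 10;

my @top    = @sorted[0 .. $n - 1];          # most common, best first
my @bottom = reverse @sorted[-$n .. -1];    # least common, rarest first

print "Top $n\n";
print "$_: $count{$_}\n" for @top;

print "Bottom $n\n";
print "$_: $count{$_}\n" for @bottom;
```

With only four distinct words, both lists contain all four entries and no warnings about uninitialized values are produced.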


UnknownCodester, on 10 March 2013 - 10:38 PM, said:

Also, I need to create a regular expression to eliminate words that end in hyphens (as they are not words).
For example, Hello- Hello-- Hello--- should all be removed.
My regular expression works for Hello- but not for Hello-- or Hello---.
I have attempted this with
$text =~ s/\w+-\n//g;


Yeah, that regex will only match exactly one dash. You need to add a + after the dash to allow it to match more than one, the same as you did with the \w. Try $text =~ s/\w+-+\n//g; and it should match Hello-- and Hello--- as well as Hello-. Note, though, that it will only match at the end of a line, since you have the \n there. If the input line has been chomped (which is generally recommended), or if the word comes from the beginning or middle of the line, then the regex will never match.
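Since the program already splits the text on whitespace, one way to sidestep the end-of-line problem is to filter the words after the split rather than editing the string. This is only a sketch under that assumption; the sample text and the @words name are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample text
my $text = "hello- hello-- hello--- sign-writer been up-to-date has- been";

# Split on whitespace first, then discard any token that ends in a hyphen.
# Hyphens inside a word (sign-writer, up-to-date) are left alone.
my @words = grep { !/-$/ } split ' ', $text;

print "@words\n";   # sign-writer been up-to-date been
```

Because each token is tested on its own, the check no longer depends on where the word sits in the line or on whether the line was chomped.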