Word Frequency

Find the number of unique words in a text file.


9 Replies - 20182 Views - Last Post: 04 April 2010 - 12:20 PM

#1 aeolusaether

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 03-April 10

Word Frequency

Posted 03 April 2010 - 03:08 PM

I'm pretty new to this forum, although I have run into dream.in.code from time to time when Googling.
I need help with a homework assignment, which I haven't been able to make much headway with.

Assignment:
Count the number of times each word in a text file appears. Print the results sorted in decreasing order by word count. Let the user specify the text file as a command-line argument. Note: I can work in C or C++, but I understand C better.

My Work:
I have several theories, each of which seems more unlikely than the last. I first created a program that reads the entire text file and prints it to the terminal, character by character. I then made the program count the number of spaces in the text file. It was my hope that I could modify this program to create a list of all the words that appear in the text file, since it can recognize spaces. Here is my code:
#include <stdio.h>

int spaces = 0;

int main ( int argc, char *argv[] )
{
    if ( argc != 2 )
    {
        printf( "usage: %s filename\n", argv[0] );
    }
    else
    {
        FILE *file = fopen( argv[1], "r" );

        if ( file == NULL )
        {
            printf( "Could not open file\n" );
        }
        else
        {
            int x;

            /* Echo every character and count the spaces along the way. */
            while ( ( x = fgetc( file ) ) != EOF )
            {
                printf( "%c", x );
                if ( x == ' ' )
                {
                    spaces++;
                }
            }

            fclose( file );  /* only close a file that was actually opened */
            printf( "\n\nThe number of spaces contained within the file is: %d\n", spaces );
        }
    }
    return 0;
}


I don't know if this approach will work or not.
I also attempted to modify this example code to halt at each space, and concatenate the characters found so far into another variable/string "z".

#include <stdio.h>

int main() {
  FILE *file;
  int c;           /* fgetc returns an int so that EOF can be told apart from a character */
  char z[200];     /* holds the characters of the current word */
  size_t len = 0;  /* number of characters stored in z so far */

  file = fopen("zzz.txt", "r");

  if(file==NULL) {
    printf("Error: can't open file.\n");
    return 1;
  }
  else {
    printf("File opened successfully.\n");

    while(1) {     /* keep looping... */
      c = fgetc(file);
      if(c!=' ' && c!='\n' && c!=EOF) {
        if(len < sizeof z - 1) {
          z[len++] = (char)c;   /* append the character to the current word */
        }
      }
      else {
        if(len > 0) {           /* a space or newline ends the word: print it, start over */
          z[len] = '\0';
          printf("%s\n", z);
          len = 0;
        }
        if(c==EOF) {
          break;                /* ...break when EOF is reached */
        }
      }
    }

    fclose(file);
    return 0;
  }
}




It was my hope that stopping at each space and appending the letters found so far into a variable/string/char would create a list of words for me - not worrying about whether it was unique yet or not. Then I could write some more code to check how many times the contents of each line appeared uniquely, thus generating my list of words and the number of times they appear.

Somebody's going to tell me that I'm trying to reinvent maps or structs or that I can't typecast to save my life. In truth, I don't understand these features very well. I happily understand every function in the second chunk of code up there, and I'll cheerfully learn about whichever functions can help me.

That said, if you give me code, and you do 1, 3, or 4 differently, please explain a little bit, because I'm only familiar with things like fopen and fclose.
1. Open a text file.
2. Parse the file for words. <--- need help
3. Print to terminal.
4. Close the file.

Any help, including criticism of my approach, is appreciated.

AA.


Replies To: Word Frequency

#2 PlasticineGuy

  • mov dword[esp+eax],0

Reputation: 281
  • Posts: 1,436
  • Joined: 03-January 10

Re: Word Frequency

Posted 03 April 2010 - 03:40 PM

It would be MUCH easier to read the whole thing into a string and then search the string for occurrences of words. This would be a lot easier in C++ because of the functions pre-programmed for the C++ standard string class.

#3 JackOfAllTrades

  • Saucy!

Reputation: 6110
  • Posts: 23,670
  • Joined: 23-August 08

Re: Word Frequency

Posted 03 April 2010 - 05:24 PM

Cross-posted here, so don't waste your time duplicating effort.

#4 aeolusaether

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 03-April 10

Re: Word Frequency

Posted 03 April 2010 - 05:38 PM

I was under the impression that I'm free to post my question to different forums, as that's my choice. I wrote the original question on dream.in.code, and was looking for someone to point me in the right direction. If that can't happen when I pose the same question to a different forum, I really don't care. I've never heard of cross posting, and I don't care to abide by what you think I should do off your website.

Aside
Thank you, PlasticineGuy. I'm going to look for example scripts on reading txt files into strings.

#5 PatTheGamer

  • New D.I.C Head

Reputation: 7
  • Posts: 13
  • Joined: 01-April 10

Re: Word Frequency

Posted 03 April 2010 - 05:50 PM

In C there is a way to grab just one word from a file. So for instance you could do the following. (Note: this code just reads in a file, input.txt, and spits out each word on a different line.)

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

int main()
{

    FILE * input;
    char buf[256];

    input = fopen("./input.txt","r");

    if(input == NULL)
    {
        perror("input.txt");
        exit(1);
    }

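    /* fscanf returns 1 each time it stores a word, so loop until it fails.
       %255s keeps a long word from overflowing buf. */
    while(fscanf(input,"%255s",buf) == 1)
    {
        printf("Word: %s.\n",buf);
    }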

    fclose(input);
    return 0;
}


From this you can get a word at a time. Now you have to address the issue of counting each word, which will require some memory management in C. This can get messy, so using C++ might be better for that, but it can be done in C with malloc. Just remember that for every malloc, there should be a free as well.
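
To make the malloc/free pairing concrete, here is a minimal sketch that is not from the thread. copyWord is a hypothetical helper name; it assumes each word has already been read into a fixed buffer (such as buf above) and needs its own heap copy before the buffer is reused.

```c
#include <stdlib.h>
#include <string.h>

/* copyWord is a hypothetical helper (not from the thread): it returns a
   heap-allocated copy of word, or NULL if malloc fails. The caller owns
   the copy and must pass it to free() when done with it. */
char *copyWord(const char *word)
{
    size_t len = strlen(word) + 1;   /* +1 for the terminating '\0' */
    char *copy = malloc(len);
    if (copy != NULL)
    {
        memcpy(copy, word, len);
    }
    return copy;
}
```

Each word read by fscanf would be passed through copyWord(buf) before the next read overwrites buf, and every copy made this way is eventually released with one call to free.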

Hope this helps. :bigsmile:

#6 JackOfAllTrades

  • Saucy!

Reputation: 6110
  • Posts: 23,670
  • Joined: 23-August 08

Re: Word Frequency

Posted 03 April 2010 - 07:25 PM

You're free to post to however many forums you please, just as I'm free to let people know that you're doing so, and where you do it, so people don't waste the time and expertise they're providing -- for free -- only to find the question has been answered elsewhere. In short, it's widely considered rude, as you're taking other people's time for granted.

#7 aeolusaether

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 03-April 10

Re: Word Frequency

Posted 03 April 2010 - 07:44 PM

I think my immediate response to that is that once I reach a solution, regardless of where I received my help, it would be common courtesy to immediately post that solution to the thread. That done, no one else who followed the forum rules and read the full thread, or at least the OP's posts in the thread, would be wasting their time. I also asked a Yahoo! Answers question about a particular aspect of writing this program. Is that wrong too?

Dream In Code provided me with a way to take all the words in a text file, and by turning them into a string, put them on different lines. If you go over to cprogramming, they've entered into a discussion of how I should store this list of words. At this point, I'm trying to decide whether to switch over to C++ and use maps, or create a linked list.

Thanks,

AA.

#8 baavgai

  • Dreaming Coder

Reputation: 5937
  • Posts: 12,862
  • Joined: 16-October 07

Re: Word Frequency

Posted 04 April 2010 - 07:44 AM

PlasticineGuy, on 03 April 2010 - 04:40 PM, said:

It would be MUCH easier to read the whole thing into a string and then search the string for occurrences of words. This would be a lot easier in C++ because of the functions pre-programmed for the C++ standard string class.


Well, it would be much much easier to use Python. Recommending alternate languages doesn't solve the problem in the chosen one. Worse, avoiding the problem in a given language because an alternative is easier doesn't teach you anything.

For this particular problem, reading the whole mess into a string is a poor idea. Reading one word at a time from a stream is a much better idea. Indeed, if I had a string in C++ I'd probably make a stringstream... I'd also use a map in C++. In C, you don't have that; I'd make a simple linked list with an ordered insert.

The assignment calls for word frequencies. Here's a skeleton of how I'd do it:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct NodeStruct {
	char *word;
	int freq;
	struct NodeStruct *next;
} Node;

void strLower(char *s);
Node *createNode(const char *word, Node *next);
void addWord(Node **words, const char *word);
void showWords(Node *p);
void clearWords(Node **words);
void processFile(const char *fileName);

int main () {
	processFile("foo.txt");
	return 0;
}

#define WORD_SIZE 128
void processFile(const char *fileName) {
	FILE *pFile = fopen (fileName, "r");
	if (pFile == NULL) {
		perror("Error opening file");
	} else {
		Node *words = NULL;
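		char word[WORD_SIZE];
		/* Test fscanf's return value rather than feof(): feof() only turns
		   true after a read has failed, so the last word could be added twice.
		   %127s leaves room for the terminator in a WORD_SIZE (128) buffer. */
		while (fscanf(pFile, "%127s", word) == 1) {
			strLower(word);
			addWord(&words, word);
		}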
		fclose (pFile);
		showWords(words);
		clearWords(&words);
	}
}
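
The skeleton declares several functions without defining them. What follows is one possible sketch of four of them (strLower, createNode, showWords and clearWords), written as a guess at what the prototypes intend and not as baavgai's actual code. The Node definition is repeated only so the block compiles on its own, and the "count, tab, word" output format is an assumption.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Same layout as the skeleton's Node, repeated so this block compiles alone. */
typedef struct NodeStruct {
	char *word;
	int freq;
	struct NodeStruct *next;
} Node;

/* Lower-case s in place so "The" and "the" count as the same word. */
void strLower(char *s) {
	for (; *s != '\0'; s++) {
		*s = (char)tolower((unsigned char)*s);
	}
}

/* Allocate a node holding its own copy of word, with freq 1.
   Returns NULL if either allocation fails. */
Node *createNode(const char *word, Node *next) {
	Node *node = malloc(sizeof *node);
	if (node == NULL) {
		return NULL;
	}
	node->word = malloc(strlen(word) + 1);
	if (node->word == NULL) {
		free(node);
		return NULL;
	}
	strcpy(node->word, word);
	node->freq = 1;
	node->next = next;
	return node;
}

/* Print one "count<TAB>word" line per node, in list order. */
void showWords(Node *p) {
	for (; p != NULL; p = p->next) {
		printf("%d\t%s\n", p->freq, p->word);
	}
}

/* Free every node and its word, then reset the caller's head pointer. */
void clearWords(Node **words) {
	Node *p = *words;
	while (p != NULL) {
		Node *next = p->next;   /* save the link before freeing the node */
		free(p->word);
		free(p);
		p = next;
	}
	*words = NULL;
}
```

Each malloc in createNode has a matching free in clearWords, which is why the node keeps its own copy of the word instead of pointing into the caller's buffer.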



For completeness, and maybe it will help, here's how you do the entire problem in Python. :P
def showWordFreq(fileName):
	wordList = {}
	with open(fileName) as file:
		for line in file:
			for word in line.lower().split():
				if word in wordList:
					wordList[word] += 1
				else:
					wordList[word] = 1
	for word in sorted(wordList):
		print("%d\t%s" % (wordList[word], word))


#9 aeolusaether

  • New D.I.C Head

Reputation: 0
  • Posts: 4
  • Joined: 03-April 10

Re: Word Frequency

Posted 04 April 2010 - 11:47 AM

baavgai, that was brilliant. Thank you for providing me with an example which shows me the process of creating a Node/struct in C and compiling the resulting list. I'll figure out sorting and the specifics myself. Of all the people who have been helping me with this problem, you've gotten me the farthest!

This topic is now resolved. I'll post code as soon as I'm done writing it.

Thanks,

AA.

#10 baavgai

  • Dreaming Coder

Reputation: 5937
  • Posts: 12,862
  • Joined: 16-October 07

Re: Word Frequency

Posted 04 April 2010 - 12:20 PM

Excellent. Happy to help. I didn't sort the list. Rather, when a value is added, I add it in order.

This actually makes sense, because you can combine your insert with your find. You move down the list while strcmp of the current node's word against the new word is < 0. If strcmp at the node where you stop is == 0, you've found a match, so increment its frequency. Otherwise, insert a new node, frequency 1, in front of that node so the list stays in order.
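
The combined find-and-insert described above might look something like the following. This is a sketch under assumptions, not the code baavgai wrote: the pointer-to-pointer walk is one common way to do it, the Node definition is repeated so the block compiles alone, and silently skipping a word when malloc fails is a simplification.

```c
#include <stdlib.h>
#include <string.h>

/* Same layout as the skeleton's Node, repeated so this block compiles alone. */
typedef struct NodeStruct {
	char *word;
	int freq;
	struct NodeStruct *next;
} Node;

/* Find word in the alphabetically ordered list, or insert it in order.
   Assumption: on allocation failure the word is silently skipped. */
void addWord(Node **words, const char *word) {
	Node **link = words;   /* the pointer that leads to the current node */

	/* Move down the list while the current word sorts before the new one. */
	while (*link != NULL && strcmp((*link)->word, word) < 0) {
		link = &(*link)->next;
	}

	/* Stopped on an equal word: it is a match, so bump its count. */
	if (*link != NULL && strcmp((*link)->word, word) == 0) {
		(*link)->freq++;
		return;
	}

	/* No match: splice a new node in front of *link (which may be NULL). */
	Node *node = malloc(sizeof *node);
	if (node == NULL) {
		return;
	}
	node->word = malloc(strlen(word) + 1);
	if (node->word == NULL) {
		free(node);
		return;
	}
	strcpy(node->word, word);
	node->freq = 1;
	node->next = *link;
	*link = node;
}
```

Walking with a pointer to the link, not a pointer to the node, means inserting at the head, in the middle, and at the tail are all the same assignment. The list ends up in alphabetical order; printing by decreasing count, as the assignment asks, would still need a separate pass, for example copying the node pointers into an array and calling qsort with a comparison on freq.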
