Finding common words in a file using lexicographical matching help


94 Replies - 3422 Views - Last Post: 05 September 2012 - 08:19 AM

#1 donnie7

  • D.I.C Head

Reputation: 0
  • Posts: 55
  • Joined: 12-February 11

Posted 30 August 2012 - 07:32 PM

I'm supposed to determine and list the 50 most common words in a file.

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{

    string filename;

    // Get the filename.
    cout << "Enter the file you wish to have searched:\n";
    cin >> filename;

    // Open file.
    ifstream file(filename.c_str());

    // Read in all the words.
    string word;

    // ??? search for matches lexicographically ???
}




After this, I'm not sure what to do. My teacher wants me to search for matches lexicographically, but I don't know how to do that. Could anyone please offer help on how to search for word matches lexicographically? I will then output the 50 most used words after finding these matches.


Replies To: Finding common words in a file using lexicographical matching help

#2 Skydiver

  • Code herder

Reputation: 3535
  • Posts: 10,944
  • Joined: 05-May 12

Posted 30 August 2012 - 07:39 PM

You haven't even read in all the words yet...

Isn't this the same as this question here: http://www.dreaminco...exigraphically/

#3 vividexstance

  • D.I.C Lover

Reputation: 653
  • Posts: 2,240
  • Joined: 31-December 10

Posted 31 August 2012 - 09:17 AM

Check out the std::lexicographical_compare function in the <algorithm> header.

You could also just use the member functions from the std::string class to search for the words.

The easiest way, I think, would be to use an associative container; the std::map class can be a big help. You could create a map that has a std::string for a key (each unique word) and an integer mapped value that holds the word's frequency (count):
// A first attempt:
//     std::map<std::string, int> wordCount;
// Because a word count shouldn't be negative,
// this is probably a better choice:
std::map<std::string, unsigned int> wordCount;


To check if a word is in the map, just use the count() member function:
// ...
std::string word = "asdf";
// Check if the word "asdf" is in the map:
if(wordCount.count(word) == 0)
    wordCount[word] = 1;


Somebody correct me if I'm wrong, but when you insert a key into the map for the first time with operator[] and you don't give it a value, the mapped value (an integer in this case) is value-initialized, so it starts at zero. So you don't even need to check if the word is in the map first; you just need to increment the count:
wordCount[word]++;


That line will insert a new key-value pair into the map if the word isn't already in there, and either way it will increment the value. When the word is first inserted, its value is zero, so it's incremented to 1. If the word was already in the map, the count is just incremented. After you've inserted all the words, you will have a container that holds each unique word and its associated count.

#4 donnie7

Posted 01 September 2012 - 12:24 PM

OK, I'm still trying to figure this out. I don't have a lot of C++ background, so I'm not sure how to use this map class efficiently. Here's my code so far:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{

    string filename;

    // Get the filename.
    cout << "Enter the file you wish to have searched:\n";
    cin >> filename;

    // Open file and store into an array ...c_str()
    ifstream file(filename.c_str());

    // Pulling in the words to be strings
    std::string word;

    // map that holds the word and the number of times it appears because of 'string' and 'int'
    std::map<std::string, unsigned int> wordCount;

    // Until the end of the file is reached
    while (!fin.eof())
    {
        // insert a new key-value pair into the map if the word isn't already in there and increment the value
        wordCount[word]++;
    }

}




So does this automatically jump to the next word in my file? Also, if this is correct, should I use some kind of sort method to print ONLY the 50 most common words? I'm still having problems compiling it. For some reason the C++ compiler doesn't like the way I tried to tell when the end of the file is reached.

This post has been edited by donnie7: 01 September 2012 - 12:38 PM


#5 Skydiver

Posted 01 September 2012 - 01:32 PM

You should paste in the exact error you are getting. As a freebie, though, I'm guessing that you wanted to use feof() instead of eof().

That will get you past the compilation error, but you'll still end up with a logic error, because the EOF flag is only set after a stream operation has actually tried to read past the end of the file.

There are two approaches to this:

One way is to try to read a line, and if the read line fails, break out of your loop. Once you have a line, you'll need to break the line up into words and then update your frequencies for each word, and then loop back and try to read the next line.

The other approach is to take advantage of the >> stream operator behavior when reading strings. It reads characters until the next whitespace. If the read failed, then break out of the loop, otherwise update the frequency for that word. Then try to read the next word.

#6 donnie7

Posted 01 September 2012 - 01:50 PM

Skydiver, on 01 September 2012 - 01:32 PM, said:

The other approach is to take advantage of the >> stream operator behavior when reading strings. It reads characters until the next whitespace. If the read failed, then break out of the loop, otherwise update the frequency for that word. Then try to read the next word.


I'll try this approach. How is whitespace defined in C++? Would it be this?

while (word >> '\n ')
wordCount[word]++;

else   

///code here///



This post has been edited by donnie7: 01 September 2012 - 02:04 PM


#7 donnie7

Posted 01 September 2012 - 02:10 PM

If I use strings to read this document, I will have tons of lines of code. My teacher gave us this assignment for my Computer Graphics class to see where we are at, and I feel like I can't do this without help. I have limited coding experience.

Skydiver, is what you are telling me to do a completely different way of going about this than what vividexstance told me to do? Do I have to do a combination of things? I just do not know how people can know exactly which template to use when writing a program, and when.

#8 Skydiver

Posted 01 September 2012 - 02:56 PM

Sorry about the feof() reference earlier. I've been jumping between C and C++ a lot recently. Your use of eof() should have been correct, but the logic error still stands: the EOF flag is only set after some operation on the stream has hit the end of the file.

What vividexstance is talking about is how to update your word counts using the std::map<>. For that to work, string word; needs to be set to a new value on each iteration. I'm trying to help you set it.

You are close. The >> operator when reading strings will just collect characters until it hits whitespace.

What you want is something like this in pseudo code:
ifstream file;
string word;
// open the file
while (true)
{
    file >> word;
    if (file.fail())
        break;

    // update word frequencies
}



Or, if you are more advanced, you can take advantage of the way a stream converts to a boolean when it is used as a condition (the >> operator returns the stream itself), and have something more concise, though it requires more in-depth C++ knowledge from the reader:
ifstream file;
string word;
// open the file
while (file >> word)
{
    // update word frequencies
}



#9 donnie7

Posted 01 September 2012 - 03:33 PM

OK, I've included it in my code, so now it counts both the repeating and the non-repeating words. Next I need to make it display ONLY the 50 most common words. Isn't this map class already an array? If so, I would somehow have to pull the 50 most repeated words out of it. I could do this by first sorting the map 'wordCount' and then outputting ONLY the first 50. Is this even possible with this type of 'map array'? It has both the word and the number stored in it, and I just need the word.

Is there some kind of built-in 'sort' I can use with this map class?

Please share your advice, thanks.

#include <iostream>
#include <fstream>
#include <string>
#include <map>

using namespace std;



int main()
{
    string filename;

    // Get the filename.
    cout << "Enter the file you wish to have searched:\n";
    cin >> filename;

    // Open the file (c_str() because this ifstream constructor takes a C string)
    ifstream file(filename.c_str());

    // Holds each word as it is read
    string word;

    // Map from each word (string) to the number of times it appears (unsigned int)
    map<string, unsigned int> wordCount;

    // Read words until a read fails (normally at the end of the file)
    while (true)
    {
        file >> word;
        if (file.fail())
            break;

        // Inserts the word with a count of zero if it is new, then increments the count
        wordCount[word]++;
    }

}

This post has been edited by donnie7: 01 September 2012 - 03:38 PM


#10 #define

  • Duke of Err

Reputation: 1327
  • Posts: 4,554
  • Joined: 19-February 09

Posted 01 September 2012 - 05:14 PM

Yes, the map is sorted on the words. Once you have read all the data, you need to sort it by the count. A map can't do that itself, because it always stays ordered by its key. You could transfer the data from the map to a list (which has a built-in sort) or a vector (which works with std::sort). There are two pieces of data, a string and an integer, so you will need a struct/class or a pair.

#11 donnie7

Posted 01 September 2012 - 06:29 PM

OK, I stored the key in one vector and the value in another vector. The first vector, which holds the keys, works fine and prints out like a dictionary. However, I am having trouble outputting the second vector, which holds the values. Will this still be okay if I do not pair them and just keep the keys and values in separate vectors?

In other words, will I still be able to sort these properly, and what exactly am I doing wrong in my get_second struct? I think it has something to do with the value being an int and not a string like in my get_first struct.


#include "stdafx.h"
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <fstream>
#include <string>
#include <map>

using namespace std;

// Map type that holds each word (string) and the number of times it appears (unsigned int)
typedef map<std::string, unsigned int> wordCount;
// Global instance of the map
wordCount my_count;

struct get_first : public std::unary_function<wordCount::value_type, string>
{
    string operator()(const wordCount::value_type& value) const
    {
        return value.first;
    }
};

struct get_second : public std::unary_function<wordCount::value_type, unsigned int>
{
    string operator()(const wordCount::value_type& value) const
    {
        return value.second;
    }
};


int main()
{
    string filename;


    // Get the filename.
    cout << "Enter the file you wish to have searched:\n";
    cin >> filename;

    // Open file and store into an array ...c_str()
    ifstream file(filename.c_str());

    // Pulling in the words to be strings
    std::string word;

    // Collects words until the end
    while (true)
    {
        file >> word;
        if (file.fail())
            break;

        // insert a new key-value pair into the map if the word isn't already in there and increment the value
        ++my_count[word];
    }

    // get a vector of the keys (words)
    vector<string> my_key;
    transform(my_count.begin(), my_count.end(), back_inserter(my_key), get_first());

    // get a vector of the values (counts)
    vector<unsigned int> my_value;
    transform(my_count.begin(), my_count.end(), back_inserter(my_value), get_second());

    // below is for output testing
    // dump the keys
    copy(my_key.begin(), my_key.end(), ostream_iterator<string>(cout, "\n"));

    // dump the values
    copy(my_value.begin(), my_value.end(), ostream_iterator<string>(cout, "\n"));
}



This post has been edited by donnie7: 01 September 2012 - 06:30 PM


#12 Skydiver

Posted 01 September 2012 - 07:00 PM

When you start trying to keep two arrays in parallel with each other, it's usually a sign that you need a data structure to hold the elements that those arrays held separately.

Although the ready-made std::pair<> is convenient, I would lean towards defining my own struct.
struct WordFrequency
{
    string Word;
    int Frequency;
};

vector<WordFrequency> wordFrequencies;



It should then be easy to iterate over the map, push_back() WordFrequencies into the vector, and then later call sort().

#13 donnie7

Posted 01 September 2012 - 09:00 PM

Skydiver, on 01 September 2012 - 07:00 PM, said:

When you start trying to keep two arrays in parallel with each other, it's usually a sign that you need a data structure to hold the elements that those arrays held separately.

Although the ready-made std::pair<> is convenient, I would lean towards defining my own struct.
struct WordFrequency
{
    string Word;
    int Frequency;
};

vector<WordFrequency> wordFrequencies;



It should then be easy to iterate over the map, push_back() WordFrequencies into the vector, and then later call sort().


Do you think I should also change the struct I already have for the key, shown below, or is it fine as it is?

 
struct get_first : public std::unary_function<wordCount::value_type, string>
{
    string operator()(const wordCount::value_type& value) const
    {
        return value.first;
    }
};


This post has been edited by donnie7: 01 September 2012 - 09:04 PM


#14 Skydiver

Posted 01 September 2012 - 09:29 PM

You could modify that if you really want to keep the transform call.

The point though is that you want to have a structure that keeps the word and its frequency together.

#15 donnie7

Posted 02 September 2012 - 09:14 AM

Skydiver, on 01 September 2012 - 09:29 PM, said:

You could modify that if you really want to keep the transform call.

The point though is that you want to have a structure that keeps the word and its frequency together.

I agree with you; I think it's good to pair the word and the frequency together. However, I really have no idea how to define my own struct. We never went into this in depth in our C++ class. Is it similar to the ready-made std::pair struct? How do I go about doing this?

This post has been edited by donnie7: 02 September 2012 - 09:23 AM
