9 Replies - 8738 Views - Last Post: 06 October 2012 - 08:25 AM

#1 renz.bagaporo

  • New D.I.C Head

Reputation: 1
  • Posts: 12
  • Joined: 19-July 12

Reading a large text file efficiently in C++

Posted 05 October 2012 - 05:35 PM

So I am building a word game, and I have a text file with a list of English words in it. Naturally, it has thousands of lines.

My current code for reading the file into memory is:
string currentWord = "";
vector <string> wordsDB_mem;
while(!wordsDB.eof())
{
    getline(wordsDB, currentWord);
    wordsDB_mem.push_back(currentWord);
}



It is taking too long. Is there a more efficient approach to my problem?

Is This A Good Question/Topic? 0

Replies To: Reading a large text file efficiently in C++

#2 Skydiver

  • Code herder

Reputation: 3457
  • Posts: 10,665
  • Joined: 05-May 12

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 06:49 PM

First question: Do you really need all the words in memory at the same time?

Second question: If you don't need it all in memory at the same time, why not use a database?

Third question: Are you willing to do your own file parsing instead of using getline()? If yes, consider using a memory mapped file and just having an index of the beginning of each word.

But as a quick answer to your question: one way to speed things up, if you know roughly how many words there are in the file, is to call vector::reserve() with that rough number. It lets the vector allocate space once at the very beginning instead of growing its array several times while loading the words.
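A minimal sketch of that idea (the "words.txt" file name and the 100000 estimate are placeholders, not values from your program):

#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    ifstream wordsDB("words.txt");     // placeholder file name
    vector<string> wordsDB_mem;
    wordsDB_mem.reserve(100000);       // rough guess at the word count

    string currentWord;
    while (getline(wordsDB, currentWord))
        wordsDB_mem.push_back(currentWord);

    return 0;
}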
Was This Post Helpful? 0

#3 renz.bagaporo

  • New D.I.C Head

Reputation: 1
  • Posts: 12
  • Joined: 19-July 12

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 06:52 PM

Skydiver, on 05 October 2012 - 06:47 PM, said:

First question: Do you really need all the words in memory at the same time?

Second question: If you don't need it all in memory at the same time, why not use a database?

Third question: Are you willing to do your own file parsing instead of using getline()? If yes, consider using a memory mapped file and just having an index of the beginning of each word.



1. I figured I'd load it into memory for that same reason: reading the text file is so slow. The next time I need a word from the list, I can use the copy already in memory, which is faster.

2. No database please.

3. Yes, although I am not familiar with the method. Can you please cite some resources where I can read up on the topic?

Thanks.
Was This Post Helpful? 0

#4 Skydiver

  • Code herder

Reputation: 3457
  • Posts: 10,665
  • Joined: 05-May 12

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 07:01 PM

What is your platform? Windows or Linux? They have different APIs.

Or if you want to be platform independent, you can just read the entire file into memory as one big blob. The downside is that you don't get the speed boost the operating system would give you with a memory mapped file. On the other hand, you'll at least get some help from the operating system, since it knows you are reading a large amount of data at once, versus whatever minimal buffering it provides when you read a line at a time.
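A minimal, platform-independent sketch of the "one big blob" approach (the file name is a placeholder); the whole file lands in a single string, and an index of line starts replaces the per-word copies:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    ifstream wordsDB("words.txt", ios::binary);   // placeholder file name

    // Slurp the entire file into one string in a single pass.
    ostringstream blob;
    blob << wordsDB.rdbuf();
    string contents = blob.str();

    // Record where each line begins instead of copying each word.
    vector<size_t> lineStart;
    if (!contents.empty())
        lineStart.push_back(0);
    for (size_t i = 0; i + 1 < contents.size(); ++i)
        if (contents[i] == '\n')
            lineStart.push_back(i + 1);

    return 0;
}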
Was This Post Helpful? 0

#5 Skydiver

  • Code herder

Reputation: 3457
  • Posts: 10,665
  • Joined: 05-May 12

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 07:27 PM

Also consider the cost of your current code:
string currentWord = "";
vector <string> wordsDB_mem;
while(!wordsDB.eof())
{
    getline(wordsDB, currentWord);
    wordsDB_mem.push_back(currentWord);
}



In the getline() call, a naive implementation of string will deallocate the memory of the last word held in currentWord and allocate new memory to hold the new word read in. Even a relatively smart implementation will need to allocate new memory if the new word being read in is longer than the current one.

The push_back() call then creates another instance of string, allocates memory for it, and copies the value from currentWord.

You may be able to get a quick speed boost simply by skipping one allocate-and-copy, changing things very slightly:
vector <string *> wordsDB_mem;
wordsDB_mem.reserve(NumberOfWordsInFile); 
while(!wordsDB.eof())
{
    string * currentWord = new string;    
    getline(wordsDB, *currentWord);
    wordsDB_mem.push_back(currentWord);
}


This post has been edited by Skydiver: 05 October 2012 - 07:29 PM

Was This Post Helpful? 0

#6 Oler1s

  • D.I.C Lover

Reputation: 1395
  • Posts: 3,884
  • Joined: 04-June 09

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 07:34 PM

Does no one see the problem with eof() before getline? (Hint: What do you think eof does? How do you know eof does this?)

This post has been edited by Oler1s: 05 October 2012 - 07:34 PM

Was This Post Helpful? 0

#7 Skydiver

  • Code herder

Reputation: 3457
  • Posts: 10,665
  • Joined: 05-May 12

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 09:41 PM

I was actually waiting for jimblumberg to jump on that one and explain how that will end up doing a double read of the last line of the file. :)
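For anyone following along, here is a sketch of the failure mode: eof() only becomes true after a read has already failed, so the loop body runs one extra time with stale data.

// Broken: after the last successful getline(), eof() is still false,
// so the loop runs once more. That final getline() fails, leaving
// currentWord empty (or, on some old implementations, unchanged),
// and that stale value is pushed into the vector anyway.
while (!wordsDB.eof())
{
    getline(wordsDB, currentWord);      // last iteration: this fails...
    wordsDB_mem.push_back(currentWord); // ...but this still runs
}

// Correct: test the read itself, not eof().
while (getline(wordsDB, currentWord))
    wordsDB_mem.push_back(currentWord);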
Was This Post Helpful? 0

#8 Salem_c

  • void main'ers are DOOMED

Reputation: 1621
  • Posts: 3,079
  • Joined: 30-May 10

Re: Reading a large text file efficiently in C++

Posted 05 October 2012 - 10:02 PM

> It is taking too long. Is there a more efficient approach to my problem?
Define "too long".

Start analysing where the time is really being spent before you go off trying optimisations which can introduce a whole range of bugs.

#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

int main ( ) {
    string currentWord = "";
    ifstream wordsDB("foo.txt");
    vector <string> wordsDB_mem;
    int lines = 0;
    wordsDB_mem.reserve(10000000);              //!! stage 3
    while(getline(wordsDB, currentWord)) {
        wordsDB_mem.push_back(currentWord);     //!! stage 2
        lines++;
    }
    cout << "Number of lines in vector = " << wordsDB_mem.size() << endl;
    cout << "Number of lines counted = " << lines << endl;
    return 0;
}


Step 1a is to find out how long it takes just to read the file - that is, calling getline() and nothing else.
Step 1b is to find out how long it takes to just push_back() a large number of strings into an empty vector.
Step 1c is as 1b, but with an initial size reserved for the vector.
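If you'd rather time the stages from inside the program than with time(1), here's a minimal sketch using C++11 <chrono> (wrap whichever stage you're measuring):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream wordsDB("foo.txt");
    string currentWord;

    chrono::steady_clock::time_point start = chrono::steady_clock::now();
    long lines = 0;
    while (getline(wordsDB, currentWord))   // stage 1a: getline() only
        ++lines;
    chrono::steady_clock::time_point stop = chrono::steady_clock::now();

    chrono::duration<double> elapsed = stop - start;
    cout << lines << " lines in " << elapsed.count() << " s" << endl;
    return 0;
}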

Here are my combined results.
$ wc foo.txt
  4640000  22080000 113410000 foo.txt

# getline()
$ g++ foo.cpp
$ time ./a.out 
Number of lines counted = 4640000

real	0m0.197s
user	0m0.176s
sys	0m0.020s

# getline() + vector
$ g++ foo.cpp
$ time ./a.out 
Number of lines in vector = 4640000
Number of lines counted = 4640000

real	0m0.901s
user	0m0.788s
sys	0m0.108s

# getline() + vector + reserve
$ g++ foo.cpp
$ time ./a.out 
Number of lines in vector = 4640000
Number of lines counted = 4640000

real	0m0.689s
user	0m0.592s
sys	0m0.088s



Now if just getline() in an empty loop is "too slow", then anything else you're doing is just rearranging deckchairs on the Titanic.

Other "gotchas":
Your OS and file system are not idle participants either. An "improvement" in your code could be down to the file being cached somewhere (hey, you're reading it a lot - better start caching it).
Was This Post Helpful? 3

#9 baavgai

  • Dreaming Coder

Reputation: 5776
  • Posts: 12,590
  • Joined: 16-October 07

Re: Reading a large text file efficiently in C++

Posted 06 October 2012 - 01:52 AM

renz.bagaporo, on 05 October 2012 - 09:52 PM, said:

The next time I need a word from the list, I can use the copy already in memory, which is faster.


Do you need a word from the list as part of a running process or are you pausing for anything, like user interaction? How often does "next time" come up?

Yes, memory IO is considerably faster than file IO, but it is relative. Pulling a random value from a very large file can be optimized. Loading that entire file into memory, ultimately, can't.
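To illustrate, here's a sketch of one way to pull a random word straight from a big file without loading it (the file name is a placeholder, and the pick is slightly biased toward words that follow long lines): seek to a random byte, skip the partial line you land in, and read the next full one.

#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream wordsDB("words.txt", ios::binary);   // placeholder file name

    wordsDB.seekg(0, ios::end);
    long fileSize = static_cast<long>(wordsDB.tellg());

    srand(static_cast<unsigned>(time(0)));
    wordsDB.seekg(rand() % fileSize);   // jump somewhere in the middle

    string word;
    getline(wordsDB, word);             // throw away the partial line
    if (!getline(wordsDB, word))        // landed near EOF? wrap to the start
    {
        wordsDB.clear();
        wordsDB.seekg(0);
        getline(wordsDB, word);
    }
    cout << word << endl;
    return 0;
}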
Was This Post Helpful? 0

#10 jimblumberg

Reputation: 3988
  • Posts: 12,303
  • Joined: 25-December 09

Re: Reading a large text file efficiently in C++

Posted 06 October 2012 - 08:25 AM

I myself would use something like:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
   ifstream wordsDB("../wordlist/2of12inf.txt");
   if(!wordsDB)
      return(1);
   //string currentWord;
   size_t vecSize = 1000;
   size_t increment = 1000;
   vector <string> wordsDB_mem(vecSize);

   size_t i = 0;
   while(getline(wordsDB, wordsDB_mem[i++]))
   {
      if(i == vecSize-1)
      {
         vecSize += increment;
         wordsDB_mem.resize(vecSize);
      }

   }
   // Remove the extra blank elements. Note that the final, failed
   // getline() still incremented i once, hence i - 1.
   wordsDB_mem.resize(i - 1);

   return(0);
}


This would eliminate the "extra" string operation and drastically cut the number of vector reallocations without drastically increasing the size of the vector. This version took about 0.052 seconds to read the file of 81537 strings on my machine.

Quote

I was actually waiting for jimblumberg to jump on that one and explain how that will end up doing a double read of the last line of the file.

Don't you think I ever sleep??? Or have other amusements???

Jim

This post has been edited by jimblumberg: 06 October 2012 - 08:31 AM

Was This Post Helpful? 1
