how to read thousand files to get their bits?


25 Replies - 1624 Views - Last Post: 04 June 2012 - 08:21 AM

#1 TedOla

Posted 01 June 2012 - 08:45 AM

Dear All

Please share your thoughts and ideas!!!
I have thousands of ASCII text files which I have to parse to get the bits out of them all
(e.g. from the 1st bit to the 888th bit). Is there any way of dealing with this many files for analysis?

void main(int argc, char* argv[]){

			char * buffer;	
			int index;

	for(int ind=1; ind<argc; ind++){

			  ifstream is (argv[ind] );
			  is.seekg (0, ios::beg);
			  buffer = new char [100];
			  is.read (buffer,100);

		switch(ind){

			case 1:
			    show(buffer,100);
                                
			    delete[] buffer;
				break;
			case 2:
			    show1(buffer,100);
			    delete[] buffer;
				break;
  ........................................
  ........................................
  ........................................
                        case 1000
         		    show999(buffer,100);
			    delete[] buffer;
			    break;
 
                 }
}
void show(){

    Turning characters into bits of 1st file and push_back into a vector.
}
.......
.......
void show999(){

    Turning characters into bits of 999st file and push_back into a new vector.
}





I used above to push bits of each file into a vector, but it requires thousands of vectors to be defined.
I had to write thousands of lines of code, which is bad coding.
Any hints on reading and dealing with such a large number of files would be very helpful. Please shed some light on it!

Please be advised this is NOT homework or any project.
Thanks in advance for your attention.


Replies To: how to read thousand files to get their bits?

#2 Salem_c

Posted 01 June 2012 - 09:08 AM

So why can't you define std::vector< std::vector< someType > > allMyData;

> Please be advised this is NOT homework or any project.
I fail to see your motivation for doing this then.
Unless you mean it's your "job", which is perhaps the worst thing you could say.

#3 TedOla

Posted 01 June 2012 - 09:32 AM

My first try here, how could sb make unreleased command without even knowing prom...

#4 jimblumberg

Posted 01 June 2012 - 09:55 AM

Quote

My first try here, how could sb make unreleased command without even knowing prom...

What??? You need to ask questions that make sense.

Quote

I used above to push bits of each file into a vector, but it requires thousands of vectors to be defined.

Why would you need thousands of vectors? Why would you need thousands of lines of code?

Why are you using C-strings instead of std::string?

For any reasonable answers you will need to provide more detail. What do your files look like? What information are you trying to extract from these files?

You will also need to show some actual code, not just a bunch of pseudo code.

Jim

#5 Skydiver

Posted 01 June 2012 - 10:52 AM

jimblumberg, on 01 June 2012 - 09:55 AM, said:

Why are you using C-strings instead of std::string?


I didn't see any C strings in his original post other than using argv[ind]. Sure there was a new char[100], but it is not null terminated like a C string, because he fills it using ifstream::read(). So buffer is just a buffer and not a C string.

Is there another set of lines you are referring to?

TedOla, on 01 June 2012 - 08:45 AM, said:

I have thousands of ASCII text files which I have to parse to get the bits out of them all
(e.g. from the 1st bit to the 888th bit).


You'll have problems getting to bits 801 through 888 since your buffer is only 100 bytes big and you are only reading 100 bytes.
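To make the arithmetic concrete: 100 bytes hold only bits 1 through 800, while 888 bits need 111 bytes. A minimal sketch, assuming bits are numbered from 1 with the most significant bit of each byte first; `bytesForBits` and `bitAt` are hypothetical helper names, not something from the original post.

```cpp
#include <cstddef>

// Number of whole bytes needed to hold `bitCount` bits (rounding up).
std::size_t bytesForBits(std::size_t bitCount)
{
    return (bitCount + 7) / 8;
}

// Value of the bit at 1-based position `n` in `buffer`, counting the most
// significant bit of each byte first. The caller must make sure the buffer
// holds at least bytesForBits(n) bytes.
bool bitAt(const char* buffer, std::size_t n)
{
    std::size_t byteIndex = (n - 1) / 8;
    unsigned mask = 0x80u >> ((n - 1) % 8);
    return (static_cast<unsigned char>(buffer[byteIndex]) & mask) != 0;
}
```

So `bytesForBits(888)` is 111, which is the buffer size the original code would need instead of 100.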

#6 Skydiver

Posted 01 June 2012 - 10:59 AM

Salem_c, on 01 June 2012 - 09:08 AM, said:

So why can't you define std::vector< std::vector< someType > > allMyData;

> Please be advised this is NOT homework or any project.
I fail to see your motivation for doing this then.
Unless you mean it's your "job", which is perhaps the worst thing you could say.


He could be scanning his MP3, photo collection, or other files for dupes. But then I would classify it as a home or pet project.

Anyway, good advice on the vector of vectors. In particular, one approach could be to use the sometimes maligned vector<bool>, like:
std::vector< std::vector<bool> > fileBits;

On the other hand, if all he needs are the raw bytes, why not just memory-map the first 111 bytes of each file, if he needs all the bits accessible all the time? Otherwise, wouldn't it be better to analyze each file as it comes along?
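For the vector-of-vectors route, a minimal sketch of how `fileBits` might be filled, one row per file. The helper name `loadFileBits` and its parameters are hypothetical; it reads with a plain `ifstream` rather than memory mapping, and it stores the most significant bit of each byte first.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads up to `maxBytes` bytes from each file and stores
// its bits (most significant bit of each byte first) as one row per file.
// A file that cannot be opened, or is empty, contributes an empty row.
std::vector< std::vector<bool> > loadFileBits(
    const std::vector<std::string>& filenames, std::size_t maxBytes)
{
    std::vector< std::vector<bool> > fileBits;
    std::vector<char> buffer(maxBytes);

    for (std::size_t f = 0; f < filenames.size(); ++f)
    {
        std::vector<bool> bits;
        std::ifstream ifs(filenames[f].c_str(), std::ios::binary);
        ifs.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::size_t bytesRead = static_cast<std::size_t>(ifs.gcount());

        for (std::size_t i = 0; i < bytesRead; ++i)
        {
            unsigned char byte = static_cast<unsigned char>(buffer[i]);
            for (int shift = 7; shift >= 0; --shift)
                bits.push_back(((byte >> shift) & 1) != 0);
        }
        fileBits.push_back(bits);
    }
    return fileBits;
}
```

Calling `loadFileBits(names, 111)` would keep the first 888 bits of every file in memory, which only pays off if the analysis really needs all of them at once.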

#7 Salem_c

Posted 01 June 2012 - 11:12 AM

To answer the OP (who decided to engage in private messaging), there are three kinds of 'work'.

A "project" is something you choose to do for yourself, for whatever reason (enlightenment, curiosity, hope of being paid (see "job") or whatever).

A "job" is where you're in gainful employment to implement some bit of s/w against some requirement, and the reward is being paid (and continued employment).

A "homework" is a test of knowledge acquired (as set by your tutor), and the reward is a grade (and continued attendance on the course).

#8 TedOla

Posted 01 June 2012 - 06:37 PM

It is part of my research: I need to do a probability analysis of the occurrence of '0' at certain positions across the thousand files. So the questions are (1st) how to read a large number of text files, and (2nd) what are the ways of getting all the bits of these thousand files for the analysis?

From my unexperience of coding skills, I feel that
int main(int argc, const char* argv[]) {

    for (int i = 0; i < argc; ++i) {
        std::cout << argv[i] << std::endl;
    }
    std::cin.get();
    return 0;
}



is NOT efficient way of reading thousand-file, thus, I am kindly begging for shedding some light on the prom.
Thanks for reading this, and please share your ideas..

#9 Skydiver

Posted 01 June 2012 - 07:38 PM

If it is only particular bits in all the files, it may make more sense to just seek to that byte offset, read the byte, extract the bit, and then increment your counter. There is no need to store a vector of bit vectors.
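A minimal sketch of that seek-and-count approach. The helper name `countZeroBitAt` and its parameters are hypothetical, not from the thread; it assumes 0-based, non-negative bit positions with the most significant bit of each byte first, and it skips files that are too short to contain the requested bit.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts how many files have a 0 at 0-based bit position
// `bitIndex` (most significant bit of each byte first). Files that cannot be
// opened, or are too short to contain that bit, are skipped; `filesSampled`
// reports how many files were actually counted.
int countZeroBitAt(const std::vector<std::string>& filenames,
                   long bitIndex, int& filesSampled)
{
    int zeroCount = 0;
    filesSampled = 0;
    const unsigned mask = 0x80u >> (bitIndex % 8);

    for (std::size_t f = 0; f < filenames.size(); ++f)
    {
        std::ifstream ifs(filenames[f].c_str(), std::ios::binary);
        ifs.seekg(bitIndex / 8, std::ios::beg);   // jump straight to the byte
        char byte = 0;
        if (!ifs.read(&byte, 1))
            continue;                             // unreadable or too short
        ++filesSampled;
        if ((static_cast<unsigned char>(byte) & mask) == 0)
            ++zeroCount;
    }
    return zeroCount;
}
```

The ratio `zeroCount / filesSampled` is then the fraction of sampled files with a 0 at that position, and only one byte per file is ever read.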




#10 David W

Posted 01 June 2012 - 08:53 PM

Further to Skydiver's great answers ...

You could create/use a text file to hold all of your 1000s of file names ... that you wish to process (one name per line).

Then read that file into a vector (or list) of C++ strings.

Then traverse that vector of strings to access the name of each of the files to sample ... as per Skydiver's suggestion.
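The file-of-file-names idea can be sketched as below. The helper name `readFileList` is hypothetical; the sketch assumes one name per line, skips blank lines, and tolerates Windows line endings.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one file name per line from `listPath`,
// skipping blank lines. Returns an empty vector if the list cannot be opened.
std::vector<std::string> readFileList(const std::string& listPath)
{
    std::vector<std::string> names;
    std::ifstream list(listPath.c_str());
    std::string line;

    while (std::getline(list, line))
    {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);   // drop the CR of a CRLF line ending
        if (!line.empty())
            names.push_back(line);
    }
    return names;
}
```

The returned vector can then be traversed with an ordinary loop, opening each named file in turn.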

This post has been edited by David W: 01 June 2012 - 09:01 PM


#11 Salem_c

Posted 01 June 2012 - 09:21 PM

> From my unexperience of coding skills, I feel that <code> is NOT efficient way of reading thousand-file
Well you could also do
- opendir
- while( readdir() )
- closedir
to read all the filenames in a directory, and then process those filenames.

But doing that means you're just replacing what your command line shell does when you type in
myprog *.txt
Marginally 'quicker' for the shell to start your program, but extra work for you.


The real "inefficiency" comes from
- open
- while ( read() )
- close
on every single filename you want to process.

Stop worrying about the filenames and start focusing on the problem of what you need to do with the data in each file.

#12 Skydiver

Posted 01 June 2012 - 09:45 PM

Once you figure out what you want to do with the files, there are always options to have multiple threads and/or processes going since I/O is slow. All that is for later.

So do you really need all of the first 888 bits of your set of files to sample, or are there only particular bits of those 888 that you are concerned about?

Let me tell you right now that if you are only interested in the MSB of each byte, and all your files are UTF-8 encoded text files created in the United States, I'm willing to bet $100 that over 98% of them will have the value of zero in that bit position.
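The reason behind that bet: ASCII is a 7-bit code (values 0 to 127), so the top bit of every ASCII byte is 0, and UTF-8 only sets that bit in the bytes of non-ASCII characters. A small sketch to check a sample of text; the helper name `countHighBitBytes` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes in `text` whose most significant bit
// (0x80) is set. For pure 7-bit ASCII text the result is always 0.
std::size_t countHighBitBytes(const std::string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
        if (static_cast<unsigned char>(text[i]) & 0x80)
            ++count;
    return count;
}
```

If this returns 0 for the contents of the files, every eighth column of the bit table is all zeros before any analysis is done.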

#13 TedOla

Posted 01 June 2012 - 10:35 PM

Please be a little patient with my lack of skill. Here is the idea (sorry, I should have mentioned it earlier).
01011010100101010100101101010010000000000111111110101 (1st file bits)
000111110101010101001010000000000100101000001110101
010110101001010101010100101000000000000000111111110101
10101001010001010110101000000000000111111110101

............

00101001000010100101000000000000000111111110101 (1000th file bits)
I should calculate k/1000 (k: the number of '0's in each column), going through each column, say up to the 8888th column.

Solution:
I can push the bits of each line, which represents a file, into a vector, but then a thousand vectors are needed. How can I fix this problem?
Sorry for taking so much of your time; please help further!

#14 Skydiver

Posted 01 June 2012 - 11:22 PM

If that is all you need, you don't need thousands of vectors. You just need one vector of 8888 ints.

You can simply do the following:
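#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

const int c_maxBitsToSample = 8888;

// Reads the start of one file and hands its bits back one at a time,
// most significant bit of each byte first.
class BitReader
{
public:
    BitReader(string filename)
    {
        int bufferSize = c_maxBitsToSample / 8 + 1;
        m_buffer = new char[bufferSize];
        ifstream ifs(filename);
        ifs.read(m_buffer, bufferSize);
        m_size = static_cast<int>(ifs.gcount());    // bytes actually read; 0 if the open failed
        m_byteIndex = 0;
        m_bitmask = 0x80;
    }

    ~BitReader()
    {
        delete [] m_buffer;
    }

    // Stores the next bit in *bitRead; returns false once the buffer is used up.
    bool GetNextBit(bool * bitRead)
    {
        if (m_byteIndex >= m_size)
            return false;

        *bitRead = !!(m_buffer[m_byteIndex] & m_bitmask);
        m_bitmask >>= 1;
        if (!m_bitmask)
        {
            m_bitmask = 0x80;
            m_byteIndex++;
        }
        return true;
    }
private:
    char * m_buffer;
    int m_size;
    int m_byteIndex;
    unsigned char m_bitmask;    // unsigned, so >>= 1 shifts in zeros and the mask reaches 0
};

int main(int argc, char * argv[])
{
    vector<int> count0Bit(c_maxBitsToSample);
    int fileCount = 0;

    // The files to be sampled are the ones named on the command line.
    for(int f = 1; f < argc; ++f)
    {
        BitReader reader(argv[f]);
        fileCount++;
        for(int i = 0; i < c_maxBitsToSample; ++i)
        {
            bool bit = true;

            if (!reader.GetNextBit(&bit))
                break;    // We've hit the end of the file

            if (!bit)
                count0Bit[i]++;
        }
    }

    for(int i = 0; i < c_maxBitsToSample; ++i)
        cout << "k/" << fileCount << " for bit position " << i << ": " << count0Bit[i] / (float) fileCount << endl;

    return 0;
}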


This post has been edited by Skydiver: 01 June 2012 - 11:35 PM


#15 David W

Posted 02 June 2012 - 02:03 PM

Skydiver, on 02 June 2012 - 02:22 AM, said:

If that is all you need, you don't need thousands of vectors. You just need one vector of 8888 ints.


Nice example of bit reader...
but could you explain why you use a double not '!!' above?
Why not pass by reference in:
bool GetNextBit(bool * bitRead)
I think you meant to code
ifstream ifs( filename.c_str() ); // not: ifstream ifs(filename);
Or ... Maybe better to have coded ...
BitReader(const char* filename) // instead of: BitReader(string filename)

And Salem_c's idea of ...
myprog *.txt
on the command line or in a batch file is great ... Much better than creating a file of file names.
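On the two style questions above: `!!x` is a compact way of turning any nonzero value into `true` and zero into `false`, and a reference parameter would let callers write `GetNextBit(bit)` rather than `GetNextBit(&bit)`. A minimal sketch of the by-reference form, using a hypothetical free function `readBit` rather than the actual class member:

```cpp
// Hypothetical stand-in for the body of GetNextBit, for a single byte:
// stores in `bitRead` whether the bit selected by `mask` is set in `byte`.
void readBit(char byte, unsigned char mask, bool& bitRead)
{
    bitRead = (byte & mask) != 0;   // same result as !!(byte & mask)
}
```

A caller would write `bool bit; readBit('A', 0x40, bit);` with no address-of operator and no possibility of passing a null pointer.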
