Word sorting program


140 Replies - 6456 Views - Last Post: 04 November 2013 - 03:14 PM

#1 lewm

  • D.I.C Head

Word sorting program

Posted 28 August 2013 - 11:56 AM

Hi guys, I have made a program that reads from a file (Words), then checks through it, writing every word except duplicates to a new file (SortedWords). It seems to work fine for what I wanted; the only problem is that it's very slow, and I have 200,000 words.

How can I make it faster? I thought about splitting the file in half and running it so it reads from two files at once, then writes the results to one file.

Any other ideas?
#include <stdio.h>
#include <string.h>

void erase_duplicates(void);

int main(void)
{
    erase_duplicates();
    return 0;
}

void erase_duplicates(void)
{
    FILE *p, *po;
    char w[50]={0}, ww[50]={0};
    long int i=0, j=0, duplicate_word=0;

    po=fopen("SortedWords","w");   /* truncate any previous output */
    fclose(po);
    p=fopen("Words","r");
    fscanf(p, "%49s", w);          /* an array decays to char*, so pass w, not &w */
    i=1;
    po=fopen("SortedWords","a");
    do
    {
        duplicate_word=0;
        do
        {
            fscanf(p, "%49s", ww);
            if(strcmp(w,ww)==0)
            {
                duplicate_word=1;
            }
        }while(ww[0]!='*' && duplicate_word!=1);   /* '*' marks the end of the list */
        i++;
        if(!duplicate_word)
        {
            fprintf(po, "%s\n",w);
        }
        fclose(p);                 /* reopen and re-scan to get back to word i */
        p=fopen("Words","r");
        for(j=0;j<i;j++)
        {
            fscanf(p, "%49s", w);
        }
    }while(w[0]!='*');
    fprintf(po, "*");
    fclose(p);
    fclose(po);
}



Replies To: Word sorting program

#2 jjl

  • Engineer

Re: Word sorting program

Posted 28 August 2013 - 12:04 PM

Quote

How can i make it faster? I thought about splitting the file in half and then running it so it reads from 2 files at once and then writes the results to one file.


Files are sequentially accessed; you cannot read two areas of one file simultaneously through a single stream.

Your performance is bad because of the linear lookup for duplicate words. Your algorithm's big-O complexity is currently O(M * N), where M is the number of words in the file and N is the size of the duplicate lookup table.

A hash table provides constant lookup time and would drop the complexity to O(M), so I recommend using one for your lookup table. However, since you are using C (and not C++), you will have to find a hash table implementation externally or write one yourself.

http://en.wikipedia....wiki/Hash_table

This post has been edited by jjl: 28 August 2013 - 12:07 PM


#3 Skydiver

  • Code herder

Re: Word sorting program

Posted 28 August 2013 - 12:06 PM

It is slow because you keep closing and reopening files, as well as doing multiple scans through your source file. Although you could get a speed boost by using memory-mapped files, that still doesn't fix the fact that your algorithm is just plain inefficient.

If you have enough memory, I suggest reading in all the words, sorting them, and then writing them out while skipping duplicates. If you don't have enough memory, it gets more complicated: you can split the file into chunks that do fit in memory and sort each chunk. Once all the chunks are sorted, perform a merge sort, and while merging, only accept the unique words.

#4 jimblumberg

Re: Word sorting program

Posted 28 August 2013 - 12:07 PM

One thing that would really help increase the speed would be to quit continually opening and closing your files in the loop.

Next, naming your variables with meaningful names would help outsiders trying to follow your logic.

Finally, the following snippet seems like a real time waster:

    do
    {
...
        fclose(p);
        p=fopen("Words","r");
        for(j=0;j<i;j++)
        {
            fscanf(p, "%49s", w);
        }


Jim

#5 vividexstance

  • D.I.C Lover

Re: Word sorting program

Posted 28 August 2013 - 12:08 PM

There's more than a few ways to improve this code. First, I would recommend better names for the variables, and you should check the return values of those file-stream calls; fopen(), for example, returns NULL if the file wasn't opened and sets errno to the appropriate value.

To improve the program's efficiency, you could read the file's contents into memory with one function call and then work with what you have in memory. I think this would definitely speed things up, because you would no longer be making repeated calls to read from the file, which take time.

#6 lewm

  • D.I.C Head

Re: Word sorting program

Posted 28 August 2013 - 12:11 PM

View Postjjl, on 28 August 2013 - 12:04 PM, said:

Quote

How can i make it faster? I thought about splitting the file in half and then running it so it reads from 2 files at once and then writes the results to one file.


Files are sequentially accessed; you cannot read two areas of one file simultaneously through a single stream.

Your performance is bad because of the linear lookup for duplicate words. Your algorithm's big-O complexity is currently O(M * N), where M is the number of words in the file and N is the size of the duplicate lookup table.

A hash table provides constant lookup time and would drop the complexity to O(M), so I recommend using one for your lookup table. However, since you are using C (and not C++), you will have to find a hash table implementation externally or write one yourself.

Thank you, I know you can't read from the same file at the same time; that's why I said I thought about splitting the file into two separate ones. The rest of what you said has gone right over my head, so I'll have to look into it.

View Postjimblumberg, on 28 August 2013 - 12:07 PM, said:

One thing that would really help increase the speed would be to quit continually opening and closing your files in the loop.

Next, naming your variables with meaningful names would help outsiders trying to follow your logic.

Finally, the following snippet seems like a real time waster:

    do
    {
...
        fclose(p);
        p=fopen("Words","r");
        for(j=0;j<i;j++)
        {
            fscanf(p, "%49s", w);
        }


Jim

I keep closing the file and opening it to go back to the start of the file. Have a closer look at what I've tried to do. What you said makes sense though, thanks.

#7 vividexstance

  • D.I.C Lover

Re: Word sorting program

Posted 28 August 2013 - 12:13 PM

View Postlewm, on 28 August 2013 - 03:11 PM, said:

I keep closing the file and opening it to go back to the start of the file. Have a closer look at what I've tried to do. What you said makes sense though, thanks.

This is unnecessary; have a look at this page: fseek is your friend. That function lets you move around within the file.

#8 jjl

  • Engineer

Re: Word sorting program

Posted 28 August 2013 - 12:19 PM

I didn't see that you were reopening files to reset the file pointers. I suggest reading the file into main memory (an array of characters) before processing. Disk I/O is hurting your performance, and you are doing it for roughly the entire run.

#9 Skydiver

  • Code herder

Re: Word sorting program

Posted 28 August 2013 - 12:20 PM

View Postlewm, on 28 August 2013 - 03:11 PM, said:

Thank you, i know you cant read from the same file at the same time, thats why i said i thought about splitting the file into two separate ones. The rest of what you said has gone right over my head, ill have to look into what you just said.


Actually, you can. It's a matter of opening the file so that its locking mode is set up for shared reads; then you can have more than one file pointer into the same file. I didn't suggest going this route in my earlier post, though, because all it would do is trim away the time to reopen the file, while doing nothing to improve the algorithm overall.

#10 jjl

  • Engineer

Re: Word sorting program

Posted 28 August 2013 - 12:23 PM

I don't think you would see a noticeable performance improvement over reading all of the file data into a chunk of memory beforehand. But you can always profile your code with gprof and see where the real bottleneck is.

#11 lewm

  • D.I.C Head

Re: Word sorting program

Posted 28 August 2013 - 12:25 PM

View Postjjl, on 28 August 2013 - 12:23 PM, said:

I don't think you would see a noticeable performance improvement over reading all of the file data into a chunk of memory beforehand. But you can always profile your code with gprof and see where the real bottleneck is.

gprof?

#12 jimblumberg

Re: Word sorting program

Posted 28 August 2013 - 12:27 PM

Not only is it unnecessary to keep opening and closing the file, it is also very error prone. Just closing and reopening the file doesn't reset any stream error states that one of your file operations may have caused.

Also, continually re-reading the data from the beginning is probably the biggest slowdown in your code.

You have two files; are both of them sorted? If not, sorting them would probably speed up your program.

Are both files the same size? If not, what are their sizes?

With a little care you should be able to do this in memory, instead of doing disk-based comparisons.

Jim

#13 vividexstance

  • D.I.C Lover

Re: Word sorting program

Posted 28 August 2013 - 12:29 PM

View Postjjl, on 28 August 2013 - 03:23 PM, said:

I don't think you would see a noticeable performance improvement over reading all of the file data into a chunk of memory beforehand. But you can always profile your code with gprof and see where the real bottleneck is.

Who was this directed at? And why did you change your mind so quickly from this post:

jjl said:

I didn't see that you were reopening files to reset the file pointers. I suggest reading the file into main memory (an array of characters) before processing. Disk I/O is hurting your performance, and you are doing it for roughly the entire run.


#14 lewm

  • D.I.C Head

Re: Word sorting program

Posted 28 August 2013 - 12:32 PM

Thanks guys, you've helped me loads on this one. I need to re-write it; I'll post my new version soon for you guys to pull apart :).

This post has been edited by lewm: 28 August 2013 - 12:36 PM


#15 jjl

  • Engineer

Re: Word sorting program

Posted 28 August 2013 - 01:47 PM

Quote

Who was this directed to? And why did you change your mind so quick from this post:

Skydiver... I was saying that shared reads would not give a huge performance gain over loading the file into a single buffer, but I think he was just letting me know that it is possible.

I didn't realize he was reopening the files to iterate over them again; I thought it was a single pass, in which case reading them into memory would not increase performance. Now that I know he is making several passes over the file, I recommend loading it into memory before processing.

This post has been edited by jjl: 28 August 2013 - 01:48 PM

