Page 1 of 1

External Sorting with C Rate Topic: -----

#1 Adak  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 331
  • View blog
  • Posts: 1,168
  • Joined: 01-April 11

Posted 30 September 2013 - 04:22 AM

Caution
=======

This is NOT a program you should use for external
sorting. It is study material, about the subject of
external sorting.

This information and program, studies issues related
ONLY to string sorting. Code relating to error handling,
and other subjects required by a robust program, have
been left out, to keep the focus on the subject of
external string sorting.

I'm not stating "Use this program at your own risk".
I'm stating "Don't use this program!". Study it, then
go code up your own external sorting program.


Introduction
============

Sorting a very large amount of data, that can't fit
into your memory at one time, is called external sorting.

It's a classic divide and conquer problem. You can sort
up parts of the data, putting these parts into files, and
then merging them together.

The normal sort and merge design for the data, is:
============================================================
|                                                          |
|       <<<          BIG UNSORTED DATA           >>>       |
|                                                          |
============================================================
 Box files: [#1]     [#2]     [#3]  [#4]  [#5] 
              \        /        \     |    /          
               \      /          \    |   /          
                \    /            \   |  /          
                 \  /              \  | /          
                 Temp               Temp 
                  |                   |
                  |                   |  
              Renamed to #1       Renamed to #2 
                   \                 /
                    \               /
                     \             /
                      \           /
                       \         /
                        \       /
                         \     /
                          \   /
                          Temp
                            |
                            |
                 Renamed to Final File Name


The small numbered boxes above, represent the small data files
you will create one at a time, until all the data is exhausted.
Each of these "box" files will then be merged together, level
by level, until all the data is merged.

Box #5, being the "orphan", can be handled by merging it with
Temp #2, the last Temp file, at the next level up. A better way
is to notice that you have an odd number of files at this level,
and immediately merge the orphan with one of the other box files.

All the sorting is done when the sorted "box" files, have been
written out. After that, it's all merging.

It's important to keep the original data file, and the Final file.
The other files it creates should be removed, as soon as their
data has been merged to a high level. If not, you'll have yet
another complete copy of the data, at every level of your program.


Sorting Algorithms
==================

When you have this much data, you want a top notch sorting
algorithm. Forget your Bubble, Selection, or Insertion sort.

Popular algorithms are: Quicksort, Merge sort, Radix,
Heap sort, and hybrid types (Introsort, etc.). More "exotic"
algorithms exist, but are best avoided. In my opinion, none
of the exotics, even the "cache oblivious" ones, have enough
advantages to recommend them.

I used Quicksort for this example, because it's very fast,
and I felt comfortable working with it.


Hardware
========

External sorting requires the data to be moved around a great
deal. Your data is completely moved or copied:

1) Once when the data is sorted into the box files.

2) Whenever another level of Temp files are created.

3) When the Temp files are merged into the final file.

If you have multiple drives, and can use one for input, and the
second drive for output, that's a winner. SSD drives and disk
arrays offer other options, that might cut your I/O times. If
you can use either of these options, you should consider it.

Test your hardware and sotware, with a sample of your data,
using your program. You can learn a lot in just a short test.

For cpu's, a new 12 core cpu running at 3GHz, will run no faster
than a 2 core cpu running at 3GHz, on a single process sorting
program. More cores won't mean more speed, unless your program
will use those other cores, in parallel.

This program will use a single core. I have a simple rule that
explains why, which I'll share later.


Merging
=======
============================================================
|                                                          |
|       <<<          BIG UNSORTED DATA           >>>       |
|                                                          |
============================================================
Box files     [#1]  [#2]  [#3]  [#4]  [#5]     
                |     |     |     |    |
             =============================
             | #1 through #5 FIFO queues | 
             =============================
                |     |    |     |     |
             ============================= 
             |   Next value in all box   | 
             |   files is compared here. |  
             |   The lowest value moves  | 
             |   to the Final File.      | 
             ============================= 
                          |           
                          |           
                          |       
                     Final File


The above is a KWay merge diagram. The difference is that there
are no Temp files. Each box file is opened, and the least value
in the five box files (the first value since the box files have
already been sorted), is written into the final file.

This looks like such an obvious winner, that you wonder why it's
not more common.***

Consider:

* To avoid continual disk seeks from the files, a buffer is
needed between the box file, and the comparison (the FIFO queue),
which are filled only when they are nearly empty.**

* Each queue needs to fit inside the drive buffer, optimally. If
you have five box files, it's pretty easy to do that. Box files
however, are usually much more numerous - might be in the hundreds
or thousands. If it's flushing out the drive buffer, KWay's
smart design, will have a lackluster reality.

The inner loop of the standard "two-into-one" merge, is very small
and fast. You get the full benefit of a large and fast drive buffer,
using the standard merge algorithm.

** Generally accepted, but may not be true with modern hard drives -
definitely not true with SSD type drives.

*** Also explained later by my simple rule.


Explaining the Critical Code
============================

This loop is in main(), and creates all the box files. After this,
there is no more sorting to be done - just merging.
   do {
      call loadData(fp,data),  //to load BOXSIZE strings of data, 
      returns rowsloaded
   
      quicksort(data, 0, rowsloaded-1) //sort BOXSIZE values
        
      printPackets(data, rowsloaded)  //output one box of sorted strings
                                      //into a numbered box file
                                      
      if rowsloaded>0, increment boxNum //increment # of boxes

   }while rowsloaded=BOXSIZE  

   fclose(data file)
   print Number of box files created: boxNum


Now the merging begins.

In main(), right after the above code creates the box files,
the function to merge 2 files, into 1, is called.

merge2Into1(data, boxNum)

In function merge2Into1:

Variables:

three FILE pointers, two for input, and one for output,
four integers: tier=0,go1=1,go2=1,and orphan=0

char filename1[],filename2[],oldFilename[] sized
as you prefer. I used 25. s1[31], and s2[31]
unsigned long fileNum and lowFileNum set to zero

holds the number of files created at each level
static int files[1000]={0};

The inner loop merges one level's files together.
The outer loop keeps that going, until all the
levels have been merged together.
   do {
      do {
         take the lowFileNum, convert it to a string
         and put it into filename1
      
         repeat with lowFileNum+1, into filename2
         (I used _ultoa() for this)

         Open three files,filename1,filename2,and temp.
         Two for reading, and temp for writing
      
         Minimal error message for fp1-exit(1) on error
       
         if fp2 equal NULL - it's normal to get this with an odd or orphan file
             orphan set to 1
             copy oldFilename into filename2
             open filename2 with fp2
             still an error?  //it's a real error then, exit.
                exit(1)
      
         fpOut, with normal message if error, and exit(1).

         go1=fscanf fp1, s1      //get a value from the first file
         go2=fscanf fp2, s2      //and one from the second file
         
         //this is the major mergine loop
         while go1>0 && go2>0    //check for successful fscanfs
            if strcmp s1,s2 > 0  //and compare them
               fprintf fpOut s2  //if s1 is > output s2
               go2=fscanf fp2 s2 //and get the next value for s2 
            else                
               fprintf fpOut s1  //s1 is < s2, so output s1
               go1=fscanf fp1 s1 //and get the next value for s1
            end else
         end while
         
         //and this block of code handles the run outs
         //when one file is out of data
         if go1>0 && go2<1       //first file was empty
            while go1>0          //while first file has data
               fprintf fpOut s1  //output s1
               go1=fscanf fp1 s1 //get a new s1
            end while
         else if go2>0 && go1<1  //only second file has data
            while go2>0          //while it has data
               fprintf fpOut s2  //output that data
               go2=fscanf fp2 s2 //and get the next value
            end while
         end else         

         fclose fp1,fp2,fpOut    //close all three files
         remove filename1 and filename2 //delete first and second files
         if orphan                      //if there's an orphan file
            rename "temp.dat" filename2 //adjust the filenames for it
         else
            _ultoa fileNum filename1    //create new filename for first file
            rename "temp.dat" filename1 
            strcpy oldFilename, filename1 //from a saved old filename
         end else

         ++fileNum                     
         lowFileNum+=2           //file numbering goes 0&1 into 0, 
                                 //2&3 into 1, 3&4 into 2, etc.
         //file names (all numbers), are reused at each level

      }while lowFileNum less than files[tier] 
      //while not the end of one merge level, loop

      if orphan                    //test for orphans
         --fileNum                 //orphan file is gone, don't  
         set orphan to 0           //try to merge it again.
      end if
      set files[++tier] to fileNum //record how many files were created
                                   //at this level
      set fileNum to 0             //and reset these variables for
      set lowFileNum to 0          //for the next level of merging.
   }end while files[tier]>1   
end of function
 


Performance Testing
===================

These are the results of the runs, sorting 500,000 English
words, initially in random order. Maximum length per word
was 30 chars, plus one for the end of string char.

RUN#    Elapsed time       Rows in theBox files array
========================================================
  #1:     4.29 seconds           500
  #2:     2.49 seconds         5,000
  #3:     2.10 seconds        10,000
  #4      1.84 seconds        20,000
  #5      1.64 seconds        30,000
  #6      1.23 seconds        50,000
  #7      1.07 seconds        75,000
  #8      1.06 seconds       100,000


The largest BOXSIZE shows a 400% improvement, over the
smallest size.


Testing Notes
=============
Testing PC was an Intel i7 940 cpu @3.5GHz, and
one 1TB 7,200 RPM HD, compiled by Pelles C, and run on
Windows 7 OS.

Large word lists frequently include *huge* chemical or made
up "words"*. Before you start sorting out your strings, check
to be sure they do not exceed the limits of your program.

Aside from the above monstrosities - which I consider to be
descriptions or jokes, not words, English has no words with
more than 30 letters in it.

* Many chemical and medical "words" are really not words.
They are descriptions or several words:
polysaccharide should be poly saccharide, for example.

It is a given that you must know the string length maximum
that your program handles.


Conclusion
==========

Remember that your drive(s) must be able to hold, at the
same time, a minimum of:

* One full set of the (raw) data.

* A second full set of sorted data, in the Box files.

* A portion of the data, at the next level or Final
File.

Be sure you have enough drive space, before you begin.

If you have a slow drive, a slow cpu, or a small amount of
RAM, then your hardware will slow your sorting way down.
Faster drives (especially large SSD's and disk arrays),
or just more drives, can really speed up external sorting.

If your algorithm is inefficient, even the best hardware
won't get your sorting up to speed.

I've done nothing to optimize this program yet. Why didn't I
go ahead with the KWay merge sort code? Or the parallel
sorters, using multiple cores? Or some optimizations?

Because I follow the one sip rule:

If the program finishes before you can finish a sip of
coffee - then it's fast enough. ;)/>/>/>

/* 
Notice:
=======

THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY*! 

* It has not been adequately tested.
* It has not had error handling code included.
* It has not been optimized.

It is NOT a "ready to use" program! Study it! Then code
up your own program. 

Maximum Length of Any Word: 30 chars + end of string char

THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY*! 

Adak, Sept. 23, 2013. 

*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define C 31           //length of each row in the data array
#define BOXSIZE 100000 //number of strings in the smallest data array

//function definitions
int loadData(FILE *fp, char data[BOXSIZE][C]);
void merge2into1(char data[BOXSIZE][C], int boxNum);
void printIt(char data[BOXSIZE][C], int rowsLoaded); 
void quicksort(char data[BOXSIZE][C], int lo, int hi);
void printBoxes(char data[BOXSIZE][C], int rowsLoaded);


/* This program is for educational purposes *ONLY*! */
int main(void) {
   int rowsLoaded,boxNum=0;
   static char data[BOXSIZE][C]; //static arrays get memory
   FILE *fp;                     //from the much larger heap
   clock_t start,stop;
   printf("For those thinking of using KWay merging, the maximum number\n\
   \b\b\bof files that you can have open simultaneously, is: %d\n\n",FOPEN_MAX);
   
   start=clock(); 
   
   //open the raw data file 
   fp=fopen("wordsHalfM.txt","r");  
   if(!fp) {
      printf("Error! data file failed to open\n");
      return 1;
   }

   //start loading, sorting and writing out the box files
   printf(" Number of sorted box files created: %6d",boxNum);
   do {
      rowsLoaded=loadData(fp,data);
      
      quicksort(data, 0, rowsLoaded-1);
      
      printBoxes(data, rowsLoaded);

      if(rowsLoaded>0) 
         ++boxNum;
      
      printf("\b\b\b\b\b\b%6d",boxNum);
      fflush(stdout);

   }while(rowsLoaded==BOXSIZE);
   printf("\n\n");
   fclose(fp); 

   //box files are now all sorted and written out

   //merge them all into one final file
   merge2into1(data, boxNum);
   
   stop=clock();
   printf("Elapsed time: %f\n",(float)(stop-start)/CLOCKS_PER_SEC);
   return 0;
}
void merge2into1(char data[BOXSIZE][C], int boxNum) {
   FILE *fp1,*fp2,*fpOut;

   int tier=0,go1=1,go2=1,orphan=0;
   int files[1000]={0};
   unsigned long int fileNum=0,lowFileNum=0; 

   char filename1[25],filename2[25],oldFilename[25],s1[C],s2[C];

   files[0]=boxNum;
   do {
      
      do {
          //check your compiler, Pelles C has a limit of 33 chars
          //ultoa is not standard, but frequently included in compilers
         _ultoa(lowFileNum,filename1,10);
         _ultoa(lowFileNum+1,filename2,10);
         strcat(filename1, ".dat");
         strcat(filename2, ".dat");
      
         fp1=fopen(filename1, "r");
         fp2=fopen(filename2, "r");
         fpOut=fopen("temp.dat","w");

         if(!fp1) { 
            printf("Error! Unable to open fp1 files for merging\n");
            printf("filename1:%s\n",filename1);
            exit(1);
         }
         //!fp2 happens regularly, when there is only 1 file left unmerged
         //at this merge level. This handles the orphan file      
         if(!fp2) {
            orphan=1;
            strcpy(filename2, oldFilename);

            fp2=fopen(filename2, "r"); //merge orphan file with oldFilename
             
            if(!fp2) { //still an error?
               printf("Error! Unable to open fp2 file for merging\n");
               exit(1);
            }
         }
         if(!fpOut) {
            printf("Error! Unable to open temp.dat, to write merge data\n");
            exit(1);
         }
   
         go1=fscanf(fp1,"%s",s1);
         go2=fscanf(fp2,"%s",s2);
   
         //this is the small main merge loop - fast!
         while(go1>0 && go2>0) {
            if((strcmp(s1,s2))>0) {
               fprintf(fpOut,"%s\n",s2);
               go2=fscanf(fp2,"%s",s2);
            }else {
               fprintf(fpOut, "%s\n",s1);
               go1=fscanf(fp1,"%s",s1);
            }
         }

         //handle run outs of data, in either file
         if(go1>0 && go2<1) { 
            while(go1>0){
               fprintf(fpOut,"%s\n",s1);
               go1=fscanf(fp1,"%s",s1);
            }
         }else if(go2>0 && go1<1) {
            while(go2>0){   
               fprintf(fpOut,"%s\n",s2);
               go2=fscanf(fp2,"%s",s2);
            }
         }

         //both input files have been read. Close all three files.
         fclose(fp1); 
         fclose(fp2); 
         fclose(fpOut);
         
         remove(filename1); remove(filename2);
         //create file output name
         if(orphan) {
            rename("temp.dat", filename2);
         }else {   
            _ultoa(fileNum, filename1,10);
            strcat(filename1, ".dat");
            rename("temp.dat", filename1);
            strcpy(oldFilename, filename1);
         }
      
         ++fileNum;
         lowFileNum+=2; //number of files merged so far, on lower level
         //number of files merged so far, on current merge level
      }while(lowFileNum<files[tier]);

      if(orphan) {
         --fileNum;
         orphan=0;
      }
      files[++tier]=fileNum;
      fileNum=0;
      lowFileNum=0;
      
       
   }while(files[tier]>1);
}
void printIt(char data[BOXSIZE][C], int rowsLoaded) { //print rows of data
   int i;
   for(i=0;i<rowsLoaded;i++)
      printf("%s\n",data[i]);

   printf("\n\n");
}
//outputs the box files: 0.dat, 1.dat, etc.
void printBoxes(char data[BOXSIZE][C], int rowsLoaded) { 
   int i;
   FILE *fpOut;
   static unsigned long int n=0;
   char filename[25];
   
   _ultoa(n,filename,10);
   strcat(filename,".dat");
   
   fpOut=fopen(filename,"w");
   if(!fpOut) {
      printf("Error! data file failed to open\n");
      exit(1);
   }
   
   for(i=0;i<rowsLoaded;i++) {
      fprintf(fpOut, "%s\n",data[i]);
   }
   fclose(fpOut);
   ++n;
}

//quicksort Note: parameter hi, on the first call, should 
//be the highest valid index, not the size of the array
void quicksort(char data[BOXSIZE][C], int lo, int hi) { 
   int i=lo, j=hi; 
   char pivot[C],temp[C];
   strcpy(pivot,data[(lo+hi)/2]);

   if(lo <= hi) return; 
   
   // Partition the array 
   do {    
      while ((strcmp(data[i],pivot))<0)i++; 
      while ((strcmp(data[j],pivot))>0)j--;
      
      if (i<=j) {  
         strcpy(temp,data[i]);
         strcpy(data[i],data[j]);
         strcpy(data[j],temp);
         i++;
         j--;
      }
   } while(i<j);
  
   //handles the smaller partition, first
   if (lo < j) quicksort(data,lo, j);
   if (i < hi) quicksort(data,i, hi);
  
}
//loads up the words, into the data  array
int loadData(FILE *fp,char data[BOXSIZE][C]) {
   int i=0,n=1;
   
   while(i<BOXSIZE && n>0) {
      n=fscanf(fp, "%s",data[i]);
      ++i;
   }
   if(i<BOXSIZE)
      data[--i][0]='\0';
   return i;
}



This post has been edited by modi123_1: 06 December 2013 - 05:34 PM
Reason for edit:: fixed formatting


Is This A Good Question/Topic? 1
  • +

Replies To: External Sorting with C

#2 cats_rule  Icon User is offline

  • New D.I.C Head

Reputation: 2
  • View blog
  • Posts: 44
  • Joined: 21-November 10

Posted 24 November 2013 - 08:34 PM

Thank you , very useful. I'm studying C in University so any help is much needed :-)
Was This Post Helpful? 0
  • +
  • -

#3 Adak  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 331
  • View blog
  • Posts: 1,168
  • Joined: 01-April 11

Posted 06 December 2013 - 05:05 PM

The diagrams of the data flow were squished out. Here's how they should appear:

Fig. 1. This is how the program handles large sorting jobs:

============================================================
|                                                          |
|       <<<          BIG UNSORTED DATA           >>>       |
|                                                          |
============================================================
 Box files: [#1]     [#2]     [#3]  [#4]  [#5] 
              \        /        \     |    /          
               \      /          \    |   /          
                \    /            \   |  /          
                 \  /              \  | /          
                 Temp               Temp 
                  |                   |
                  |                   |  
              Renamed to #1       Renamed to #2 
                   \                 /
                    \               /
                     \             /
                      \           /
                       \         /
                        \       /
                         \     /
                          \   /
                          Temp
                            |
                            |
                 Renamed to Final File Name
   


The small numbered boxes above, represent the small data files
you will create one at a time, until all the data is exhausted.
Each of these "box" files will then be merged together, level
by level, until all the data is merged.

Box #5, being the "orphan", can be handled by merging it with
Temp #2, the last Temp file, at the next level up. A better way
is to notice that you have an odd number of files at this level,
and immediately merge the orphan with one of the other box files.

All the sorting is done when the sorted "box" files, have been
written out. After that, it's all merging.

It's important to keep the original data file, and the Final file.
The other files it creates should be removed, as soon as their
data has been merged to a high level. If not, you'll have yet
another complete copy of the data, at every level of your program.



Figure 2. This is a diagram for a more efficient, but more complex, k-way merge sort. Five box files are illustrated, but the real number of files you could open simultaneously, should be much higher, depending on your PC and operating system.

============================================================
|                                                          |
|       <<<          BIG UNSORTED DATA           >>>       |
|                                                          |
============================================================
Box files     [#1]  [#2]  [#3]  [#4]  [#5]     
                |     |     |     |    |
             =============================
             | #1 through #5 FIFO queues | 
             =============================
                |     |    |     |     |
             ============================= 
             |   Next value in all box   | 
             |   files is compared here. |  
             |   The lowest value moves  | 
             |   to the Final File.      | 
             ============================= 
                          |           
                          |           
                          |       
                     Final File




This post has been edited by Adak: 06 December 2013 - 05:08 PM

Was This Post Helpful? 0
  • +
  • -

Page 1 of 1