
Processing Data Held In A Comma Separated File

#1 Martyn.Rae

Posted 21 July 2017 - 01:03 PM

Introduction

This tutorial concerns itself with processing a comma separated data file (say, for example, one exported from a spreadsheet). We will presume that each record contains 5 fields, that each field is double quoted (as in "field 1","field 2", ... etc.), and that there are exactly one million records we need to process. Hmm ... if we want to see the results of the endeavour whilst we are still alive, we have to write super efficient code!

Reading the Comma Separated File

OK, the first thing we are going to do is read the entire file contents into memory. Yes, the whole file!

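#include <windows.h>
#include <stdio.h>

// Print the message and terminate the process.
void __stdcall report_error(const char * text) {
    printf("%s", text);
    ExitProcess(-1);
}

int main(int argc, char * argv[]) {
    if (argc != 2) report_error("Missing input file!\n");
    // argv holds char strings, so call the ANSI version of CreateFile.
    HANDLE handle = CreateFileA(argv[1], GENERIC_READ, 0, nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (handle == INVALID_HANDLE_VALUE) report_error("Unable to open input file!\n");
    DWORD fileSize = GetFileSize(handle, nullptr);
    if (fileSize == INVALID_FILE_SIZE) report_error("Unable to get the file size!\n");
    // One extra byte for the end-of-data marker.
    char * source = reinterpret_cast<char *>(VirtualAlloc(nullptr, fileSize + 1, MEM_COMMIT, PAGE_READWRITE));
    if (source == nullptr) report_error("Unable to allocate memory!\n");
    DWORD bytes_read;
    if (!ReadFile(handle, source, fileSize, &bytes_read, nullptr) || bytes_read != fileSize) report_error("Unable to read input file!\n");
    source[fileSize] = 21;
    CloseHandle(handle);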



Notice that we have requested memory for the size of the file plus one additional byte, which is used to indicate the end of the data. I personally like to use the ASCII character 21 (don't ask me why!). This only works as long as that byte never appears in the file itself.
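For readers who are not on Windows, the same idea (whole file in one buffer, plus one trailing marker byte) can be sketched with the standard library. This is only an illustration under assumptions: read_with_sentinel and END_OF_DATA are hypothetical names that do not appear in the tutorial code, and a std::vector is used in place of VirtualAlloc.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Marker byte appended after the file data (ASCII 21, as in the tutorial).
const char END_OF_DATA = 21;

// Hypothetical helper: reads the whole file at `path` into `buffer` and
// appends the marker byte. Returns false if the file cannot be opened or read.
bool read_with_sentinel(const std::string & path, std::vector<char> & buffer) {
    // Open at the end so that tellg() reports the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return false;
    const std::streamoff size = file.tellg();
    if (size < 0) return false;
    buffer.resize(static_cast<std::size_t>(size) + 1);   // +1 for the marker
    file.seekg(0, std::ios::beg);
    if (size > 0 && !file.read(buffer.data(), size)) return false;
    buffer[static_cast<std::size_t>(size)] = END_OF_DATA;
    assert(buffer.back() == END_OF_DATA);
    return true;
}
```

The buffer returned here plays the role of source in the Windows code: buffer.data() is where the parsing would start.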

Defining the Structure To Hold The Data

typedef struct data_record {
    char * field1;
    char * field2;
    char * field3;
    char * field4;
    char * field5;
} data_record;



Hopefully you all understand that structure: one pointer per field, each of which will point at the start of that field's text.

Defining The Array To Hold The Structures

Each structure is 5 x 8 bytes (64-bit pointers) = 40 bytes, so one million records of 40 bytes will occupy 40,000,000 bytes. Let's allocate that.

    data_record * array = reinterpret_cast<data_record *>(VirtualAlloc(nullptr, 40000000, MEM_COMMIT, PAGE_READWRITE));
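If you would rather not hard-code the 40,000,000, the size can be derived from the structure itself. The following is a small sketch; array_bytes is a hypothetical helper name, and the static_assert simply records the assumption that the structure contains five pointers and no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef struct data_record {
    char * field1;
    char * field2;
    char * field3;
    char * field4;
    char * field5;
} data_record;

// Five pointers and nothing else, so no padding is expected.
static_assert(sizeof(data_record) == 5 * sizeof(char *), "unexpected padding in data_record");

// Hypothetical helper: number of bytes needed for `count` records.
std::size_t array_bytes(std::size_t count) {
    assert(count <= SIZE_MAX / sizeof(data_record));   // guard against overflow
    return count * sizeof(data_record);
}
```

With 8-byte pointers, array_bytes(1000000) works out to the same 40,000,000 bytes, and the figure stays correct if the code is ever built with 4-byte pointers.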



Processing The Source Information In A Loop

This is a loop that is terminated by that ASCII character 21. Here is the skeleton of the loop, where source_pointer is a char * that starts at source.

    while (true) {
        if (*source_pointer == 21) break;
        /* process the line up to and including CR or LF or CRLF */
    }
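The "CR or LF or CRLF" part of that comment can be handled with two tests in a row. Here is a minimal sketch; skip_line_ending is a hypothetical helper name, not something from the tutorial code.

```cpp
#include <cassert>

// Hypothetical helper: advances past one line ending (CR, LF or CRLF) and
// returns the new position. Returns `p` unchanged if it is not at a line ending.
char * skip_line_ending(char * p) {
    assert(p != nullptr);
    if (*p == '\r') p++;   // CR, possibly the first half of CRLF
    if (*p == '\n') p++;   // LF on its own, or the second half of CRLF
    return p;
}
```

Because each test consumes at most one character, a file that mixes the three conventions is handled just as well as a consistent one.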



This is where it starts getting interesting! Each line consists of five double quoted strings separated by commas (with no spaces around the commas), followed by either a carriage return character, a line feed character, or a carriage return and a line feed.

So we can set a pointer to the start of our array and process the five fields as follows.

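    char * source_pointer = source;       // walks through the file contents
    data_record * this_record = array;    // next free entry in the array
    while (true) {
        if (*source_pointer == 21) break;
        if (*source_pointer++ != '\"') report_error("Missing double quotes\n");
        this_record->field1 = source_pointer;
        while (*source_pointer != '\"' && *source_pointer != 21) source_pointer++;
        if (*source_pointer == 21) report_error("Missing double quotes\n");
        *source_pointer++ = 0;            // closing quote becomes the terminator
        if (*source_pointer++ != ',') report_error("Missing comma\n");
        if (*source_pointer++ != '\"') report_error("Missing double quotes\n");
        this_record->field2 = source_pointer;
        while (*source_pointer != '\"' && *source_pointer != 21) source_pointer++;
        if (*source_pointer == 21) report_error("Missing double quotes\n");
        *source_pointer++ = 0;
        if (*source_pointer++ != ',') report_error("Missing comma\n");
        if (*source_pointer++ != '\"') report_error("Missing double quotes\n");
        this_record->field3 = source_pointer;
        while (*source_pointer != '\"' && *source_pointer != 21) source_pointer++;
        if (*source_pointer == 21) report_error("Missing double quotes\n");
        *source_pointer++ = 0;
        if (*source_pointer++ != ',') report_error("Missing comma\n");
        if (*source_pointer++ != '\"') report_error("Missing double quotes\n");
        this_record->field4 = source_pointer;
        while (*source_pointer != '\"' && *source_pointer != 21) source_pointer++;
        if (*source_pointer == 21) report_error("Missing double quotes\n");
        *source_pointer++ = 0;
        if (*source_pointer++ != ',') report_error("Missing comma\n");
        if (*source_pointer++ != '\"') report_error("Missing double quotes\n");
        this_record->field5 = source_pointer;
        while (*source_pointer != '\"' && *source_pointer != 21) source_pointer++;
        if (*source_pointer == 21) report_error("Missing double quotes\n");
        *source_pointer++ = 0;
        // Skip the line ending, whether it is CR, LF or CRLF.
        while (*source_pointer == '\r' || *source_pointer == '\n') source_pointer++;
        this_record++;
    }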




Notice that we never copy any field data: we are using the data that we read from the file into memory, and pointers into that memory, to complete the table. Also note that the closing double quote of each field is overwritten with the null character, such that the field is now null terminated.
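The in-place trick described above can also be packaged as a small helper that handles one field at a time. This is a minimal, portable sketch rather than the tutorial's loop: take_field and END_OF_DATA are hypothetical names, and the helper assumes the buffer ends with the marker byte 21 so that a missing quote is detected instead of running off the end of the memory.

```cpp
#include <cassert>

const char END_OF_DATA = 21;   // marker byte that terminates the buffer

// Hypothetical helper: expects `*cursor` to point at an opening double quote.
// On success it null terminates the field in place, moves `*cursor` to the
// character after the closing quote and returns a pointer to the field text.
// Returns nullptr (leaving `*cursor` untouched) if a quote is missing.
char * take_field(char ** cursor) {
    assert(cursor != nullptr && *cursor != nullptr);
    char * p = *cursor;
    if (*p != '"') return nullptr;              // no opening quote
    char * field = ++p;                         // first character of the field
    while (*p != '"' && *p != END_OF_DATA) p++;
    if (*p == END_OF_DATA) return nullptr;      // no closing quote
    *p++ = 0;                                   // overwrite the closing quote
    *cursor = p;
    return field;
}
```

A caller would invoke take_field five times per line, checking for a comma between calls. The nullptr return is one possible way to deal with a file in which a double quote is missing.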

We now have all of the records held in our array, and the record fields are pointing to the null terminated strings held in the source that has been read into memory. Neat, eh?

This post has been edited by Martyn.Rae: 27 July 2017 - 05:28 AM



Replies To: Processing Data Held In A Comma Separated File

#2 CTphpnwb

Posted 24 July 2017 - 04:36 PM

So what happens if there is no double quote in the text?