Page 1 of 1

Microsoft : Implementing an Indexed Table : Part I and II Rate Topic: -----

#1 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 545
  • View blog
  • Posts: 1,420
  • Joined: 22-August 09

Posted 26 July 2017 - 12:33 PM

Introduction

Accessing data held in a table in memory requires careful thought. The Standard Template Library (STL) does not 'cut the mustard' as far as providing fast access to a huge table consisting of hundreds of thousands or more of entries. What I am about to show you is code that can load millions of entries into a table in minutes! Furthermore, this solution can provide forward and backward access through sorted data, search for an individual entry based on it's key and access to an individual entry based on it's table index. Sound to good to be true? Read on!!!

Alternatives

Chained Lists

Chained lists require either individual entries to have forward or forward/backward pointers, or wrapper classes that contain these pointers and either a copy of the item or simply a pointer to the item.

<item_entry> := <forward pointer><backward pointer><item>

or

<item_entry> := <forward pointer><backward pointer><item pointer>


This solution is only good for small amounts of data. Locating the position at which a new item is to be inserted grows exponentially as the list grows.

Hash Table Chained Lists

An variation of the chained list is to use a hash table that consists multiple chained lists. The key of the item being inserted is used as the input to some hashing algorithm that produces a value that when divided by some fixed value provides a remainder that can be used as the index into the hash table and therefore the start of the chain into which the item will be inserted.

The hash table solution does provide a means of inserting more items, but again cannot handle more that at most several thousand entries without individual chains suffering the same problem as a single chained list.

The Solution

The solution is really quite simple and very elegant. Consider a system of index node blocks and data leaf blocks (yes, a bit like a tree with branches and leaves). The index node blocks consist of <key pointer>/<node pointer> pairs held in sorted order. The <key pointer> is a pointer to the highest key that is contained within the <key pointer>/<node pointer> pair and the node points to either another (lower) index node block or a data leaf block. The data leaf block contains a sorted list of <key pointer> entries. Each index node block contains 255 <key pointer>/<node pointer> pairs. Each data leaf block contains 510 pointers to individual items. The reason for these rather exacting numbers are not randomly picked, but computed as will be seen later.

Now, let's say we have 10,000,000 items we need to load into the table. This would give us 10,000,000 / 510 data leaf blocks - 19,608 in total. The parent index node blocks can contain 255 <key pointer>/<node pointer> pairs. This means that we need 19,608 / 255 higher level index node blocks - 77 in total. To accommodate that number of <key pointer>/<node pointer> pairs requires one higher level index node block known as the root node block.


root block
/|\
index block #0 ... index block #76
/|\ /|\
leaf block #0 ... leaf block #19607


Based on the aforementioned data, we have a root node (level 0) whose <key pointer>/<node pointer> pairs point to level 1 index node blocks. Each index node blocks contains <key pointer>/<node pointer> pairs used to point to an individual data leaf block.

To locate the position of any given item would mean that worst case scenario we have to compare the item's key 77 + 255 = 332 times before we have the data leaf block that will contain the item's key.

Node and Data Block Size

We have to determine the size of the index node and data leaf block size. Firstly, it has to be remembered that data has to be held in sorted order, so we cannot have blocks that are very large as that will increase the overhead of moving data within the block. Similarly, we do not want very small blocks as that could mean we will be spending more time allocating memory rather than focusing on the insertion.

I have chosen to use the natural page size of the Microsoft Windows system which is 4096 bytes and can then use VirtualAlloc to allocate 4096 byte blocks (if you are using a different operating system try not to use heap allocations as the 'behind the scenes' housekeeping can be rather excessive).

The block header for both the index node and leaf data blocks consists of three fields a <node type>, an <entry count> and a <parent pointer>. The <node type> and <entry count> are each 32-bits wide and the pointer is 64-bits. We have then a total of 16 bytes of header. Subtracted from the 4096 block size, gives us 4080 bytes free for the block content. For leaf data blocks, using 64-bit pointers, 4080 / 8 gives us 510 entries and for the index node blocks (again using 64-bit pointers) each <key pointer>/<node pointer> pair would be 16 bytes and that would give us 255 index entries.

Adding an Entry

To add an entry into the system, we will have a routine to compare the entries key with each of the key's held in the root node until we find a key that is greater than or equal to the entries key. We then take the node pointer associated with the larger key and call the routine again using that node pointer. The process is repeated until we have a leaf data block (as we have shown, this requires one single recursive call. We then have to compare the leaf entries with the entry being added until we find a leaf entry that is greater than or equal to it. This is where the entry being inserted needs to be placed.

This post has been edited by Martyn.Rae: 27 July 2017 - 12:02 PM


Is This A Good Question/Topic? 0
  • +

Replies To: Microsoft : Implementing an Indexed Table : Part I and II

#2 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 545
  • View blog
  • Posts: 1,420
  • Joined: 22-August 09

Posted 27 July 2017 - 05:08 AM

PART II

In Part I of this tutorial, we described a rough outline of a means of loading a humongous number of entries into a table, yet providing us with a super fast way of accessing items in that table. In this part, we shall look at the code and discuss some of the more interesting points.

Here is the code in it's entirety.

#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum NODE_TYPE :__int32 { INDEX_NODE, DATA_NODE } NODE_TYPE;
typedef int(__stdcall *COMPARE_ITEM)(void * item1, void * item2);

typedef struct NODE_ENTRY {
    void * item;
    void * node;
} NODE_ENTRY;

typedef struct NODE {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    NODE_ENTRY    entry[255];
} NODE;

typedef struct LEAF {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    void *        entry[510];
} LEAF;

unsigned char * item_pointer;
NODE * root_node;

NODE * __stdcall create_node(NODE_TYPE node_type) {
    NODE * node = reinterpret_cast<NODE *>(VirtualAlloc(nullptr, 4096, MEM_COMMIT, PAGE_READWRITE));
    node->node_type = node_type;
    return node;
}

int __stdcall compare(void * item1, void *item2) {
    return memcmp(reinterpret_cast<char *>(item1), reinterpret_cast<char *>(item2), 32);
}

void __stdcall split_root_node(NODE * this_node) {
    NODE * split_node = reinterpret_cast<NODE *>(create_node(INDEX_NODE));
    memmove(&split_node->entry[0], &this_node->entry[127], 2048);
    memset(&this_node->entry[127], 0, 2048);
    this_node->entry_count = 127;
    split_node->entry_count = 128;
    for (int ix = 0; ix < 128; ix++) {
        reinterpret_cast<NODE *>(split_node->entry[ix].node)->parent_node = split_node;
    }
    if (this_node->parent_node == nullptr) {
        NODE * new_node = reinterpret_cast<NODE *>(create_node(INDEX_NODE));
        this_node->parent_node = new_node;
        split_node->parent_node = new_node;
        new_node->entry_count = 2;
        new_node->entry[0].item = this_node->entry[126].item;
        new_node->entry[0].node = this_node;
        new_node->entry[1].item = split_node->entry[127].item;
        new_node->entry[1].node = split_node;
        root_node = new_node;
        return;
    }
    else {
        for (int ix = 0; ix < this_node->parent_node->entry_count; ix++) {
            if (this_node->parent_node->entry[ix].node == this_node) {
                char * p1 = reinterpret_cast<char *>(&this_node->parent_node->entry[ix]);
                char * p2 = reinterpret_cast<char *>(&this_node->parent_node->entry[ix + 1]);
                memmove(p2, p1, (this_node->parent_node->entry_count - ix) * 16);
                this_node->parent_node->entry[ix].item = this_node->entry[126].item;
                this_node->parent_node->entry[ix + 1].node = split_node;
                this_node->parent_node->entry_count++;
                split_node->parent_node = this_node->parent_node;
                if (this_node->parent_node->entry_count == 255) {
                    split_root_node(this_node->parent_node);
                }
                return;
            }
        }
        printf("Something has gone seriously wrong - We have lost a node!\n");
        ExitProcess(-1);
    }
}

void __stdcall insert_node_item(NODE * this_node, LEAF * left, void * left_item, LEAF * right) {
    INT node_index;
    for (node_index = 0; node_index < this_node->entry_count; node_index++) {
        if (this_node->entry[node_index].node == left) {
            char * p1 = reinterpret_cast<char *>(&this_node->entry[node_index]);
            char * p2 = reinterpret_cast<char *>(&this_node->entry[node_index + 1]);
            memmove(p2, p1, (this_node->entry_count - node_index) * 16);
            this_node->entry[node_index].item = left_item;
            this_node->entry[node_index + 1].node = right;
            this_node->entry_count++;
            if (this_node->entry_count == 255) {
                split_root_node(this_node);
            }
            return;
        }
    }
    printf("Something has gone seriously wrong - This should never have happened!\n");
    ExitProcess(-1);
}

void __stdcall insert_leaf_item(LEAF * leaf, INT node_index, void * item) {
    char * p1 = reinterpret_cast<char *>(&leaf->entry[node_index]);
    char * p2 = reinterpret_cast<char *>(&leaf->entry[node_index + 1]);
    memmove(p2, p1, (leaf->entry_count - node_index) * 8);
    leaf->entry[node_index] = item;
    leaf->entry_count++;
    if (leaf->entry_count == 510) {
        LEAF * new_node = reinterpret_cast<LEAF *>(create_node(DATA_NODE));
        memmove(&new_node->entry[0], &leaf->entry[255], 2040);
        memset(&leaf->entry[255], 0, 2040);
        leaf->entry_count = 255;
        new_node->entry_count = 255;
        new_node->parent_node = leaf->parent_node;
        insert_node_item(leaf->parent_node, leaf, leaf->entry[254], new_node);
    }
}

void __stdcall add_item(NODE * node, void * item, COMPARE_ITEM compare_item) {
    if (node->node_type == INDEX_NODE) {
        INT node_index;
        for (node_index = 0; node_index < node->entry_count; node_index++) {
            if (compare_item(node->entry[node_index].item, item) > 0) {
                add_item(reinterpret_cast<NODE *>(node->entry[node_index].node), item, compare_item);
                return;
            }
        }
    }
    else {
        LEAF * leaf = reinterpret_cast<LEAF *>(node);
        INT node_index;
        for (node_index = 0; node_index < leaf->entry_count; node_index++) {
            if (compare_item(leaf->entry[node_index], item) > 0) {
                insert_leaf_item(leaf, node_index, item);
                return;
            }
        }
    }
}

void __stdcall print_data(NODE * current_node) {
    if (current_node->node_type == DATA_NODE) {
        for (int ix = 0; ix < current_node->entry_count; ix++) {
            printf("%32s\n", reinterpret_cast<char *>(reinterpret_cast<LEAF *>(current_node)->entry[ix]));
        }
        return;
    }
    for (int ix = 0; ix < current_node->entry_count; ix++) {
        print_data(reinterpret_cast<NODE *>(current_node->entry[ix].node));
    }
}

int main() {
    unsigned char * item_base = item_pointer = (unsigned char *)VirtualAlloc(nullptr, 1000000000, MEM_COMMIT, PAGE_READWRITE);
    unsigned char * data = item_pointer;
    root_node = create_node(INDEX_NODE);
    for (int ix = 0; ix < 32; ix++) *item_pointer++ = 'Z';
    *item_pointer++ = 0;
    root_node->entry[0].item = data;
    root_node->entry[0].node = create_node(DATA_NODE);
    root_node->entry_count = 1;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->parent_node = root_node;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->entry[0] = data;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->entry_count = 1;
    int loop_index;
    srand(11113);
    for (loop_index = 0; loop_index < 10000000; loop_index++) {
        unsigned char * data = item_pointer;
        for (int ix = 0; ix < 32; ix++) *key_pointer++ = static_cast<char>((rand() % 26) + 'A');
        *item_pointer++ = 0;
        add_item(root_node, data, compare);
    }
    print_data(root_node);
    return 0;
}



The Include Files

#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>



We include the windows.h file, as standard (for Microsoft Windows of course) and the stdio.h include file so we can use the printf function and finally the stdlib.h so we can use the pseudo-random number generator functions srand and rand.

The Definitions

typedef enum NODE_TYPE :__int32 { INDEX_NODE, DATA_NODE } NODE_TYPE;
typedef int(__stdcall *COMPARE_ITEM)(void * item1, void * item2);



The first of these two lines provides us with the ability to determine the type of block we are dealing with (i.e. an INDEX_NODE or a DATA_NODE).
The second line defines a function pointer that has two parameters. The solution needs to be as generic as possible, even though the code can only be compiled and run on a Microsoft Windows system. Using this mechanism means that we can pass the address of the two items that need comparison without our routines having to know the internal structure of the item, the offset position of the start of the key relative to the item and key structure itself (e.g. could be some data record with a composite key consisting of a mix of fields).

typedef struct NODE_ENTRY {
    void * item;
    void * node;
} NODE_ENTRY;

typedef struct NODE {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    NODE_ENTRY    entry[255];
} NODE;

typedef struct LEAF {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    void *        entry[510];
} LEAF;



The NODE_ENTRY structure is used to define the <item pointer> and associated <node pointer> pair that is used as part of an index node. The <item pointer> being the highest key in the associated node (regardless if the associated node is an index node block or a data node block).

The NODE structure defines an index node and consists of the <node_type> enumeration, the <entry_count> field is a count of the number of entries used in this index node, <parent_node> is a pointer to the index node that owns this index node (or is a nullptr if the index node is the root node), and finally the <entry> table that contains the index node entry pairs.

The LEAF structure defines the data node and consists of the <node_type> enumeration, the <entry_count> field is a count of the number of entries used in this data node, <parent_node> is a pointer to the index node that owns this data node, and finally the <entry> table that contains the pointers to individual items.

The create_node Function

NODE * __stdcall create_node(NODE_TYPE node_type) {
    NODE * node = reinterpret_cast<NODE *>(VirtualAlloc(nullptr, 4096, MEM_COMMIT, PAGE_READWRITE));
    node->node_type = node_type;
    return node;
}



This function is used to obtain a pointer to a new block that has been initialised to the node type passed as the parameter. Now, it should be noted at this point, one could use any memory allocation routine in order to obtain a new block. We have chosen the VirtualAlloc function as it has an extremely low footprint.

The main Function

My apologies for skipping the next set of functions and jumping to the main function. This is necessary for 'setting the scene' so to speak'.

int main() {
    unsigned char * item_base = item_pointer = (unsigned char *)VirtualAlloc(nullptr, 1000000000, MEM_COMMIT, PAGE_READWRITE);
    unsigned char * data = item_pointer;
    root_node = create_node(INDEX_NODE);
    for (int ix = 0; ix < 32; ix++) *item_pointer++ = 'Z';
    *item_pointer++ = 0;
    root_node->entry[0].item = data;
    root_node->entry[0].node = create_node(DATA_NODE);
    root_node->entry_count = 1;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->parent_node = root_node;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->entry[0] = data;
    reinterpret_cast<LEAF *>(root_node->entry[0].node)->entry_count = 1;
    int loop_index;
    srand(11113);
    for (loop_index = 0; loop_index < 10000000; loop_index++) {
        unsigned char * data = item_pointer;
        for (int ix = 0; ix < 32; ix++) *key_pointer++ = static_cast<char>((rand() % 26) + 'A');
        *item_pointer++ = 0;
        add_item(root_node, data, compare);
    }
    print_data(root_node);
    return 0;
}



The first of code requires some explanation as we are allocating one gigabyte of memory. This is simply to hold the items that we are going to randomly generate so we have something to put into the table. We have chosen to create items of 33 bytes in length but in actual fact as we are simply going to be storing pointers in the index and data blocks, the item size itself is not important. We have chosen to make the key a 32 byte random string of upper case alphabetic characters terminated by a null character so that the item and key are the same thing.

Next, we create the root node as an index node, create a 32 byte item consisting of all 'Z' and terminate it with a null character. The start address of this all 'Z' string is then used as the root node's first table entry and the node associated with this key is created as a data block.

Why are we doing this? Well, this is very important to the scheme of things as we do not have to make a special exception for the first item added to the table and also ensures that all entries added will be placed before the all 'Z' index entry. If we did not do the latter, we would have to concern ourselves with some very ugly code that would run back from the data node through the parent nodes to the root node, changing the highest key item held in each of the index nodes. Trust me that is not a direction you want to go in!

It should also be noted that we could have used 0xFF as the character to use to fill the first string, but as we are going to generate keys consisting of upper case alphabetic characters this would seem a little pointless.

The next few lines finish complete the root node set up including adding the pointer to the all 'Z' key we have just generated.

The srand(11113); line of code is simply to seed the random number generator. You can of course skip this line of code, but for testing purposes, setting the seed to some arbitrary value will generate the same set of random numbers regardless of how many times the rand() function is called. This is very useful as we know that say for example the key 'ZZYSTKZRIYWICMJRGZXJUSDMQHPYSGPT' will always be generated (providing the number of entries we are adding does not change).

Next we have our loop that creates our 33 byte random item and calls the add_item routine. Notice the add_item routine always passes the pointer to the root node as the first parameter, a pointer to the item that is to be added to the table and finally, a pointer to the routine that is to be used to compare two items.

Finally, we call the routine print_data to print out each item within the table. This is of course not required but again for testing purposes is very useful.

The add_item Function

void __stdcall add_item(NODE * node, void * item, COMPARE_ITEM compare_item) {
    if (node->node_type == INDEX_NODE) {
        INT node_index;
        for (node_index = 0; node_index < node->entry_count; node_index++) {
            if (compare_item(node->entry[node_index].item, item) > 0) {
                add_item(reinterpret_cast<NODE *>(node->entry[node_index].node), item, compare_item);
                return;
            }
        }
    }
    else {
        LEAF * leaf = reinterpret_cast<LEAF *>(node);
        INT node_index;
        for (node_index = 0; node_index < leaf->entry_count; node_index++) {
            if (compare_item(leaf->entry[node_index], item) > 0) {
                insert_leaf_item(leaf, node_index, item);
                return;
            }
        }
    }
}



This is the routine that we use to add an item into a data block. We start off by determining the type of node that has been passed to us.

Initially, this will be the root node, but notice that at line 6 of the extracted block above, we call ourselves with a pointer to a node block where the compare_item routine (parameter #3) has return a +ve value indicating that the item in the node entry table is greater than the item we are adding.

If the node (parameter #1) we have been passed is not a data node, then we again run through that nodes entries until the compare_item routine (parameter #3) has return a +ve value indicating that the item in the node entry table is greater than the item we are adding. This provides us with the data block into which the item being added is to be placed and it's index within that node's table entries.

We call insert_leaf_item with the pointer to the node into which the item is to be placed (parameter #1), the index of the entry that needs to contain the item being added (parameter #2) and of course the item pointer (parameter #3).

The insert_leaf_item Function

void __stdcall insert_leaf_item(LEAF * leaf, INT node_index, void * item) {
    char * p1 = reinterpret_cast<char *>(&leaf->entry[node_index]);
    char * p2 = reinterpret_cast<char *>(&leaf->entry[node_index + 1]);
    memmove(p2, p1, (leaf->entry_count - node_index) * 8);
    leaf->entry[node_index] = item;
    leaf->entry_count++;
    if (leaf->entry_count == 510) {
        LEAF * new_node = reinterpret_cast<LEAF *>(create_node(DATA_NODE));
        memmove(&new_node->entry[0], &leaf->entry[255], 2040);
        memset(&leaf->entry[255], 0, 2040);
        leaf->entry_count = 255;
        new_node->entry_count = 255;
        new_node->parent_node = leaf->parent_node;
        insert_node_item(leaf->parent_node, leaf, leaf->entry[254], new_node);
    }
}



This routine simply creates a 'hole' into which the item being added is to be placed, inserts the item pointer into that hole and increments the number of entries being used in the node's entry table.

Here is the fun part! What do we do when we have completely filled all entries in a specific nodes entry table? Well, we have to split the node that has been filled into two halves. So we create a new data block, move the last half of the data entry pointers into the start of the new block and then clear the top half of the entries in the node that was full. Finally, we call insert_node_item to update the index node with the new data block entry.

insert_node_item

void __stdcall insert_node_item(NODE * this_node, LEAF * left, void * left_item, LEAF * right) {
    INT node_index;
    for (node_index = 0; node_index < this_node->entry_count; node_index++) {
        if (this_node->entry[node_index].node == left) {
            char * p1 = reinterpret_cast<char *>(&this_node->entry[node_index]);
            char * p2 = reinterpret_cast<char *>(&this_node->entry[node_index + 1]);
            memmove(p2, p1, (this_node->entry_count - node_index) * 16);
            this_node->entry[node_index].item = left_item;
            this_node->entry[node_index + 1].node = right;
            this_node->entry_count++;
            if (this_node->entry_count == 255) {
                split_root_node(this_node);
            }
            return;
        }
    }
    printf("Something has gone seriously wrong - This should never have happened!\n");
    ExitProcess(-1);
}



The first part of this routine adds a new <NODE_ENTRY> pair into the node's entry table. To do this, we have to recognise the fact that the 'left' node (i.e. the node that became full) is already in the table. This entry is located and then the memmove function will move all entries from that point one entry down. So if we found the left node at position 82 in the entry table, we would move entry 82 to 83, 83 to 84, 84 to 85 etc. Entry 82 and entry 83 would contain the same <item_pointer>/<node_pointer> pair. All we have to do with entry 82 is to update the item pointer to the new highest entry item in that node (parameter #3). Entry 83's <key pointer> is correct as it contains the highest item key which although is now in a different entry position in the split node, remains the same. This entry simply requires it's node pointer to be updated to the address of the split node (parameter #4).

If the node's entry table becomes full, then we have to split the index node into two halves. This is performed by calling the routine split_root_node.

split_node_item

void __stdcall split_root_node(NODE * this_node) {
    NODE * split_node = reinterpret_cast<NODE *>(create_node(INDEX_NODE));
    memmove(&split_node->entry[0], &this_node->entry[127], 2048);
    memset(&this_node->entry[127], 0, 2048);
    this_node->entry_count = 127;
    split_node->entry_count = 128;
    for (int ix = 0; ix < 128; ix++) {
        reinterpret_cast<NODE *>(split_node->entry[ix].node)->parent_node = split_node;
    }
    if (this_node->parent_node == nullptr) {
        NODE * new_node = reinterpret_cast<NODE *>(create_node(INDEX_NODE));
        this_node->parent_node = new_node;
        split_node->parent_node = new_node;
        new_node->entry_count = 2;
        new_node->entry[0].item = this_node->entry[126].item;
        new_node->entry[0].node = this_node;
        new_node->entry[1].item = split_node->entry[127].item;
        new_node->entry[1].node = split_node;
        root_node = new_node;
        return;
    }
    else {
        for (int ix = 0; ix < this_node->parent_node->entry_count; ix++) {
            if (this_node->parent_node->entry[ix].node == this_node) {
                char * p1 = reinterpret_cast<char *>(&this_node->parent_node->entry[ix]);
                char * p2 = reinterpret_cast<char *>(&this_node->parent_node->entry[ix + 1]);
                memmove(p2, p1, (this_node->parent_node->entry_count - ix) * 16);
                this_node->parent_node->entry[ix].item = this_node->entry[126].item;
                this_node->parent_node->entry[ix + 1].node = split_node;
                this_node->parent_node->entry_count++;
                split_node->parent_node = this_node->parent_node;
                if (this_node->parent_node->entry_count == 255) {
                    split_root_node(this_node->parent_node);
                }
                return;
            }
        }
        printf("Something has gone seriously wrong - We have lost a node!\n");
        ExitProcess(-1);
    }
}



This routine will split an index node into two halves and either create a new root node if the node being split does not have a parent node, or update the parent's node entry table to reflect the split.

This post has been edited by Martyn.Rae: 27 July 2017 - 11:41 AM

Was This Post Helpful? 0
  • +
  • -

#3 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 02 August 2017 - 08:25 PM

View PostMartyn.Rae, on 26 July 2017 - 03:33 PM, said:

This solution is only good for small amounts of data. Locating the position at which a new item is to be inserted grows exponentially as the list grows.

Searching for the insertion point in a linked list grows linearly, not exponentially. It's an O(N) operation.
Was This Post Helpful? 0
  • +
  • -

#4 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 02 August 2017 - 08:32 PM

View PostMartyn.Rae, on 27 July 2017 - 08:08 AM, said:

The Definitions

typedef enum NODE_TYPE :__int32 { INDEX_NODE, DATA_NODE } NODE_TYPE;
typedef int(__stdcall *COMPARE_ITEM)(void * item1, void * item2);



The first of these two lines provides us with the ability to determine the type of block we are dealing with (i.e. an INDEX_NODE or a DATA_NODE).


You are writing C++ code based on all your (gratuitous) use of reinterpret_cast<> everywhere. Since you are using C++, there is no need to typedef ... NODE_TYPE. In C++ when you declare an enum, a type is automatically created.

As aside, the typical C++ style is for constants to be all capital letters, but not types. You seem to have transliterated some C code into C++.
Was This Post Helpful? 0
  • +
  • -

#5 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 02 August 2017 - 08:39 PM

View PostMartyn.Rae, on 27 July 2017 - 08:08 AM, said:

typedef struct NODE_ENTRY {
    void * item;
    void * node;
} NODE_ENTRY;

typedef struct NODE {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    NODE_ENTRY    entry[255];
} NODE;

typedef struct LEAF {
    NODE_TYPE     node_type;
    __int32       entry_count;
    struct NODE * parent_node;
    void *        entry[510];
} LEAF;



And again, in C++ there is no need for all these typedefs. You could have simple declared the struct's and done with it.
Was This Post Helpful? 0
  • +
  • -

#6 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 545
  • View blog
  • Posts: 1,420
  • Joined: 22-August 09

Posted 03 August 2017 - 09:32 PM

Thank you very much for those comments. The term 'exponentially' was not meant to be taken literally as a mathematical term but more as a means of stating that the more items inserted, the more time is taken to complete the operation.

As for the latest C++ standards and specifications you have pointed out, I was unaware of them. Ignorance is bliss as they say. However, in my defence, I do not like compilers making assumptions about things. I would rather spend the extra time typing unnecessary characters so I am quite sure everybody knows this is the intention.

The code was written using the albeit old C++ standards but in the flavour of the C programming language as there isn't a hint of objects and classes. I agree however, the code here is neither C (as in the original specification) or C++ as it stands today!

This post has been edited by Martyn.Rae: 03 August 2017 - 09:34 PM

Was This Post Helpful? 0
  • +
  • -

#7 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 04 August 2017 - 04:04 AM

Nothing I pointed out above is new in C++. They had always been there since the original version of the language. I think what you had been using/reading the older Microsoft Win32 API header files and mimicked the style there because those headers are meant to be included by either C or C++ compilers.

Or you are part of the generation that learned C++ as simple "C with classes". I was part of that generation. :)/>
Was This Post Helpful? 0
  • +
  • -

#8 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 04 August 2017 - 04:45 AM

So after following the code and stepping back from the minutea, it looks like you have implemented a B-tree where the only difference is that you have N keys and N pointers instead of the normal implementation of N keys and N+1 pointers. In your Part III, it looks like you are implementing a B+ tree with the next and previous pointers.
Was This Post Helpful? 0
  • +
  • -

#9 Martyn.Rae  Icon User is offline

  • The programming dinosaur
  • member icon

Reputation: 545
  • View blog
  • Posts: 1,420
  • Joined: 22-August 09

Posted 04 August 2017 - 10:04 PM

Sadly you are very wrong on so many levels!

The B-Tree as you refer to it came some 5 years later (and I strongly suspect was based on the partial indexing data access mechanism described in I.C.T. 1900 Series Direct Access Techniques manual from 1966).

The implementation I have provided is a slightly modified version of the partial indexing mechanism that I worked with for some 10 years in the early 70's permitting it to be used in memory rather than on disc.

And no, I was not brought up as part of the generation that learned C++ as simple "C with classes". I was and still am an assembly programmer. I think live and dream assembly - a true programming dinosaur!!!
Was This Post Helpful? 0
  • +
  • -

#10 Skydiver  Icon User is offline

  • Code herder
  • member icon

Reputation: 5923
  • View blog
  • Posts: 20,251
  • Joined: 05-May 12

Posted 05 August 2017 - 08:19 AM

I would not be surprised if your suspicion bears fruit. A lot of lore was lost from early days of computing when people were working in silos and avenues for sharing ideas were not quite as easy as they are now.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1