
Heaps and Heaps of Data

While writing the code as I had originally planned, I realized there were certain pieces of information I would no longer need. The offsets and data lengths that Mimesis files used were necessary because information was being stored in a row-oriented manner. Now that I've changed the paradigm from rows to columns, those pieces of data have become obsolete. My original scheme was a master file plus a structure file and a data file for each column. The structure files have now been removed entirely, so a Mimesis data store will be composed of one master file and a set of data files, each data file being equivalent to one column of the "table".

Another change is that the master file no longer denotes one column as "primary". Instead, when calling methods of the Mimesis class that could involve a "primary" key, a boolean parameter controls whether the first column listed in the parameters is treated as the "primary".
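To illustrate the idea of a per-call "primary", here is a minimal Python sketch. It is not the Mimesis PHP code: the function name `get_columns`, the `use_primary` flag, and the assumption that fields at the same position in each column belong to the same row are all mine, purely for illustration.

```python
def get_columns(store, names, use_primary=False):
    """Return the requested columns as rows, optionally keyed by the first column.

    `store` is a hypothetical in-memory stand-in: column name -> list of fields.
    """
    # Assumption: fields at the same position in each column belong together.
    rows = list(zip(*(store[name] for name in names)))
    if not use_primary:
        return rows
    # Treat the first listed column as the "primary" and key each row by it.
    # If a key repeats, the later row replaces the earlier one.
    return {row[0]: row[1:] for row in rows}


store = {
    "user": ["ann", "bob"],
    "email": ["ann@example.com", "bob@example.com"],
    "age": ["31", "27"],
}
by_user = get_columns(store, ["user", "age"], use_primary=True)
```

Because the "primary" is chosen by the caller, the same store can be keyed by `user` in one call and by `email` in the next without touching the master file.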

These two subtle but important changes make the composition of every file consistent from a bytewise perspective, i.e. all Mimesis files have the following composition:

<?php /*somedatasomedataDELIMITERsomedatasomedataDELIMITER...*/?>


Much like a flat file database, a Mimesis file is basically segments of data broken up by some delimiter. The main differences lie in the restrictions a typical flat file database carries: it has to be rewritten entirely when an update is made, its delimiter is usually a single character that never changes, and the delimiter itself cannot be used in the various fields.
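As a rough picture of how such a file can be taken apart, here is a minimal Python sketch (not the Mimesis PHP code). The byte values are assumptions of mine: I pretend each delimiter is a marker byte 0xFF followed by one "kind" byte, and that field data never contains 0xFF. The real delimiters and any escaping are not shown here.

```python
import re

HEADER, FOOTER = b"<?php /*", b"*/?>"  # wrapper taken from the layout above

# Hypothetical delimiter: marker byte 0xFF plus one kind byte (0-255).
SEGMENT = re.compile(rb"([^\xff]*)\xff(.)", re.DOTALL)


def split_segments(raw):
    """Return (payload, kind) pairs for every delimited segment in a file."""
    body = raw[len(HEADER):-len(FOOTER)]  # strip the PHP comment wrapper
    return [(m.group(1), m.group(2)[0]) for m in SEGMENT.finditer(body)]


raw = HEADER + b"alice\xff\x01" + b"bob\xff\x01" + FOOTER
segments = split_segments(raw)
```

Since every file shares this shape, one routine like `split_segments` could serve the master file and every column file alike.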

It also occurred to me that one of the powerful aspects of Mimesis was its heap method of data storage, but that it was as yet incomplete, in the sense that deletions could no longer be achieved by simply removing an offset and a length as before. Previously, a deletion removed the locating data from the structure file, so Mimesis could no longer find the row in the data file. The new hurdle was that lengths and offsets are no longer present, so how do I go about deleting data without rewriting the whole database?

The solution is the tried and true method of using a flag. Both the previous version and this one have always had 256 possible delimiters, of which only 5 or so are in actual use. Therefore, I made use of one of those delimiters to denote a deleted portion of data.
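Here is a minimal Python sketch of flagging a deletion in place (again, not the Mimesis PHP code). The delimiter layout is hypothetical: a marker byte 0xFF followed by a kind byte, with 0x01 meaning "live" and 0x02 meaning "deleted". With that made-up layout a deletion overwrites a single byte; the real format may differ.

```python
import re

DELETED = 0x02  # hypothetical kind byte for a deleted segment (0x01 = live)

# Hypothetical delimiter: marker byte 0xFF followed by one kind byte.
SEGMENT = re.compile(rb"[^\xff]*\xff(.)", re.DOTALL)


def mark_deleted(path, index):
    """Flag the index-th segment as deleted by overwriting its kind byte."""
    with open(path, "r+b") as fh:
        data = fh.read()
        # The first match also swallows the file header, which is harmless
        # here because only the offset of the kind byte is needed.
        matches = list(SEGMENT.finditer(data))
        fh.seek(matches[index].start(1))
        fh.write(bytes([DELETED]))  # file length stays the same
```

Nothing else in the file moves, so the size is unchanged and every other segment stays exactly where it was.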

Most of the new Mimesis code makes heavy use of regular expressions to sift through the data files. The first step is typically to retrieve one of the Mimesis files and filter out the deleted data. Once you're left solely with valid data (i.e. not deleted from the heap), you can perform the operations you want.
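The filtering step might look something like this minimal Python sketch (not the Mimesis PHP code). The delimiter bytes are assumptions: marker 0xFF plus a kind byte, where 0x01 is "live" and 0x02 is "deleted". Note that the regular expression walks every segment in order and filters afterwards; matching only the live delimiter directly would let a deleted segment's kind byte leak into the next field.

```python
import re

LIVE = b"\x01"  # hypothetical kind byte for valid data (0x02 = deleted)

# Hypothetical delimiter: marker byte 0xFF followed by one kind byte.
SEGMENT = re.compile(rb"([^\xff]*)\xff(.)", re.DOTALL)


def live_fields(body):
    """Return only the payloads whose delimiter is not flagged as deleted."""
    return [m.group(1) for m in SEGMENT.finditer(body) if m.group(2) == LIVE]


body = b"alice\xff\x01" + b"bob\xff\x02" + b"carol\xff\x01"
valid = live_fields(body)
```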

So why not just delete the data permanently? This goes back to the original premise of the virtues of using a heap. It's simply more reliable to change two bytes in a file that already exists than to rewrite thousands of bytes at one time. Fewer errors, less time, and less possibility of data corruption or loss.

The drawback is an increase in "bloat". The more "bloat" data (deletions, repetitions) accrues, the longer the regular expression searches take to decompose the data of a file. This is why it will be necessary to implement a refreshing method, just as before, to reduce the bloat and speed up searches. Refreshing would rewrite all files, preserving only the non-"bloat" data.
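A refresh could be sketched roughly like this in Python (not the Mimesis PHP code). The delimiter bytes are hypothetical (marker 0xFF plus a kind byte, 0x01 for live), and the sketch only drops deleted segments; it leaves repetitions alone, since how repeats should be collapsed is not something I can show here. Writing to a temporary file and swapping it in is my own choice for the sketch, so that a failed rewrite leaves the old file intact.

```python
import os
import re
import tempfile

HEADER, FOOTER = b"<?php /*", b"*/?>"
LIVE = b"\x01"  # hypothetical kind byte for valid data

# Hypothetical delimiter: marker byte 0xFF followed by one kind byte.
SEGMENT = re.compile(rb"[^\xff]*\xff(.)", re.DOTALL)


def refresh(path):
    """Rewrite one file, keeping only segments not flagged as deleted."""
    with open(path, "rb") as fh:
        body = fh.read()[len(HEADER):-len(FOOTER)]
    kept = b"".join(
        m.group(0) for m in SEGMENT.finditer(body) if m.group(1) == LIVE
    )
    # Write the compacted copy next to the original, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(HEADER + kept + FOOTER)
    os.replace(tmp_path, path)
```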

Thus far I have coded three methods: getColumn, insertField, and entries. getColumn retrieves any number of columns from the store, can select them by regular expression, and supports using the first column listed in the parameters as the "primary". insertField creates all the relevant files (should they not exist) and inserts column data into the store; the data can be passed either as single items or as arrays of items (i.e. a single field or a whole column). entries is an analysis tool for measuring bloat. Depending on the flags passed, it returns the number of unique fields, deleted fields, and heap (valid but possibly repeated) fields present in the store. It also supports regular expressions for stipulating which columns to analyze.
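To show the kind of numbers entries reports, here is a minimal Python sketch (not the actual method). The name `count_entries`, its signature, and the delimiter bytes (marker 0xFF plus a kind byte, 0x01 live, 0x02 deleted) are assumptions for illustration, and unlike the real method it always returns all three counts rather than choosing by flags.

```python
import re

LIVE, DELETED = b"\x01", b"\x02"  # hypothetical kind bytes

# Hypothetical delimiter: marker byte 0xFF followed by one kind byte.
SEGMENT = re.compile(rb"([^\xff]*)\xff(.)", re.DOTALL)


def count_entries(store, column_pattern):
    """Count unique, deleted, and heap fields for columns matching a regex.

    `store` is a hypothetical stand-in: column name -> raw column bytes.
    """
    selected = re.compile(column_pattern)
    report = {}
    for name, body in store.items():
        if not selected.search(name):
            continue
        segments = [(m.group(1), m.group(2)) for m in SEGMENT.finditer(body)]
        live = [payload for payload, kind in segments if kind == LIVE]
        report[name] = {
            "unique": len(set(live)),  # distinct valid fields
            "deleted": sum(1 for _, kind in segments if kind == DELETED),
            "heap": len(live),  # valid fields, repeats included
        }
    return report


store = {
    "user": b"ann\xff\x01bob\xff\x02ann\xff\x01",
    "email": b"ann@example.com\xff\x01",
}
report = count_entries(store, r"^user$")
```

A wide gap between "heap" and "unique", or a large "deleted" count, would be the signal that a column is due for a refresh.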

If I had to choose a word to describe how my brain and coding methodology are working right now, it would be flexibility. Mimesis is now more freeform than ever. I'm both fascinated and equally creeped out by many of the ideas that are entering my brain and the direction Mimesis is taking.
