
Experiments in Packing and Unpacking

So, I pulled another piratey move and completely reworked the Mimesis system. Yeah, I know it's getting old already; can't be helped. For whatever reason I keep coming up with "improvements". It just so happens I may have broken some stuff along the way, but I'm not going to dwell on it, so neither should you.

What am I doing differently?

I came to realize that the methods tableExists, structExists, and deleteTable were just excess baggage, so I deleted them. The first two can be replaced with a simple file_exists(Mimesis::table(true)) and that's that, and deleteTable is equally useless because it's easily done with unlink(Mimesis::table(true)).
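As a minimal sketch of that replacement, the snippet below uses a temporary file as a stand-in for whatever path Mimesis::table(true) returns. The $tablePath variable and the temp file are assumptions made purely for illustration; they are not part of Mimesis.

```php
<?php
// Hypothetical stand-in for the path Mimesis::table(true) would return.
// tempnam() creates an empty file and hands back its path.
$tablePath = tempnam(sys_get_temp_dir(), 'mim');

// What tableExists / structExists used to answer.
$existsBefore = file_exists($tablePath);

// What deleteTable used to do.
$deleted = unlink($tablePath);

// unlink() clears PHP's stat cache for the file, so this check is fresh.
$existsAfter = file_exists($tablePath);
```

In real use the only change would be passing the Mimesis path instead of the temp file.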

The renameRow method has been excluded because it messed with certain numbers inside of the structural file. Until I revise it (if that's even feasible given the changes I've made) it will remain excluded.

There was a bug in the Mimesis::getRow method when a regular expression was used; this has been fixed.

The desanitize and sanitize functions no longer call serialize or unserialize within their function bodies. Serialization is now left to be applied where and when it's needed, either by sanitizing serialized data or by unserializing desanitized data. The Polarizer class changed to reflect this, and a couple of aesthetic code changes were also made to the Polarizer class's constants.

The most important transformation occurs at the file level. Namely, the structural file's contents have been modified heavily. The old structure of the structural file was as follows:
<?php /*row1FSrow2FS...SSoffset1FSoffset2FS...SSlength1FSlength2FS...SScolumn1FScolumn2FS...SSnextoffsetFSmodificationsFSSS*/?>

The experimental structure that's currently in place is:
<?php /*row1FSrow2FS...SSpackedoffsetsSSpackedlengthsSScolumn1FScolumn2FS...SSpackednextoffset,uniques,modifications*/?>

The difference is that offsets, lengths, next offset, and modifications used to be stored as character data with special separators. Now there is no need for special separation: since all of this data is numeric, and of the integer variety, I've elected to use the pack function to store every number in 4 bytes. It makes the structural file a bit less human readable, but you gain more storage capacity. In a previous blog entry I mentioned how the size of a structural file had gone from 250 kb to 200 kb, but with the new method it's been reduced to 140 kb (another 30% improvement).
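To make the two storage styles concrete, here is a rough sketch that writes the same list of integers both ways. The sample offsets, the "\x1C" separator, and the 'N' (unsigned 32-bit big-endian) format code are all assumptions for illustration; the actual separator and pack format Mimesis uses may well differ.

```php
<?php
// Hypothetical offsets; the real ones live in the structural file.
$offsets = [0, 128, 4096, 2147483647];

// Old style: decimal strings joined by a separator character.
$asText = implode("\x1C", $offsets);

// New style: every integer packed into exactly 4 bytes, no separators needed.
// 'N*' means "repeat the unsigned 32-bit big-endian format for every argument".
$packed = pack('N*', ...$offsets);

// unpack() returns a 1-indexed array, so array_values() re-bases it to 0.
$restored = array_values(unpack('N*', $packed));

$textLength   = strlen($asText);   // digits plus separators
$packedLength = strlen($packed);   // always 4 bytes per integer
```

The packed form costs a fixed 4 bytes per number, so the savings come from larger values: a ten-digit offset plus its separator takes 11 bytes as text.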

Using the pack and unpack functions should be quicker than using implode and explode to achieve a similar effect, and sanitize and desanitize can now be applied to an entire section rather than piece by piece. All of these changes should help Mimesis run faster, though that's an inference; I haven't bothered to benchmark it. As a result of using pack and unpack it also became apparent that Mimesis can only support files no larger than 2,147,483,647 bytes. The reason is that PHP treats 32-bit integers as signed, so the largest positive 32-bit integer unpack can return is 2147483647, which is the functional limit of the database. This limit was present before, though; I just hadn't realized it. I thought I was getting around it by using the bcadd function in Mimesis, but I realize now this was a mistake, and using regular addition in place of bcadd also helps speed up Mimesis.
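The signed 32-bit ceiling can be seen with a quick round trip. This sketch assumes a 64-bit PHP build (so that 2147483648 is still an ordinary integer before packing) and uses the 'l' format code, which is PHP's signed 32-bit integer in machine byte order; which format code Mimesis actually uses is not stated here.

```php
<?php
$max = 2147483647; // 2^31 - 1, the largest signed 32-bit integer

// Packing and unpacking the maximum value returns it unchanged.
$roundTrip = unpack('l', pack('l', $max))[1];

// One past the maximum no longer fits: only the low 32 bits are kept,
// and reading them back as a signed integer yields a negative number.
$wrapped = unpack('l', pack('l', $max + 1))[1];
```

Any offset past 2147483647 would therefore come back as a negative value instead of a usable file position.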

What's next?

Assuming this experimental version functions reliably, I'm going to consider changing every array_map call to array_walk, as I imagine that will be more memory efficient (and thus speed up operations further).
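For reference, the difference between the two functions looks roughly like this. The $rows data and the trim step are made up for illustration, and whether the in-place version actually saves memory or time in Mimesis is unmeasured.

```php
<?php
// Hypothetical row data.
$rows = [' alpha ', ' beta ', ' gamma '];

// array_map builds and returns a second array; the original is untouched.
$mapped = array_map('trim', $rows);

// array_walk visits each element by reference and changes it in place,
// so no second array is created.
$walked = $rows;
array_walk($walked, function (&$value) {
    $value = trim($value);
});
```

Both approaches end with the same values; the only difference is whether a new array is allocated for the result.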

I may decide against renameRow completely rather than revise it.

As a very far-off hypothetical, I may decide that files should not have a functional limit. I'm making an EXTREME guess that by using fseek with the SEEK_CUR flag and an offset it's possible to traverse files of any given size. The issue would then be how to store numbers larger than 32 bits and interpret them correctly using the pack/unpack functions. Currently, every integer within the structural file falls in the range 0 to 2147483647 inclusive, and all operations performed on files presume this to be true.
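Purely as a thought experiment, both halves of that idea could be sketched as follows. The helper names packOffset64, unpackOffset64, and seekInHops are hypothetical and not part of Mimesis, the integer splitting assumes a 64-bit PHP build, and the seeking is demonstrated on a tiny in-memory stream, so this says nothing about how fseek behaves on truly huge files.

```php
<?php
// Store a non-negative integer as two unsigned 32-bit halves (8 bytes total).
function packOffset64(int $value): string
{
    $high = ($value >> 32) & 0xFFFFFFFF;
    $low  = $value & 0xFFFFFFFF;
    return pack('NN', $high, $low);
}

// Read the two halves back and recombine them.
function unpackOffset64(string $bytes): int
{
    $parts = unpack('Nhigh/Nlow', $bytes);
    return ($parts['high'] << 32) | $parts['low'];
}

// Move the file pointer forward in several relative hops, each of which
// could be kept at or below 2147483647. fseek() returns 0 on success.
function seekInHops($handle, array $hops): bool
{
    foreach ($hops as $hop) {
        if (fseek($handle, $hop, SEEK_CUR) !== 0) {
            return false;
        }
    }
    return true;
}

$bytes = packOffset64(5000000000);   // larger than any 32-bit integer
$value = unpackOffset64($bytes);

// Small stand-in stream: 100 bytes, then three hops forward.
$handle = fopen('php://temp', 'w+b');
fwrite($handle, str_repeat('x', 100));
rewind($handle);
$ok = seekInHops($handle, [30, 30, 15]);
$position = ftell($handle);
fclose($handle);
```

Doubling each stored number to 8 bytes would give back some of the size savings from packing, so it would be a trade-off rather than a free upgrade.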
