In the parallel arrays approach, each record is broken down into its component fields, and arrays are assembled for each field across the collection. For example, consider a store keeping track of purchases, where for each purchase the name of the item, price, and day of the week the purchase was made are recorded. The data would be stored in 3 arrays, one each for names, prices, and days. All arrays would be of the same length, equal to the total number of purchases. (For another example, including Java code, see the DIC tutorial http://www.dreaminco...howtopic=120420.)
In contrast, in the array of objects approach, a class is defined describing the information associated with one record. Then the collection of records can be represented as an array of objects of the record class. In the store example, a "purchase" class can be defined, containing a field each for name, price, and day. The entire store data is an array of purchase objects, with one object for each purchase.
It has been argued that the array of objects approach is superior to parallel arrays (e.g. http://www.dreaminco...howtopic=147196). Arrays of objects are clearly more consistent with object-oriented programming concepts. The code is typically more elegant, although this aspect largely depends on the personal approach of the programmer. In turn, more elegant code is easier to understand, less error-prone, and easier to maintain and extend.
So should one always choose arrays of objects and forget about parallel arrays? One potential hidden cost of arrays of object is worse execution time. Below I describe an experiment that compares execution times of processing of the same data stored in 4 different modes.
The type of data used was purchase records from a store, as described above. The item names I made longer than typically necessary, to simulate the case of larger record sizes. Prices were stored as doubles, and days of the week were represented by an enumeration. The 4 modes of data representation were:
- Subclass Vector: Define a “Purchase” class with 3 data fields. The collection is a vector of Purchases.
- Subclass Array: Define a “Purchase” class with 3 data fields. The collection is a plain array of Purchases.
- Parallel Vector: Use 3 vectors, one each for name, price, and day.
- Parallel Array: Use 3 plain arrays, one each for name, price, and day.
To measure the time cost of data access, the mean item price was computed for purchases that were made on a Sunday. For each mode of representation the same data was loaded, and a vector of indexes specifying Sundays was provided. Total execution time for calculating the mean item price was determined for different list lengths, ranging from 1,000 to 100,000 records.
The source code can be found here:
Number of downloads: 216
It is composed of the following files:
An enumeration that defines the days of the week.
Interface to a list of store records. It allows adding items to the list, getting the size of the list, and retrieving data for single records, based on list index.
The 4 implementations of the StoreRecords interface.
Class containing main(). It defines 4 StoreRecords objects, one for each type of data representation. The main program loop iterates through a set of list lengths. For each, random data records are generated, each being added to the 4 StoreRecords objects. A vector of indexes denoting Sunday purchases is then derived. The execution time for computing the mean item price of Sunday purchases is measured for each data representation. This step is actually repeated multiple times, to provide measurable times on the order of hundreds of miliseconds or more.
The results of 5 runs of the timing program are shown in the following plot.
When the data was stored in a vector, read access times were essentially the same, whether the records were represented as a parallel array, or list of objects. The parallel array version was 3% faster, but the difference was not statistically significant.
An important gain in speed was observed when data was stored in plain arrays. For arrays of objects, switching from a vector representation to plain arrays shortened time by 36%. In contrast, in the parallel array case, the gain in speed was 60% when going from vectors to plain arrays. Considering only the faster cases of plain arrays, parallel arrays were 18% faster than lists of objects.
The above data suggest some rules of thumb for choosing the best data representation for your own programs. When the focus of a project is on facilitating the software development process, using arrays of objects is the better option. It allows for all benefits of a clearer object-oriented design, including quick development and testing, and easier maintenance. However, when speed of repeated access to a long list of records is critical to the application, a parallel array approach will be beneficial. By separating out computationally intensive parts of a program into their own classes, and implementing them with underlying parallel arrays, a balance can be struck between clear code and speed.