Page 1 of 1

A case for parallel arrays Rate Topic: -----

#1 Paul-  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 61
  • View blog
  • Posts: 260
  • Joined: 11-December 09

Posted 16 March 2010 - 08:28 AM

When working with collections of records, one can choose from two main types of data representation: parallel arrays of individual fields, and arrays of objects of a record class. In this tutorial I would like to point out some criteria for choosing one over the other.

In the parallel arrays approach, each record is broken down into its component fields, and arrays are assembled for each field across the collection. For example, consider a store keeping track of purchases, where for each purchase the name of the item, price, and day of the week the purchase was made are recorded. The data would be stored in 3 arrays, one each for names, prices, and days. All arrays would be of the same length, equal to the total number of purchases. (For another example, including Java code, see the DIC tutorial http://www.dreaminco...howtopic=120420.)

In contrast, in the array of objects approach, a class is defined describing the information associated with one record. Then the collection of records can be represented as an array of objects of the record class. In the store example, a "purchase" class can be defined, containing a field each for name, price, and day. The entire store data is an array of purchase objects, with one object for each purchase.

It has been argued that the array of objects approach is superior to parallel arrays (e.g. http://www.dreaminco...howtopic=147196). Arrays of objects are clearly more consistent with object-oriented programming concepts. The code is typically more elegant, although this aspect largely depends on the personal approach of the programmer. In turn, more elegant code is easier to understand, less error-prone, and easier to maintain and extend.

So should one always choose arrays of objects and forget about parallel arrays? One potential hidden cost of arrays of object is worse execution time. Below I describe an experiment that compares execution times of processing of the same data stored in 4 different modes.

The type of data used was purchase records from a store, as described above. The item names I made longer than typically necessary, to simulate the case of larger record sizes. Prices were stored as doubles, and days of the week were represented by an enumeration. The 4 modes of data representation were:
  • Subclass Vector: Define a “Purchase” class with 3 data fields. The collection is a vector of Purchases.
  • Subclass Array: Define a “Purchase” class with 3 data fields. The collection is a plain array of Purchases.
  • Parallel Vector: Use 3 vectors, one each for name, price, and day.
  • Parallel Array: Use 3 plain arrays, one each for name, price, and day.


To measure the time cost of data access, the mean item price was computed for purchases that were made on a Sunday. For each mode of representation the same data was loaded, and a vector of indexes specifying Sundays was provided. Total execution time for calculating the mean item price was determined for different list lengths, ranging from 1,000 to 100,000 records.

The source code can be found here:
Attached File  store-records.zip (3.75K)
Number of downloads: 230

It is composed of the following files:

WeekDay.java
An enumeration that defines the days of the week.

StoreRecords.java
Interface to a list of store records. It allows adding items to the list, getting the size of the list, and retrieving data for single records, based on list index.

StoreRecordsSubclassVector.java
StoreRecordsSubclassArray.java
StoreRecordsParallelVector.java
StoreRecordsParallelArray.java

The 4 implementations of the StoreRecords interface.

StoreStats.java
Class containing main(). It defines 4 StoreRecords objects, one for each type of data representation. The main program loop iterates through a set of list lengths. For each, random data records are generated, each being added to the 4 StoreRecords objects. A vector of indexes denoting Sunday purchases is then derived. The execution time for computing the mean item price of Sunday purchases is measured for each data representation. This step is actually repeated multiple times, to provide measurable times on the order of hundreds of miliseconds or more.

The results of 5 runs of the timing program are shown in the following plot.
Attached Image

When the data was stored in a vector, read access times were essentially the same, whether the records were represented as a parallel array, or list of objects. The parallel array version was 3% faster, but the difference was not statistically significant.

An important gain in speed was observed when data was stored in plain arrays. For arrays of objects, switching from a vector representation to plain arrays shortened time by 36%. In contrast, in the parallel array case, the gain in speed was 60% when going from vectors to plain arrays. Considering only the faster cases of plain arrays, parallel arrays were 18% faster than lists of objects.

The above data suggest some rules of thumb for choosing the best data representation for your own programs. When the focus of a project is on facilitating the software development process, using arrays of objects is the better option. It allows for all benefits of a clearer object-oriented design, including quick development and testing, and easier maintenance. However, when speed of repeated access to a long list of records is critical to the application, a parallel array approach will be beneficial. By separating out computationally intensive parts of a program into their own classes, and implementing them with underlying parallel arrays, a balance can be struck between clear code and speed.

Is This A Good Question/Topic? 2
  • +

Replies To: A case for parallel arrays

#2 macosxnerd101  Icon User is online

  • Self-Trained Economist
  • member icon




Reputation: 10561
  • View blog
  • Posts: 39,072
  • Joined: 27-December 08

Posted 17 March 2010 - 08:45 AM

One confounding variable for this is that Vectors are synchronized, which makes this structure less efficient in evironments without multiple Threads. You might get more realistic results if you use ArrayLists, which like Vectors, are dynamic arrays, but are unsynchronized, so they run faster.

Quote

It has been argued that the array of objects approach is superior to parallel arrays (e.g. http://www.dreaminco...wtopic=147196). Arrays of objects are clearly more consistent with object-oriented programming concepts. The code is typically more elegant, although this aspect largely depends on the personal approach of the programmer. In turn, more elegant code is easier to understand, less error-prone, and easier to maintain and extend.

In terms of using classes vs. parallel arrays, the major advantage is encapsulation (as well as extensibility and modularity, as you said). Basically, this means you can more tightly associate attributes with variables through a blueprint design. This design becomes superior to parallel arrays when manipulating your data, specifically for purposes like sorting, as your attributes won't become mismatched. However, when you use parallel arrays, it is very easy to mismatch corresponding elements.

Overall, good tutorial. Keep up the good work! :^:
Was This Post Helpful? 0
  • +
  • -

#3 Paul-  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 61
  • View blog
  • Posts: 260
  • Joined: 11-December 09

Posted 19 March 2010 - 08:14 AM

View Postmacosxnerd101, on 17 March 2010 - 07:45 AM, said:

One confounding variable for this is that Vectors are synchronized, which makes this structure less efficient in evironments without multiple Threads. You might get more realistic results if you use ArrayLists, which like Vectors, are dynamic arrays, but are unsynchronized, so they run faster.


Thanks for pointing out the difference between vectors and arraylists. I did some additional tests using arraylists and they turned out to behave in between vectors and plain arrays, but closer to arrays. The speed gain from using parallel arrays was less, but comparable to that seen for the plain array case. Also overall execution time was slightly longer.
Was This Post Helpful? 0
  • +
  • -

#4 baavgai  Icon User is offline

  • Dreaming Coder
  • member icon

Reputation: 5823
  • View blog
  • Posts: 12,675
  • Joined: 16-October 07

Posted 19 March 2010 - 09:01 AM

Your test focuses on retrieving one item type at a time, which heavily weighs the test in favor of a single array. You force your object implementation to make an extra method call for every point in the implementation, also reducing speed. You might also want to try sorting or just removing or swapping items...

Even without the stacked deck, even if an array of objects was marginally slower compared to a parallel implementation, I'd still prefer it. The real cost is when a program needs to be modified to include a couple extra fields in the record. A savings in runtime milliseconds means little compared to a savings in man hours.

Object Oriented Programming is an understood compromise. It trades the overhead of organization and abstraction for the raw speed of messing about at lower level. The gain is maintainable code. It is a trade that most programmers are more than willing to make. If you find that compromise unpalatable, then Java is not the language for you.

This post has been edited by baavgai: 19 March 2010 - 09:02 AM

Was This Post Helpful? 0
  • +
  • -

#5 Paul-  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 61
  • View blog
  • Posts: 260
  • Joined: 11-December 09

Posted 19 March 2010 - 10:15 AM

baavgai: I appreciate you taking the time to read my post and commenting on it. My intention was merely to point out that choices you make as a programmer involve trade-offs. As long as people understand the compromises they make and the available alternatives, everything is fine. I can also empathize with programmers being reluctant to spend 10x time for a 1% gain in speed. However, one must consider the circumstances. For example, if you are working on a successful product, used by millions of people every day, a 1% increase in speed may be worth spending time on. The time savings for users add up quickly, and can make an enormous difference overall (hint, hint, a popular OS).

Btw, nice signature!

This post has been edited by Paul-: 19 March 2010 - 10:49 AM

Was This Post Helpful? 0
  • +
  • -

#6 macosxnerd101  Icon User is online

  • Self-Trained Economist
  • member icon




Reputation: 10561
  • View blog
  • Posts: 39,072
  • Joined: 27-December 08

Posted 21 March 2010 - 12:30 PM

If you could have your word processor open up milliseconds faster, would you really notice? I personally wouldn't. As baavgai said, maintainable code is more important than marginal increase in runtime. And honestly, no programmer is going to be writing code that runs significantly faster that is unmaintainable. The program would collapse like a NetBeans GUI (sooner rather than later), and you as the programmer would have some irate users knocking on your doorstep. In addition, the whole point of OOP is for reliability, which is accomplished mainly through encapsulation. Parallel arrays defeats the purpose of any encapsulation.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1