Developer Innumeracy: The Exabytes are coming!

Programmers have to deal with some pretty big numbers. I recently worked on a project that will have to process petabytes of data in a reasonable amount of time. A petabyte is on the order of 10^15 bytes. That is a LOT of data. The program I worked on initially had to process only 1.2 petabytes, which is small compared to Google, which apparently processes about 24 petabytes per day!

But how big is a petabyte? Well, let's look at it in terms of size. I can easily bite off 1" of a Snickers bar. How long would a Snickers bar have to be to allow me 10^15 bites?

There are 63,360 inches in a mile, so 10^15/63,360 = 1.5782828 x 10^10 miles. The earth has a circumference of 24,859.82 miles, so that Snickers bar would need to circle the globe about 634,873 times. Alternatively, the candy could reach the sun and back 84 times and still have enough left over to go around the earth more than 6,687 times!
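The candy bar figures are easy to re-derive. Here is a minimal sketch of the arithmetic; the Earth-Sun distance is my assumption (1 AU taken as 92,955,807 miles), since the post does not say which value was used.

```python
# Back-of-the-envelope check of the candy bar figures.
# Assumption (not stated in the post): Earth-Sun distance = 1 AU = 92,955,807 miles.

BITES = 10**15                          # one 1-inch bite per byte of a petabyte
INCHES_PER_MILE = 63_360
EARTH_CIRCUMFERENCE_MILES = 24_859.82   # figure used in the post
SUN_DISTANCE_MILES = 92_955_807         # assumed value of 1 AU

bar_miles = BITES / INCHES_PER_MILE
laps_around_earth = bar_miles / EARTH_CIRCUMFERENCE_MILES

# How many full trips to the sun and back fit, and what is left over?
round_trip_miles = 2 * SUN_DISTANCE_MILES
sun_round_trips = int(bar_miles // round_trip_miles)
leftover_miles = bar_miles - sun_round_trips * round_trip_miles
leftover_laps = leftover_miles / EARTH_CIRCUMFERENCE_MILES

print(f"bar length:          {bar_miles:.4e} miles")
print(f"laps around earth:   {laps_around_earth:,.0f}")
print(f"sun round trips:     {sun_round_trips}")
print(f"leftover earth laps: {leftover_laps:,.0f}")
```

Changing the assumed distance to the sun by a fraction of a percent shifts the leftover laps by hundreds, which is a nice reminder of how sensitive a remainder is at this scale.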

Maybe an inch is too big a unit for comparison. What does traveling to the sun and back 84 times really mean anyway? Let's look at it another way. In MS Word I was able to fill a page with 44 x 58 = 2,552 characters in my default font (Times New Roman 11pt, 1 inch margins). A piece of normal office paper is about 0.097 mm thick, so 10^15/2,552 = 3.918 x 10^11 pages makes a stack 3.801 x 10^10 mm tall. That is roughly 38,009.4 kilometers in terms of one HUGE stack of paper. The Kármán line lies at an altitude of 100 km and is generally considered to be the boundary between earth and space... so with 1 petabyte of information we could have more than 380 stacks of paper that reach all the way to space.
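The paper stack works the same way. In this sketch the inputs are the ones given in the text (2,552 characters per page, 0.097 mm per sheet, a 100 km Kármán line), and one character is counted as one byte. The page count comes out near 3.9 x 10^11; the figure near 3.8 x 10^10 is the height of the stack in millimeters.

```python
# Paper stack for one petabyte, one character per byte.
CHARS_PER_PAGE = 44 * 58        # page capacity measured in the post
PAGE_THICKNESS_MM = 0.097       # thickness of one sheet, from the post
KARMAN_LINE_KM = 100            # altitude of the Karman line
MM_PER_KM = 1_000_000

pages = 10**15 / CHARS_PER_PAGE
stack_mm = pages * PAGE_THICKNESS_MM
stack_km = stack_mm / MM_PER_KM
stacks_to_space = stack_km / KARMAN_LINE_KM

print(f"pages:           {pages:.3e}")
print(f"stack height:    {stack_mm:.3e} mm = {stack_km:,.1f} km")
print(f"stacks to space: {stacks_to_space:.0f}")
```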

A petabyte is a HUGE number of bytes. Yet, in the world of computing, this magnitude is becoming more and more common! It is important for the next generation of programmers to really have a sense of magnitude because computation often deals with very large scales.

One topic a while ago really made me laugh. The author had a small program and was not seeing any output even though no one could spot an error in the logic. Indeed the logic was correct. The problem was that, due to nested loops, the program would have required quite a long time to execute! I myself have, on a number of occasions, put an extra zero or two onto the size of some array and been baffled as to why the program seemed to stop working.
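I don't have that program any more, so the following is only a sketch of the kind of sanity check that would have answered the question. The loop bounds and the three-deep nesting are hypothetical, and the timing of a trivial loop body is a rough figure for one machine, not a benchmark.

```python
import time

def total_iterations(n, depth):
    """Number of innermost-body executions for `depth` nested loops of n steps each."""
    return n ** depth

def measure_seconds_per_iteration(sample=200_000):
    """Time a trivial loop body on this machine; a rough figure only."""
    start = time.perf_counter()
    total = 0
    for i in range(sample):
        total += i
    return (time.perf_counter() - start) / sample

def describe(seconds):
    """Render a duration in the largest sensible unit."""
    for unit, size in (("years", 365.25 * 86_400), ("days", 86_400),
                       ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

if __name__ == "__main__":
    per_iteration = measure_seconds_per_iteration()
    for n in (1_000, 10_000, 100_000):   # hypothetical loop bounds
        count = total_iterations(n, 3)   # three nested loops, also hypothetical
        print(f"n={n:>7,}: {count:.0e} iterations, roughly {describe(count * per_iteration)}")
```

Adding one zero to n multiplies the work by a thousand here, which is exactly how a program that "works" quietly turns into one that never seems to finish.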

Am I suggesting that every programmer should apply Big-O analysis to every project? Algorithmic efficiency should of course be kept in mind, but even before that there should be a sense of the numbers and magnitudes you are working with.

When you say you want to find all of the primes less than 10,000,000,000, you should have a sense of how big the task is, how much memory the search will need, roughly how long the calculation should take, and how much storage the results will occupy.
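For the primes question, a few lines get you into the right ballpark before any real code is written. This is a sketch under assumptions I am adding: the count uses the prime number theorem approximation N / ln(N), the search is a plain sieve of Eratosthenes, and the results are stored as 8-byte integers.

```python
import math

N = 10_000_000_000  # upper bound from the text

# Prime number theorem: the number of primes below N is roughly N / ln(N).
# The approximation undercounts a little, so treat it as a floor when planning.
approx_prime_count = N / math.log(N)

# Memory for a sieve of Eratosthenes covering every candidate below N.
sieve_bytes_per_flag = N          # one byte per candidate
sieve_bytes_bitset = N // 8       # one bit per candidate
sieve_bytes_odd_bitset = N // 16  # one bit per odd candidate

# Values up to 10^10 do not fit in 32 bits (2^32 is about 4.29e9),
# so assume 8-byte integers for the stored results.
result_bytes = approx_prime_count * 8

GB = 10**9
print(f"primes below N (estimate): {approx_prime_count:,.0f}")
print(f"sieve, byte per number:    {sieve_bytes_per_flag / GB:.2f} GB")
print(f"sieve, bit per number:     {sieve_bytes_bitset / GB:.2f} GB")
print(f"sieve, bit per odd number: {sieve_bytes_odd_bitset / GB:.3f} GB")
print(f"results as 8-byte ints:    {result_bytes / GB:.2f} GB")
```

Even this crude estimate says something useful: the naive byte-per-number sieve will not fit in the memory of a typical desktop, while the bit-packed versions will.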

If you want to calculate a trillion digits of pi -- how much will storage cost you?
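As a rough sketch of the pi question: the answer depends mostly on how the digits are encoded. The three encodings below are my choice of illustration, the price is the "less than $100 per terabyte" drive figure quoted in this post, and only the final result is counted (no working space during the computation, no backups).

```python
import math

DIGITS = 10**12          # a trillion decimal digits
TERABYTE = 10**12        # decimal terabyte, as drive vendors count it
DOLLARS_PER_TB = 100     # upper bound from the drive price quoted in the post

text_bytes = DIGITS                                     # ASCII, 1 byte per digit
bcd_bytes = DIGITS // 2                                 # packed BCD, 2 digits per byte
binary_bytes = math.ceil(DIGITS * math.log2(10) / 8)    # dense binary, log2(10) bits per digit

for name, size in (("ASCII text", text_bytes),
                   ("packed BCD", bcd_bytes),
                   ("dense binary", binary_bytes)):
    terabytes = size / TERABYTE
    print(f"{name:>12}: {terabytes:.3f} TB, at most about ${terabytes * DOLLARS_PER_TB:.0f}")
```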

If you want to convert 1.2 petabytes of image data from an old raw image format -- what modern formats might you use? Would it be better to store the images in a highly compressed format, where the time to compress/decompress might be high but the storage cost low, or in a less compressed format, where the storage costs are higher but the processing can be done more quickly? These decisions translate into real dollars.

Is allowing a few bytes of slack space per record really worth the storage costs? Is spending the extra bytes on that Flash video on your home page really worth the bandwidth requirements? When you consider that you may have hundreds of thousands or millions of records, or moderately heavy internet traffic, these seemingly insignificant costs multiply into significant values.
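Both questions are one multiplication each. Every number in this sketch is hypothetical (16 bytes of slack, 5 million records, a 5 MB video, 20,000 views a day); plug in your own.

```python
def wasted_bytes(slack_bytes_per_record, record_count):
    """Total storage lost to per-record slack space."""
    return slack_bytes_per_record * record_count

def monthly_transfer_bytes(asset_bytes, views_per_day, days=30):
    """Bandwidth consumed by one asset served on every page view."""
    return asset_bytes * views_per_day * days

MB = 10**6
TB = 10**12

# Hypothetical figures, for illustration only.
slack = wasted_bytes(slack_bytes_per_record=16, record_count=5_000_000)
video = monthly_transfer_bytes(asset_bytes=5 * MB, views_per_day=20_000)

print(f"slack space:    {slack / MB:.0f} MB")
print(f"video transfer: {video / TB:.1f} TB per month")
```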

It only takes 4 mouse clicks to review and approve a transaction. But if each operator is supposed to do this hundreds of times per day, and there are 40+ operators, how much time (and therefore money) is spent just refocusing the eyes, moving the mouse cursor over and clicking? How much do those extra clicks cost? How much do the few milliseconds it takes for a new window to pop up cost?
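Here is a sketch of the click arithmetic. The 4 clicks and 40 operators come from the scenario above; the 300 transactions per day, 1.5 seconds per click, 250 working days and $20 per hour are assumptions I picked to make the numbers concrete.

```python
def daily_click_hours(clicks_per_transaction, seconds_per_click,
                      transactions_per_operator, operators):
    """Hours per day the whole team spends aiming and clicking."""
    seconds = (clicks_per_transaction * seconds_per_click
               * transactions_per_operator * operators)
    return seconds / 3600

# 4 clicks and 40 operators are from the scenario; the rest are assumed.
hours_per_day = daily_click_hours(clicks_per_transaction=4,
                                  seconds_per_click=1.5,
                                  transactions_per_operator=300,
                                  operators=40)
WORKING_DAYS = 250       # assumed
HOURLY_COST = 20.0       # assumed, in dollars

annual_hours = hours_per_day * WORKING_DAYS
annual_cost = annual_hours * HOURLY_COST
saving_per_click_removed = annual_cost / 4

print(f"{hours_per_day:.1f} hours per day, {annual_hours:,.0f} hours per year")
print(f"about ${annual_cost:,.0f} per year; each click removed saves ${saving_per_click_removed:,.0f}")
```

Under these assumptions the team spends 20 hours a day just clicking, so shaving a single click off the workflow is worth a quarter of that.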

Innumeracy may seem like a silly word for an unimportant thing. Who really knows how big a billion is anyway? But software is often a game of multiplication as the number of users, or the number of records, or the number of transactions goes up. The little things that can seem insignificant on your development box or on your team whiteboard can multiply up into huge problems that need to be faced.

Today's challenges might be on the terabyte or even petabyte scale, but Moore's law is relentless and recent advances in electronics show no sign of the ascent plateauing any time soon. Google processes 24 petabytes of data a day -- in 2009 the world's digital data was estimated at 500 exabytes (an exabyte is 10^18 bytes) -- a terabyte drive costs less than $100 -- and my desktop's CPU has 4 cores, operates at 3 GHz and is already obsolete. I hate to sound like Chicken Little but, "The exabytes are coming! The exabytes are coming!" and these numbers that used to be "astronomical" are seeping their way into computing.

So developers, please spend a little time wrapping your head around the numbers. Do some calculations to help yourself bring these numbers into perspective.

And ask yourself: If some feature of my program is "mildly" irritating to users but I expect users to use the program daily -- does "mildly irritating" build up to "f*&^ing annoying"?



*BTW: if Google is processing 24 petabytes a day, then it would take about 20,833.3 days to reach 500 EB, so it would take about 57 years for Google to process 500 EB of data! (What's the betting they reach 500 EB in half that?)
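The footnote is a two-line calculation. This sketch assumes a constant 24 PB per day, which is surely too pessimistic, and a 365.25-day year; it comes out at about 57 years.

```python
PETABYTE = 10**15
EXABYTE = 10**18

world_data_bytes = 500 * EXABYTE     # 2009 estimate quoted in the post
google_per_day = 24 * PETABYTE       # daily volume quoted in the post

days = world_data_bytes / google_per_day
years = days / 365.25                # assumes a constant processing rate

print(f"{days:,.1f} days, or about {years:.0f} years")
```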

2 Comments On This Entry

KYA 

26 May 2010 - 09:39 AM
How awesome is math. Very.

Dogstopper 

26 May 2010 - 02:49 PM
Math is so great!
