Requirements: A Graphing Calculator that can run regressions with r and r2 values, preferably logarithmic, linear, quadratic, cubic, and exponential regressiosns; and a compiler to test your algorithm.
So to start off, we need to keep two things in mind to validate our statistical claim. We need the sample to be a Simple Random Sample (SRS), and we need our data to be normally distributed. If you are unaware of what an SRS is, it basically means that elements for our sample are being selected completely randomly (or in the case of computers, completely pseudo-randomly) without any bias towards or against a particular element.
As for the normal distribution, this is a little different than in standard sampling distributions. When you do a sampling distribution for means, you notice that as the number of samples --> infinity, you begin to approach or approximate a normal distribution. However, with regression, we rely on our residuals to tell us this. Basically, if your residuals are obviously patterned (ie., clustered, form a shape, etc.) when you graph them, then your sample is not normally distributed. However, if your residuals are randomly distributed, then there is no reason to believe that this model isn't the best model out there, so you can assume normality.
So let's talk about gathering our sample some more. In order for us to safely assume normality, we should take 30-40 samples at a minimum. These samples consist of one trial for our algorithm, with a sample size of n. Since we are looking for a trend, we need varying sample sizes including small samples as well as large samples (anywhere from n = 10 to n >= 12,000) to get a good idea of the trend.
Once you collect your data, plug it into your graphing calculator with list1 (the x-coordinate list) holding the sample sizes or the size of the problem, and list2 (the y-coordinate list) holding the number of steps required to produce the solution. Now finding the best equation may take a little trial and error, as you will have to try multiple regression calculators to find the best one. Here are some disqualifiers, though:
- The |r| < .70: This means that there is a weak correlation between the size of the problem and the number of steps required to produce a solution, or at least that there is probably a better parent graph. So for example, if you have an obviously exponential correlation, you'll have a low |r| value if you try to run a linear regression on such data.
- The graph of the residuals forms an obvious pattern. This means that the equation does not take into account all elements, so there is probably a better model.
Also, as the graphing calculators don't have an "N log(n) regression," you will have to utilize the logarithmic regression and extrapolate O(n log(n)) from the rule of logs that states: log(u^x) = x log(u).
So which equation do I choose? Choose the equation with the highest |r| and r2 values, as those indicate a strong correlation of the model to the data, as well as the randomly distributed residuals, as this indicates that our sample comes from a normal distribution.
Once we choose the best regression equation, we simply take the power of the function (with the exception of O(n log(n)))) and use that as our Big-O. So if we have a quadratic function, then our Big-O is O(n^2).