ANALYSIS OF ALGORITHMS STUDY GUIDE
Empirical analysis.
If the running time of our program (approximately) obeys a power law T(n) ~ a n^b,
we can use a doubling hypothesis to estimate the coefficients a and b.
Tilde notation.
We say that f(n) ~ g(n) if f(n)/g(n) converges to 1 as n gets large.
This is a general concept about mathematical functions
and is not restricted to running time, memory, or any other specific domain.
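As a small numeric illustration of the definition, the sketch below compares f(n) = n(n-1)(n-2)/6 with g(n) = n^3/6 and prints the ratio f(n)/g(n) as n grows. The two functions and the class name TildeDemo are chosen here for illustration only; they are not taken from this guide.

```java
class TildeDemo {
    // Example function: the number of ways to choose 3 items out of n.
    static double f(double n) {
        return n * (n - 1) * (n - 2) / 6.0;
    }

    // Candidate tilde approximation of f.
    static double g(double n) {
        return n * n * n / 6.0;
    }

    public static void main(String[] args) {
        // The ratio should approach 1 as n gets large if f(n) ~ g(n).
        for (int n = 10; n <= 1_000_000; n *= 10) {
            System.out.printf("n = %8d   f(n)/g(n) = %.6f%n", n, f(n) / g(n));
        }
    }
}
```

The lower-order terms of f(n) matter less and less as n grows, which is why tilde notation discards them.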
Cost model.
For theoretical analyses of running time in COS 226, we will assume a cost model,
namely that some particular operation (or operations) dominates the running time of a program.
Then, we express the running time in terms of the total number of that operation
as a function of the input size.
To simplify things, we usually give this frequency count in tilde notation.
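A minimal sketch of a frequency count under a cost model, assuming we choose array accesses as the dominant operation. The pair loop and the names used here are hypothetical examples, not code from the course.

```java
class CostModelDemo {
    // Counts the array accesses made while examining every pair (i, j) with i < j.
    static long pairLoopArrayAccesses(int[] a) {
        long accesses = 0;
        int zeroSumPairs = 0;  // the "real" work; not part of the cost model
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                accesses += 2;  // one read of a[i] and one read of a[j]
                if (a[i] + a[j] == 0) zeroSumPairs++;
            }
        }
        return accesses;
    }

    public static void main(String[] args) {
        for (int n = 1000; n <= 8000; n *= 2) {
            long count = pairLoopArrayAccesses(new int[n]);
            System.out.printf("n = %5d   accesses = %10d   accesses / n^2 = %.4f%n",
                    n, count, count / ((double) n * n));
        }
    }
}
```

The exact count is n(n-1), so in tilde notation the frequency of array accesses is ~ n^2.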
Order of growth.
If we have two functions f(n) and g(n), and f(n) ~ c g(n) for some constant c > 0,
we say the order of growth of f(n) is g(n).
Typically g(n) is one of the following functions:
1, log n, n, n log n, n^2, n^3, or 2^n.
Worst-case order of growth isn't everything.
Just because one algorithm has a better order of growth than another does not
mean that it is faster in practice. We will encounter some notable counterexamples,
including quicksort vs. mergesort.
Memory analysis.
Know how to calculate the memory utilization of a class with the 64-bit memory model
from the textbook.
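The sketch below shows the kind of arithmetic involved. It assumes the usual figures for the textbook's 64-bit model: 16 bytes of object overhead, 8 bytes per reference, 4 bytes per int, 24 bytes of array overhead, and padding to a multiple of 8 bytes. Check these constants against the textbook before relying on them. The helper names are hypothetical.

```java
class MemoryModelDemo {
    // Assumed constants for the 64-bit memory model (verify against the textbook).
    static final int OBJECT_OVERHEAD = 16;
    static final int REFERENCE = 8;
    static final int INT = 4;
    static final int ARRAY_OVERHEAD = 24;

    // Rounds a byte count up to the next multiple of 8.
    static long padTo8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // Memory of one object with the given numbers of int and reference fields,
    // not counting the objects that the references point to.
    static long objectBytes(int intFields, int referenceFields) {
        return padTo8(OBJECT_OVERHEAD + (long) INT * intFields
                + (long) REFERENCE * referenceFields);
    }

    // Memory of an int[] of length n.
    static long intArrayBytes(int n) {
        return padTo8(ARRAY_OVERHEAD + (long) INT * n);
    }

    public static void main(String[] args) {
        // Three int fields: 16 + 12 = 28, padded to 32.
        System.out.println(objectBytes(3, 0));
        // One int field and one reference field: 16 + 4 + 8 = 28, padded to 32.
        System.out.println(objectBytes(1, 1));
        // int[10]: 24 + 40 = 64.
        System.out.println(intArrayBytes(10));
    }
}
```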
Theoretical and empirical analysis.
Hypotheses generated through theoretical analysis (or guesswork like our power law
assumption) should be validated with data before being fully trusted.
Recommended Problems
C level
- Textbook 1.4.4
- Suppose that you collect the following timing data for a program as a function of the input size N.
  N         time
  125       0.03 sec
  1,000     1.00 sec
  8,000     32.00 sec
  64,000    1,024.00 sec
  512,000   32,768.00 sec
  Estimate the running time of the program (in seconds) as a function of N and use tilde
  notation to simplify your answer.
  Hint: recall that log_b a = lg a / lg b.
Answers
T(N) ~ (1 / 100,000) * N^(5/3)
When N increases by a factor of 8, the running time increases by a factor of 32. Thus,
T(N) = a N^b, where b = log_8 32 = lg 32 / lg 8 = 5/3. Since T(1000) = 1.00, we have 1.00 = a * 1000^(5/3) = a * 100,000, which implies a = 1 / 100,000.
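The same calculation can be sketched in code. The class and method names below are hypothetical; the sketch assumes the power law T(N) = a N^b holds exactly between the two rows of the table that it uses.

```java
class PowerLawFit {
    // Exponent b from two measurements (n1, t1) and (n2, t2), assuming T(n) = a * n^b.
    static double exponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    // Coefficient a from one measurement (n, t) and a known exponent b.
    static double coefficient(double n, double t, double b) {
        return t / Math.pow(n, b);
    }

    public static void main(String[] args) {
        // Rows N = 1,000 and N = 8,000 from the timing table.
        double b = exponent(1000, 1.00, 8000, 32.00);
        double a = coefficient(1000, 1.00, b);
        System.out.printf("b = %.4f   a = %.3e%n", b, a);

        // Use the fitted model to predict another row of the table.
        System.out.printf("predicted T(64000) = %.2f sec%n", a * Math.pow(64000, b));
    }
}
```

Comparing the prediction for N = 64,000 with the measured 1,024.00 seconds is a quick way to validate the fit against data that was not used to build it.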
B level
- Textbook 1.4.5
- Spring 2012 Midterm, #1
- For each of the functions shown, give the best order of
  growth of the running time.
public static int f1(int n) {
    int x = 0;
    for (int i = 0; i < n; i++)
        x++;
    return x;
}
public static int f2(int n) {
    int x = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i*i; j++)
            x++;
    return x;
}
public static int f3(int n) {
    if (n <= 1) return 1;
    return f3(n-1) + f3(n-1);
}
public static int f4(int n) {
    if (n <= 1) return 1;
    return f4(n/2) + f4(n/2);
}
public static int f5(int n) {
    if (n <= 1) return 1;
    return f1(n) + f5(n/2) + f5(n/2);
}
public static void f6(int n) {
    // 1<<i is the same as 2^i.
    // Ignore integer overflow.
    // 1<<i takes constant time.
    for (int i = 0; i < n; i = 1 << i);
}
Answers
- f1 is linear.
- f2 is N^3 because the inner loop runs i^2 times for each value i of the
  outer loop variable.
  The simplest way to see this is to realize that the total is approximately the integral of i^2 from 0 to N, which is N^3/3.
- f3 is 2^N. Each call spawns two more calls. Thus by the time we get to the bottom
  level (where n = 1), we've made about 2^N total calls of f3.
- f4 is linear. This is similar to the pattern that we saw in Mergesort and Quicksort,
  except that each recursive call does only a constant amount of work instead of a linear amount.
  It is the same as the pattern for bottom-up heapification. At the top level, we do 1 unit of
  work; at the 2nd level, we do 2 units of work; at the 3rd level, we do 4 units, etc. The total
  amount of work is thus given by 1 + 2 + 4 + 8 + ... + N. This sum is linear in N.
- f5 is N log N. This is the exact same pattern as Mergesort and Quicksort. If you want to think
  of it as a sum, then it's N + N + N + ... + N, with lg N summands.
- f6 is log* N. After the first iteration, i = 1. After the second iteration, i = 2. After the third iteration, i = 2^2. After the fourth iteration, i = 2^(2^2) = 16, etc.
  This takes about log* N steps to reach N. If you weren't totally sure, you could have also
  observed that log* N was the only answer between constant and log N.
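Some of these answers can be checked empirically by instrumenting the functions. The sketch below uses hypothetical counting variants of f2, f3, and f6: it counts the increments in f2, the total calls made by f3, and the loop iterations in f6. Because the problem says to ignore integer overflow, the f6 variant rejects n above 65,536, where the shift would wrap around.

```java
class GrowthCounts {
    // Number of times x++ runs in f2: the sum of i^2 for i in [0, n).
    static long f2Increments(int n) {
        long count = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < i * i; j++)
                count++;
        return count;
    }

    // Total number of calls made by f3(n), including the initial call.
    static long f3Calls(int n) {
        if (n <= 1) return 1;
        return 1 + f3Calls(n - 1) + f3Calls(n - 1);
    }

    // Number of iterations of the loop in f6.
    static int f6Iterations(int n) {
        if (n > 65536) throw new IllegalArgumentException("shift would overflow");
        int steps = 0;
        for (int i = 0; i < n; i = 1 << i) steps++;
        return steps;
    }

    public static void main(String[] args) {
        for (int n = 100; n <= 400; n *= 2)
            System.out.printf("f2: n = %3d   count / n^3 = %.4f%n",
                    n, f2Increments(n) / Math.pow(n, 3));
        for (int n = 5; n <= 20; n += 5)
            System.out.printf("f3: n = %2d   calls / 2^n = %.4f%n",
                    n, f3Calls(n) / Math.pow(2, n));
        for (int n : new int[] {2, 16, 65536})
            System.out.printf("f6: n = %5d   iterations = %d%n", n, f6Iterations(n));
    }
}
```

If the orders of growth are right, the first two ratios should settle toward constants as n grows, and the f6 count should barely move even when n grows enormously.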
- Consider the following three algorithms:
  - Algorithm 1 solves problems of size N by recursively dividing them into 2 sub-problems of size N/2
    and combining the results in time c (where c is some constant).
  - Algorithm 2 solves problems of size N by solving one sub-problem of size N/2 and performing some processing taking some constant time c.
  - Algorithm 3 solves problems of size N by solving two sub-problems of size N/2 and performing a linear amount (i.e., cN where c is some constant) of extra work.
  (a) For each algorithm, write down a recurrence relation showing how T(N), the running time on an instance of size N, depends on the running time of a smaller instance.
  (b) For each recurrence relation, what is the running time T(N) (use tilde notation)?
Answers
(a) Algorithm 1: T(N) = 2T(N/2) + c
    Algorithm 2: T(N) = T(N/2) + c
    Algorithm 3: T(N) = 2T(N/2) + cN
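The recurrences can also be evaluated directly, which is a handy way to check a guess for a closed form. The sketch below assumes c = 1, T(1) = 0, and N a power of 2. The problem specifies none of these, and the method names are hypothetical.

```java
class RecurrenceCheck {
    static final double C = 1.0;  // assumed value of the constant c

    // Algorithm 1: T(N) = 2T(N/2) + c, with T(1) = 0 assumed.
    static double t1(long n) {
        return n <= 1 ? 0 : 2 * t1(n / 2) + C;
    }

    // Algorithm 2: T(N) = T(N/2) + c, with T(1) = 0 assumed.
    static double t2(long n) {
        return n <= 1 ? 0 : t2(n / 2) + C;
    }

    // Algorithm 3: T(N) = 2T(N/2) + cN, with T(1) = 0 assumed.
    static double t3(long n) {
        return n <= 1 ? 0 : 2 * t3(n / 2) + C * n;
    }

    public static void main(String[] args) {
        for (int k = 10; k <= 20; k += 5) {
            long n = 1L << k;  // N = 2^k, so lg N = k
            System.out.printf("N = 2^%d   T1/N = %.4f   T2/lg N = %.4f   T3/(N lg N) = %.4f%n",
                    k, t1(n) / n, t2(n) / k, t3(n) / ((double) n * k));
        }
    }
}
```

Each ratio compares a recurrence with a candidate growth function; a ratio that settles toward a constant supports that candidate.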
(b) Algorithm 1: T(N) ~ c N
    Algorithm 2: T(N) ~ c log N
    Algorithm 3: T(N) ~ c N log N
- Suppose we wanted to simulate percolation in a cube with N sites on a side, with each site connected to
  its neighbors up, down, left, right, forward, and back. If we used WeightedQuickUnionUF, what would be
  the order of growth of the expected running time, as a function of N?
A level
- The code below operates on bacterial genomes of
approximately 1 megabyte in size.
int N = Integer.parseInt(args[0]);
String[] genomes = new String[N];
for (int i = 0; i < N; i++) {
    In gfile = new In("genomeFile" + i + ".txt");
    genomes[i] = gfile.readString();
}
for (int i = 1; i < N; i++) {
    for (int j = i; j > 0; j--) {
        if (genomes[j-1].length() > genomes[j].length())
            exch(genomes, j-1, j);
        else break;
    }
}
(a) What is the theoretical order of growth of the worst-case running time as a function
    of N?
(b) A table of running times for the program above is given below. Approximate the empirical
    running time in tilde notation as a function of N. Do not leave your answer in terms of
    logarithms.
    N     Time (s)
    1     0.15
    2     0.14
    4     0.19
    8     0.41
    16    0.85
    32    1.66
    64    3.38
(c) Explain any discrepancy between your answers to (a) and (b). Be as specific and
    detailed as possible.
Answers
- N^2. Reading in the genomes is linear time. The for loops are just
  insertion sort, which is N^2 in the worst case (where the break
  statement is never reached).
- ~ 0.05 N (OK if left in terms of a fraction)
- Multiple acceptable answers:
  - The input may not be a worst-case input (e.g., already sorted).
  - The time to read in 64 megabytes of genomes is much larger
    than the time to perform 64^2 string length compares and
    64^2 swaps of 8-byte references.
  - Partial credit: N is too small / time is too short.
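The empirical estimate can be reproduced from the table with doubling ratios. The sketch below is one way to do so, under the power law assumption T(N) ~ a N^b; the class and method names are hypothetical. Since N doubles from row to row, lg of the ratio of consecutive times estimates b, and the last row then gives a.

```java
class DoublingRatios {
    // Timing data from the table in the problem.
    static final int[] N = {1, 2, 4, 8, 16, 32, 64};
    static final double[] TIME = {0.15, 0.14, 0.19, 0.41, 0.85, 1.66, 3.38};

    // Estimate of the exponent b from rows k-1 and k (N doubles between rows).
    static double exponentEstimate(int k) {
        return Math.log(TIME[k] / TIME[k - 1]) / Math.log(2);
    }

    // Estimate of the coefficient a from row k, taking b = 1.
    static double linearCoefficient(int k) {
        return TIME[k] / N[k];
    }

    public static void main(String[] args) {
        for (int k = 1; k < N.length; k++) {
            System.out.printf("N = %2d   time = %.2f   lg ratio = %.3f%n",
                    N[k], TIME[k], exponentEstimate(k));
        }
        System.out.printf("a (from the last row, with b = 1) = %.4f%n",
                linearCoefficient(N.length - 1));
    }
}
```

The ratios for the smallest values of N are unreliable because fixed overhead dominates those timings, so the estimate leans on the largest rows.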