Analysis of Algorithms Study Guide

ANALYSIS OF ALGORITHMS STUDY GUIDE

Empirical analysis. If the running time of our program (approximately) obeys a power law T(n) ~ an^b, we can use a doubling hypothesis to estimate the coefficients a and b.

Tilde notation. We say that f(n) ~ g(n) if f(n)/g(n) converges to 1 as n gets large. This is a general concept about mathematical functions and is not restricted to running time, memory, or any other specific domain.

Cost model. For theoretical analyses of running time in COS 226, we will assume a cost model, namely that some particular operation (or operations) dominates the running time of a program. Then, we express the running time in terms of the total number of that operation as a function of the input size. To simplify things, we usually give this frequency count in tilde notation.

Order of growth. If we have two functions f(n) and g(n), and f(n) ~ c g(n) for some constant c > 0, we say the order of growth of f(n) is g(n). Typically g(n) is one of the following functions: 1, log n, n, n log n, n², n³, or 2ⁿ.

Worst-case order of growth isn't everything. Just because one algorithm has a better order of growth than other does not mean that it is faster in practice. We will encounter some notable counterexamples, including quicksort vs. mergesort.

Memory analysis. Know how to calculate the memory utilization of a class with the 64-bit memory model from the textbook.

Theoretical and empirical analysis. Hypotheses generated through theoretical analysis (or guesswork like our power law assumption) should be validated with data before being fully trusted.

Recommended Problems

C level

Textbook 1.4.4
Fall 2011 Midterm, #2

B level

Textbook 1.4.5
Spring 2012 Midterm, #1

For each of the functions shown, give the best order of growth of the running time.

 public static int f1 (int n) {
    int x = 0;
    for (int i = 0; i < n; i++)
        x++;
        return x;
    }

 public static int f2(int n) {
    int x = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i*i; j++)
           x++;
        return x;
    }

 public static int f3 (int n) {
    if (n <= 1) return 1;
    return f3(n-1) + f3(n-1)
 }

 public static int f4 (int n) {
    if (n <= 1) return 1;
    return f4(n/2) + f4(n/2);
 }

 public static int f5 (int n) {
    if (n <= 1) return 1;
    return f1(n) + f5(n/2) + f5(n/2);
 }

 public static void f6(int n) {
    // 1<<i is the same as 2^i.
    // Ignore integer overflow.
    // 1<<i takes constant time.
    for (int i = 0; i < n; i = 1 << i);
 }

 Answers


    f1 is Linear.
    f2 is N^3 because each iteration of the inner loop is
    quadratic in the outer loop variable.
    The simplest way to do this is to realize it is just the integration of i^2.
    f3 is 2^N. Each iteration spawns two iterations. Thus by the time we get to the bottom 
level(where n=1), we've produced 2! total calls of 3.
    f4 is linear. This is similar to the pattern that we saw in Mergesort and Quicksort, 
except that each recursive call does only a constant amount of work instead of a linear amount. 
It is the same as the pattern for bottom up heapification. At the top level,we do 1 unit of
    work; at the 2nd level,we do 2 units of work; at the 3rd level, we do 4 units, etc. The total
    amount of work is thus given by 1 + 2 + 4 + 8 + ? + ?. This sum is linear in N.
    f5 is N log N. This is the exact same pattern as Mergesort and Quicksort. If you want to think 
of it as a sum, then it's N + N + N + ...N, which are log(base 2)N  summands.
    f6 is log* N. After the first iteration, i = 2. After the second iteration, i = 2^2. After the third iteration, i = 2^2^2, etc.
    This takes Log*N steps to reach N. If you weren't totally sure, you could have also
    observed that Log*N was the only answer between constant and LogN.

Consider the following three algorithms:
1. Algorithm 1 solves problems of size N by recursively dividing them into 2 sub-problems of size N/2 and combining the results in time c (where c is some constant).
2. Algorithm 2 solves problems of size N by solving one sub-problem of size N/2 and peforming some processing taking some constant time c.
3. Algorithm 3 solves problems of size N by solving two sub-problems of size N/2 and performing a linear amount (i.e., cN where c is some constant) of extra work.
(a) For each algorithm, write down a recurrence relation showing how T(N), the running time on an instance of size N, depends on the running time of a smaller instance.
(b) For each recurrence relation, what is the running time for each T(N) (use tilde notation)?
Answers
```
  Algorithm 1: T(N) = 2T(N/2) +c
  Algorithm 2: T(N) = T(N/2) +c
  Algorithm 3: T(N) = 2T(N/2) +cN
```
Suppose we wanted to simulate percolation in a cube with N sites on a side, with each site connected to its neighbors up, down, left, right, forward, and back. If we used WeightedQuickUnionUF, what would be the order of growth of the expected running time, as a function of N?
Answers
```
  Algorithm 1: c N
  Algorithm 2: c log N
  Algorithm 3: c N log N
```

A level

The code below operates on bacterial genomes of approximately 1 megabyte in size.
```
    int N = Integer.parseInt(args[0]);
    String[] genomes = new String[N];
    for (int i = 0; i < N; i++) {
        In gfile = new In("genomeFile" + i + ".txt");
        genomes[i] = gfile.readString();
    }
    for (int i = 1; i < N; i++) {
        for (int j = i; j > 0; j--) {
            if (genomes[j-1].length() > genomes[j].length())
                exch(genomes, j-1, j);
            else break;
        }
    }
```
1. What is the theoretical order of growth of the worst case running time as a function of N?
2. A table of runtimes for the program above is given below. Approximate the empirical run time in tilde notation as a function of N. Do not leave your answer in terms of logarithms.
```
      N Time (s)
      1 0.15
      2 0.14
      4 0.19
      8 0.41
      16 0.85
      32 1.66
      64 3.38
      
```
3. Explain any discrepancy between your answers to (a) and (b). Be as specific and detailed as possible.
Answers
1. N2. Reading in the genomes is linear time. The for loops are just insertion sort, which is N2 in the worst-case (where the break statement never occurs).
2. 0.05N (ok if left in terms of a fraction)
3. Multiple acceptable answers:
  1. The input may not be a worstcase input (e.g. already sorted)
  2. The time to read in a 64 megabytes of genomes is much larger than the time to perform 64^2 string length compares and 64^2 swaps of 8 byte references.
  3. Partial credit: N is too small / time is too short