3.5   Advanced Sorting


This section under major construction.

Sorting algorithms and priority queues are critical in a broad variety of applications.

A prime reason why sorting is so useful is that it is much easier to search for an element in a sorted list than an unsorted one. Given someone's first and last name, it is easy to find their phone number using a phone book since the entries are sorted by last name. A digital music player organizes song files by artist name or song title; a search engine displays search results in descending order of importance; a spreadsheet displays columns sorted by a particular field; an RSS reader sorts news by date and time; a matrix-processing package sorts the real eigenvalues of a symmetric matrix in descending order. Other tasks are also made easier once a file is in sorted sorted: look up an entry in the index in the back of this book, find the median element, the closest pair, statistical outliers, remove duplicates in a mailing list, or detect double registrations in a voting precinct.

Sorting also arises as a critical subroutine in many applications that appear to have nothing to do with sorting at all including: data compression (see the Burrows-Wheeler programming assignment), computer graphics (convex hull, closest pair), computational biology (longest common substring discussed below), supply chain management (schedule jobs to minimize weighted sum of completion times), combinatorial optimization (Kruskal's algorithm), social choice and voting (Kendall's tau distance),

Historically, sorting was most important for commercial applications, but sorting also plays a major role in the scientific computing infrastructure. NASA and the fluids mechanics community use sorting to study problems in rarefied flow; these collision detection problems are especially challenging since they involve ten of billions of particles and can only be solved on supercomputers in parallel. Similar sorting techniques are used in some fast N-body simulation codes. Another important scientific application of sorting is for load balancing the processors of a parallel supercomputers. Scientists rely on clever sorting algorithm to perform load-balancing on such systems.

Load balancing.

Suppose that you have N identical processors and M job to complete. Task j requires tj seconds of processing time. Your goal is the schedule all of the tasks so that the time at which the last job completes is as early as possible. List scheduling = consider jobs in arbitrary order and assign job j to machine that is least busy. The LPT rule (longest processing time first) is to consider the jobs in descending order of time, and assign a job to the machine that becomes available first. Although this algorithm does not necessarily yield an optimal schedule (It is NP-hard so we do not expect a practical solution. See Section xyz), the resulting solution is guaranteed to be within 33% of the best possible. To implement this algorithm, we first sort the jobs. Then we maintain a priority queue of N elements, corresponding to each processor, where the priority is the sum of the processing times. At each step, we delete the processor with the minimum priority, add the next job to the processor, and re-insert that processor into the priority queue. Program Processor.java implements this strategy.

Comparable. Date.java implements a simple date ADT using month, day, and year. It implements the Comparable interface using the natural ordering on dates.

Comparator. What happens when we want a data type to have more than one ordering, depending on the application? For example, we might have a Person data type which we might want to sort by last name, age, bank account, height, or IQ. In such situations, you can define a class or subclass that implements the Comparator interface.

interface Comparator<Key> {
    int compare(Key x, Key y);
}
It is easy to modify our sorting routines to work with comparators. Program Insertion.java is a revised version of insertion sort that works with comparators.

Transaction. Data type for business transactions (date, account number, name). Want to be able to sort by either of the three fields. Program Transaction.java represents a transaction using an id (long), name (String) and date (Date).

Program Customer.java is a data type for customers.

Java uses a similar strategy in its String library. The compareTo function is case-sensitive. If you want to sort with a case-insensitive comparison function, you can use String.CASE_INSENSITIVE_ORDER. This is a Comparator that use the compareToIgnoreCase method.

System sort. Java provides a sysem sort for sorting primitive types and objects that implement the Comparable interface. It relies on two classic sorting algorithms - quicksort and mergesort - which we will learn about in Sections X and Y. Program SystemSort.java illustratese how to use it. The algorithm for sorting strings (and other objects) is a tuned version of mergesort. It is guaranteed to be stable and run in time proportional to N log N. Program SystemSort.java illustrates how to use it. The algorithm for primitive types is a variant of 3-way quicksort developed by Bentley and McIlroy. It is extremely efficient for most inputs that arise in practice, including inputs that are already sorted. However, using a clever techinique described by M. D. McIlroy in A Killer Adversary for Quicksort, it is possible to construct pathological inputs that make the system sort run in quadratic time. Even worse, it is blows the function call stack. To see Java's sorting library break, here are some killer inputs of varying sizes: 10,000, 20,000, 50,000, 100,000, 250,000, 500,000, and 1,000,000. You can test them out using the program IntegerSort.java which takes a command line input N, reads in N integers from standard input, and sorts them using the system sort.

Java implements its sorting libraries twice, once using Comparable objects and once for dealing with Comparator. For example, you can sort an array s of strings using Java's case insensitive comparator function via Arrays.sort(s, String.CASE_INSENSITIVE_ORDER). We illustrate below how to implement bubblesort using a Comparator, but limit our future discussions to Comparable objects, noting that it is straightforward to add this additional functionality.

Java priority queue library implementation.

Program SystemPQ.java implements our min priority queue interface using the PriorityQueue library in Java 1.5.

Stability.

In-place.

In-place = O(1) extra space. In-situ = O(log N) extra space.

Comparison of sorting algorithms.

Sorting summary.

Computational complexity.

N log N lower bound under decision tree model of computation. Applies to randomized algorithms.


Molecular Dynamics Simulation of Hard Spheres

Simulate the motion of N colliding particles according to the laws of elastic collision using event-driven simulation. Such simulations are widely used in molecular dynamics (MD) to understand and predict properties of physical systems at the particle level. This includes the motion of molecules in a gas, the dynamics of chemical reactions, atomic diffusion, sphere packing, the stability of the rings around Saturn, the phase transitions of cerium and cesium, one-dimensional self-gravitating systems, and front propagation. The same techniques apply to other domains that involve physical modeling of particle systems including computer graphics, computer games, and robotics. We will revisit some of these issues again in Chapter 7.

a good reference

Hard sphere model. The hard sphere model (billiard ball model) is an idealized model of the motion of atoms or molecules in a container. We focus on the two-dimensional version called the hard disc model. The salient properties of this model are listed below.

This simple model plays a central role in statistical mechanics, a field which relates macroscopic observables (e.g., temperature, pressure, diffusion constant) to microscopic dynamics (motion of individual atoms and molecules). Maxwell and Boltzmann used the model to derive the distribution of speeds of interacting molecules as a function of temperature; Einstein used it to explain the Brownian motion of pollen grains immersed in water.

Simulation. There are two natural approaches to simulating the system of particles.

Collision prediction. We review the formulas that specify if and when a particle will collide with the boundary or with another particle, assuming all particles travel in straight-line trajectories.

Collision resolution. In this section we present the physical formulas that specify the behavior of a particle after an elastic collision with a reflecting boundary or with another particle. For simplicity, we ignore multi-particle collisions. There are three equations governing the elastic collision between a pair of hard discs: (i) conservation of linear momentum, (ii) conservation of kinetic energy, and (iii) upon collision, the normal force acts perpendicular to the surface at the collision point. Physics-ly inclined students are encouraged to derive the equations from first principles; the rest of you may keep reading.

Particle collision simulation in Java. Our implementation involves the following data types: MinPQ.java, Particle.java, Event.java, and CollisionSystem.

Data files. We use the following data format. The first line contains the number of particles N. Each of the remaining N lines consists of 6 real numbers (position, velocity, mass, and radius) followed by three integers (red, green, and blue values of color). You may assume that all of the position coordinates are between 0 and 1, and the color values are between 0 and 255. Also, you may assume that none of the particles intersect each other or the walls.

N                            
rx ry vx vy mass radius r g b
rx ry vx vy mass radius r g b
rx ry vx vy mass radius r g b
rx ry vx vy mass radius r g b
Here are some sample data files.

Q + A

Q. Why doesn't the Java library use a randomized version of quicksort?

A. Good question. At the very least, the library should cutoff to some guaranteed N log N algorithm if it "realizes" it is in trouble. Programmers may want their libraries to be deterministic for debugging. But the library only uses quicksort for primitive types when stability is not an issue, so the programmer probably couldn't observe the randomness, except in the running time.

Q. Why does the Java library use mergesort for sorting objects and quicksort for sorting primitive types?

A. With objects, stability might be desirable or expected. With primitive types, stability is irrelevant since there is no associated data.

Q. What's the best way to create comparators for non-English alphabets, e.g., for accents, umlauts, and pre-composed character like ch in Spanish?

A. Use Java's Collator library. For example in UNICODE, Rico occurs lexicographically before Réal, but in French, Réal occurs first.

import java.util.Arrays;
import java.text.Collator;
...
Arrays.sort(words, Collator.getInstance(Locale.FRENCH));

Q. Any specialized data structures for dealing with colliding particle type simulations?

A. Google for kinetic heaps or kinetic data structures. (Guibas)

Exercises

  1. Suppose that you have a file that is sorted, except for two or three items. Which of the sorting method (insertion sort, selection sort, mergesort, quicksort) would you use and why?
  2. Consider the following implementation of the compareTo method for String. How does the third line help with efficiency?
    public int compareTo(String t) {
       String s = this;
       if (s == t) return 0;
       int n = Math.min(s.length(), t.length());
       for (int i = 0; i < n; i++) {
          if      (s.charAt(i) < t.charAt(i)) return -1;
          else if (s.charAt(i) > t.charAt(i)) return +1;
       }
       return s.length() - t.length();
    }
    

    Answer: avoid directly comparing all n characters if s and t are references to the same string.

  3. What is the flaw with the following implementation of the Comparable interface for account balances.
    public class Customer implements Comparable<Customer> {
        private String name;
        private double balance;
        
        public int compareTo(Customer b) {
            Customer a = this;
            if (a.balance < b.balance - 0.005) return -1;
            if (a.balance > b.balance + 0.005) return +1; 
            return 0;
       }
    }
    

    Answer: violates the Comparable contract. It is possible that a.compareTo(b) and b.compareTo(c) are both 0, but a.compareTo(c) is positive or negative.

  4. What is the flaw with the following implementation of the Comparable interface for 32-bit int account balances?
    public class Customer implements Comparable<Customer> {
        private String name;
        private int balance;
        
        public int compareTo(Customer b) {
            Customer a = this;
            return a.balance - b.balance;
       }
    }
    

    Answer: the subtraction based comparator is an idiom that dates bake to Unix. However, it can fail since the difference between two 32-bit integers may not fit in a 32-bit int. Unless you know that the integers lie in a small range, it's safer to use

        public int compareTo(Customer b) {
            Customer a = this;
            if (a.balance < b.balance) return -1;
            if (a.balance > b.balance) return +1;
            return 0;
       }
    

    Given n time stamps when file is requested from web server, find largest interval of time at which no file arrives. Solution: sort by time stamp. Scan sorted list to identify maximum gap.

  5. Case insensitive order. Write a program to read in a sequence of strings and sort them in ascending order, ignoring case.
    String[] a = new String[N];
    for (int i = 0; i < N. i++) {
       a[i] = StdIn.readString();
    }
    Arrays.sort(a, String.CASE_INSENSITIVE_ORDER);
    

  6. Grades. Write a program Grade.java to represent a data type for grades (A, B+, etc.). It should implement the Comparable interface using the natural ordering on grades by gpa.
  7. Students. Write an ADT Student.java that represents a student in a college course. Each student should have a login (String), a section number (integer), and a grade (Grade). Use comparators to support sorting by any of the three fields as with the Transaction ADT.
  8. Sort by reverse domain. Write a program Domain.java to read in a list of domain names from standard input, and print the reverse domains in sorted order. For example, the reverse domain of cs.princeton.edu is edu.princeton.cs. This is useful for web log analysis. Hint: use s.split("\\.") to split the string s into tokens, delimited by dots. Then write an appropriate compareTo method.
  9. Software version numbers. Create an ADT Version that represents a software version number, such as 115.1.1, 115.10.1, 115.10.2. Implement the Comparable interface so that 115.1.1 is less than 115.10.1, and so forth.
  10. Case insensitive comparator. Implement your own version of the comparator String.CASE_INSENSITIVE_ORDER.
    public class CaseInsensitive implements Comparator<String> {
       public int compare(String a, String b) {
          return a.compareToIgnoreCase(b);
       }
    }
    
  11. Descending order order string comparator. Implement a comparator that sorts string in descending order instead of ascending order.
    public class Descending implements Comparator<String> {
       public int compare(String a, String b) {
          return -a.compareToIgnoreCase(b);
       }
    }
    
    Alternatively, you can use Collections.reverseOrder(). It returns a Comparator that imposes the reverse of the natural ordering of objects that implement the Comparable iterface.
  12. Spam campaign. To initiate an illegal spam campaign, you have a list of email addresses from various domains (the part of the email address that follows the @ symbol). To better forge the return addresses, you want to send the email from another user at the same domain. For example, you might want to forge an email from wayne@princeton.edu to rs@princeton.edu. How would you process the email list to make this an efficient task?
  13. California runoff election for governor. In order to thwart bias against candidates whose names appear toward the end of the alphabet, California sorted the candidates appearing on a ballot in the following order:
    R W Q O J M V A H B S G Z X N T C I E K U P D Y F L
    

    Write a program California.java that sorts strings according to this ordering. Try it on this list of candidates. Assume that each string is comprised solely of uppercase letters. Hint: Use a Comparator.

  14. Given an array of N integers, the 2-sum problem is to find a pair of integers whose sum is closest to zero. Describe an O(N log N) algorithm for the problem. Solution: sort by absolute value - the best pair is now adjacent.
  15. Given an array of N integers, the 3-sum problem is to determine find three integers whose sum is closest to zero. Design an algorithm that uses O(N2 log N) time and O(N) space. Hint: sort and binary search.
  16. Design an algorithm for the 3-sum problem that runs in O(N2) time and O(N) space. Hint: sort and clever search. Remark: quadratic lower bound in restricted decision tree model where each decision is based on the sign of an arbitrary linear combination of 3 or fewer inputs. (3-SUM hard.)
  17. For some fixed linear equation in 3 variables (say with integer coefficients), given N numbers, do any 3 of them satisfy the equation? Design a quadratic algorithm for the problem.
  18. Kendall's tau distance. Given two permutations, Kendall's tau distance is the number of pairs out of position. Write a program KendallTau.java that computes the the Kendall tau distance between two permutations of size N in O(N log N) time. Useful in top-k lists, social choice and voting theorey, comparing genes using expression profiles, and ranking search engine results.
  19. Antipodal points. Given N points on a circle, centered at the origin, design an algorithm that determines whether there are two points that are antipodal, i.e., the line connecting the two points goes through the origin. Your algorithm should run in time proportional to N log N.
  20. Antipodal points. Repeat the previous question, but assume the points are given in clockwise order. Your algorithm should run in time proportional to N.
  21. Idle time. Suppose that a parallel machine processes n jobs. Job j is processed from sj to tj. Given the list of start and finish times, find the largest interval where the machine is idle. Find the largest interval where the machine is non-idle.

    Simple polygon. Given N points in the plane, draw a simple polygon with the N points as vertices. Hint: find the point p with the smallest y coordinate, breaking ties with the smallest x coordinate. Connect the points in increasing order of the polar angle they make with p.

  22. Particle collision game. Implement the game Particles: you control a red ball with a mouse, trying to avoid colliding with blue balls that behave according to the laws of elastic collision. After a certain interval of time, introduce a new blue ball at random.
  23. Brownian motion. In 1827, the botanist Robert Brown observed the motion of wildflower pollen grains immersed in water using a microscope. He observed that the pollen grains were in a random motion, following what would become known as Brownian motion. This phenomenon was discussed, but no convincing explanation was provided until Einstein provided a mathematical one in 1905. Einstein's explanation: the motion of the pollen grain particles was caused by millions of tiny molecules colliding with the larger particles. He gave detailed formulas describing the behavior of tiny particles suspended in a liquid at a given temperature. Einstein's explanation was intended to help justify the existence of atoms and molecules and could be used to estimate the size of the molecules. Einstein's theory of Brownian motion enables engineers to compute the diffusion constants of alloys by observing the motion of individual particles. Here's a flash demo of Einstein's explanation of Brownian motion from here.
  24. Free path and free time. Free path = distance a particle travels between collisions. Plot histogram. Mean free path = average free path length over all particles. As temperature increases, mean free path increases (holding pressure constant). Compute free time length = time elapsed before a particle collides with another particle or wall.
  25. Collision frequency. Number of collisions per second.
  26. Root mean-square velocity. Root mean-square velocity / mean free path = collision frequency. Root mean square velocity = sqrt(3RT/M) where molar gas constant R = 8.314 J / mol K, T = temperature (e.g., 298.15 K), M = molecular mass (e.g., 0.0319998 kg for oxygen).
  27. Maxwell-Boltzmann distribution. Distribution of velocity of particles in hard sphere model obey Maxwell-Boltzmann distribution (assuming system has thermalized and particles are sufficiently heavy that we can discount quantum-mechanical effects). Distribution shape depends on temperature. Velocity of each component has distribution proportional to exp(-mv_x^2 / 2kT). Magnitude of velocity in d dimensions has distribution proportional to v^(d-1) exp(-mv^2 / 2kT). Used in statistical mechanics because it is unwieldy to simulate on the order of 10^23 particles. Reason: velocity in x, y, and z directions are normal (if all particles have same mass and radius). In 2d, Rayleigh instead of Maxwell-Boltzmann.
  28. Pressure. Main thermodynamic property of interest = mean pressure. Pressure = force per unit area exerted against container by colliding molecules. In 2d, pressure = average force per unit length on the wall of the container.
  29. Temperature. Plot temperature over time (should be constant) = 1/N sum(mv^2) / (d k), where d = dimension = 2, k = Boltzmann's constant.
  30. Diffusion. Molecules travel very quickly (faster than a speeding jet) but diffuse slowly because they collide with other molecules, thereby changing their direction. Two vessels connected by a pipe containing two different types of particles. Measure fraction of particles of each type in each vessel as a function of time.
  31. Time reversibility. Change all velocities and run system backwards. Neglecting roundoff error, the system will return to its original state!
  32. Maxwell's demon. Maxwell's demon is a thought experiment conjured up by James Clerk Maxwell in 1871 to contradict the second law of thermodynamics. Vertical wall in middle with molecule size trap door, N particles on left half and N on right half, particle can only go through trap door if demon lets it through. Demon lets through faster than average particles from left to right, and slower than average particles from right to left. Can use redistribution of energy to run a heat engine by allowing heat to flow from left to right. (Doesn't violate law because demon must interact with the physical world to observe the molecules. Demon must store information about which side of the trap door the molecule is on. Demon eventually runs out of storage space and must begin erasing previous accumulated information to make room for new information. This erasing of information increases the entropy, requiring kT ln 2 units of work.)
  33. Metric space. Extend to support spheres and hyperplanes in an arbitray metric space.
  34. Cell method. Useful optimization: divide region into rectangular cells. Ensure that particles can only collide with particles in one of 9 adjacent cells in any time quantum. Reduces number of binary collision events that must be calculated. Downside: must monitor particles as they move from cell to cell.
  35. Multi-particle collisions. Handle multi-particle collisions. Such collisions are important when simulating the break in a game of billiards.
  36. Sorting parallel arrays. When sorting parallel arrays, it is useful to have a version of sorting routine that returns a permutation, say index[], of the indices in sorted order. Add a method indirectSort to Insertion.java that takes an array of Comparable objects a[] as input, but instead of rearranging the elements of a[], returns an integer array index[] so that a[index[0]] through a[index[N-1]] are the elements in ascending order.
  37. Inplace permutation. Write a program Permutation.java that includes functions that take an array and a permutation (or inverse permutation) and rearranges the elements in the array according to the permutation (or inverse permutation). Do it in-place: use only a constant amount of extra memory.
  38. List files. Write a program Files.java that takes the name of a directory as a command line input and prints out all of the files in the current directory, sorted by filename. Hint: use the File ADT.
  39. List files. Write comparators for the type File to order by increasing/decreasing order of file size, ascending/descending order of filename, and ascending/descending order of last modification date. Use these comparators in a program LS that takes a command line argument and list the files in the current directory According to that order, e.g., "-t" to sort by timestamp. Support multiple flags to break ties. Be sure to use a stable sort.
  40. Ticket ranges. Given a list of ticket seats of the form A1, A2, A11, A10, B7, B9, B8, B3, find the largest non-empty block of adjacent seats, e.g., A3-A9.
  41. Smith's rule. The following problem arises in supply chain management. You have a bunch of jobs to schedule on a single machine. (Give example.) Job j requires p[j] units of processing time. Job j has a positive weight w[j] which represents its relative importance - think of it as the inventory cost of storing the raw materials for job j for 1 unit of time. If job j finishes being processed at time t, then it costs t * w[j] dollars. The goal is to sequence the jobs so as to minimize the sum of the weighted completion times of each job. Write a program SmithsRule.java that reads in a command line parameter N and a list of N jobs specified by their processing time p[j] and their weight w[j], and output an optimal sequence in which to process their jobs. Hint: Use Smith's rule: schedule the jobs in order of their ratio of processing time to weight. This greedy rule turns out to be optimal.
  42. Contiguous sum. Given a list of real numbers and a target value V, find a contiguous block (of any length) whose sum is as close to K as possible.

    Brute force: compute the sum of each contiguous block by brute force. This takes O(N^3) time.

    Partial sums: compute all partial sums s[i] = a[0] + a[1] + ... + a[i] so that contiguous blocks have a sum of the form s[j] - s[i]. This takes O(N^2) time.

    Sort and binary search: form the partial sums as above and then sort them in ascending order. For each i, binary search for the s[j] that is as close to s[i] as possible. This takes O(N log N) time.

  43. Rhyming words. For your poetry class, you would like to tabulate a list of rhyming words. A crude way to accomplish this task is as follows:
    • Read in a dictionary of words into an array of strings.
    • Reverse the letters in each word, e.g., confound becomes dnuofnoc.
    • Sort the resulting array of words.
    • Reverse the letters in each word back to their original state.

    Now the word confound will be next to words like astound and compound. Write a program Rhymer.java that reads in a sequence of words from standard input and prints them out in the order specified above.

    Now repeat, but use a customized Comparator that sorts lexicographically from right-to-left.

Exercises

  1. Give an O(N log N) algorithm for computing the median of a sequence of N integers. Hint: sort first.
  2. Give an O(N log N) algorithm for computing the mode (value that occurs most frequently) of a sequence of N integers.
  3. Given a sequence of N real numbers, find the pair of integers that are closest in value. Give a O(N log N) algorithm.
  4. Given a sequence of N real numbers, find the pair of integers that are farthest apart in value. Give a O(N) algorithm.
  5. Given a sorted list of N integers and a target integer x, determine in O(N) time whether there are any two that sum to exactly x.
  6. Given two sorted lists of size N1 and N2, find the median of all elements in O(log N) time where N = N1 + N2.
  7. Give a O(N) algorithm for sorting a sequence of N integers that are between 0 and 99.
  8. Give a O(N) algorithm for computing the median of a sequence of N integers that are between 0 to 99.
  9. Give N points in the plane with integer coordinates, describe how to enumerate all points that are included two or more times in O(N log N) time. Solution: Sort using x-coordinate as primary key and y-coordinate as secondary key. Now, all duplicates are consecutive.
  10. A sequence of N consecutive insert into heap operations takes O(N) comparisons. Explain why its not the case that a sequence of N consecutive delete-max operations takes O(N) comparisons. Solution: would violate Theta(N log N) sorting lower bound.
  11. An article on Slashdot announces a new comparison-based priority queue that performs insert and delete-max in O(sqrt(log n)). Explain why this is impossible. Solution: would imply a O(n sqrt(log n)) comparison-based sorting algorithm.
  12. Suppose you have a sequence of N elements, with at most log N distinct ones. Describe how to sort them in O(N log log N) time.
  13. Element distinctness. Given N elements x1, ..., xn from a totally ordered universe, is there a pair i ≠ j such that xi = xj? Theta(N log N) lower bound in algebraic decision tree model. O(N log N) easy by sorting. Solution: Given N distinct values. Let ai be ith smallest element. In any comparison based algorithm for the problem, we must compare ai against ai-1; otherwise the algorithm could not distinguish between the case ai-1 < ai and ai-1 = ai. Consider two different permutations of the N elements. There is an element ai where ai-1 precedes ai in the first permutation but ai-1 precedes ai in the second. Since every comparison based algorithm must compare ai and ai-1, the algorithm runs differently on the two permutations. Thus, there are at leats N! leaves. reference
  14. Element distinctness. Given N elements, prove that determining whether any two repeated takes at least N log N comparisons. Assume that you can only access the elemnts throught the less() function. Solution: suppose the elements are all distinct and a_1 < a_2 < ... < a_N. Any correct algorithm for element distinctness must compare a_i with a_i+1 for each i < N. If not, then the algorithm would produce the same output if we changed a_i+1 to a_i (but this would change the answer from no duplicates to has duplicates). The set of comparisons the algorithm uses forms a DAG. Find the total order (in linear time) and this gives the sorted order. Thus, the algorithm can be used for sorting distinct elements and the sorting lower bound applies.
  15. Coin distinctness. Given N coins and a balance scale, determine if the coins all weigh different amounts. Prove that any algorithm that solves the problem has complexity at least N log N. Hint: you may use the fact that given N real numbers, element distinctness requires at least N log N comparisons in the linear decision tree model (where you can make comparisons between arbitrary linear combinations of the N elements, e.g, x1 + 2x2 < x3).
  16. Existence symbol table. Suppose you had a comparison-based algorithm that supports insert and exists in O(1) comparisons per operation. Explain why this is impossible in this model of computation. Solution: to solve element distinctness in O(N) time, for each of the N elments, check whether it's already in the data structure (if so, then this is a duplicate); if not, insert it.
  17. Lower bound for closest pair. Theta(N log N) bound for closest pair. Solution: could solve element distinctness problem in sub-lineararithmic time if had a sub-lineararithmic algorithm by finding closest pair and checking if they are equal.
  18. Lower bound for mode. Theta(N log N) bound for mode. Solution: could solve element distinctness problem by finding mode and checking whether it is more than one.
  19. Majority. Give a sequence of N elements, design an O(N) algorithm to find the majority (element that appears more than N/2 times) or determine that none exists. Solution: if a and b are two elements and a != b, then remove both of them; majority still remains. Use N-1 comparisons to find candidate for majority; use N-1 comparisons to check if candidate really is a majority. [Could solve with linear time median finding, but this is much simpler.]
  20. Set equality or set inclusion. Given two sets of elements S and T, does S = T or is S a subset of T?
  21. Set disjointness. Given two sets S and T, does S intersect T equal the empty set? Give an O(N log N) algorithm and a matching lower bound.
  22. Set disjointness. Given two sets of integers S and T, each in ascending order, does S intersect T equal the empty set? Prove an Omega(N log N) lower bound or given an O(N) algorithm.
  23. Lower bound for merging two lists. Show any comparison-based algorithm for merging two lists of size N requires at least 2N-1 comparisons in the worst case. Solution: lower bound holds even if all elements are distinct. If the ith and (i+1)st smallest elements are in different lists, then they must be compared.
  24. Lower bound for binary search. log(N+1) comparisons required. Solution: lower bound holds even if all elements are distinct....
  25. Euclidean TSP and MST lower bounds. Given an Omega(N log N) for the Euclidean TSP or Euclidean MST (requires algebraic decision trees to make sense since you want to allow a reasonable algorithm to compute distances). Solution: choose N points on a line. Optimal tour must visit the points in ascending order.
  26. Nearly sorted. Given an array of N elements, each which is at most k positions from its target position, devise an algorithm that sorts in O(N log k) time.

    Solution 1: divide the file into N/k pieces of size k, and sort each piece in O(k log k) time, say using mergesort. Note that this preserves the property that no element is more than k elements out of position. Now, merge each block of k elements with the block to its left.

    Solution 2: insert the first k elements into a binary heap. Insert the next element from the array into the heap, and delete the minimum element from the heap. Repeat.

  27. Sorting with duplicates. There are at least N! / (n1!) (n2!) ... (nk!) leaves; thus lower bound is log(N! / (n1!) (n2!) ... (nk!)). Give matching upper bound??? partial reference. Mergesort where you remove duplicates as soon as they are discovered. How many of the ni copies of element i remain after j merge passes?
  28. Smallest and second smallest. Describe how to find both the smallest and second smallest element using at most N + log N comparisons. Solution: divide the elements into pairs and compare two elements in each pair. Recur with the N/2 winners from each pair. After N-1 comparisons, we have the minimum element. Note that each element is compared with at most log N other elements. In particular, the smallest element is compared with at most log N other elements, and one of these must be the second smallest. If you keep track of all the comparisons, you can remember the log N involving the smallest element and compare these.
  29. Smallest and largest. Describe how to find both the smallest and largest elements using at most ceil(3n/2) - 2 comparisons. Answer: divide the elements into pairs and compare two elements in each pair. Put the smallest elements in A and the largest in B. If n is odd, put element n in both A and B. This requires floor(n/2) comparisons. Now directly compute the minimum in A (ceil(n/2) - 1 comparisons) and the maximum in B (ceil(N/2) - 1) comparisons. [In fact, this is best possible.]
  30. Counting inversions. Prove a Theta(N log N) lower bound. This is an open research question.
  31. Nuts and bolts. Prove a Theta(N log N) lower bound for the nuts and bolts problem in the quicksort section. [A matching worst case upper bound of O(N log N) is known, but it is remarkably complicated.]
  32. Sorting a linked list. Given a singly linked list of N elements, how could you sort it in guaranteed O(N log N) time, stably, and with O(1) extra space?