COS 597A, Fall 2008 - Problem Set 6

Due at 3:00pm, Wednesday November 26, 2008.


Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties


The two problems below ask you to analyze the disk I/O cost, measured as the number of disk blocks read or written, of various algorithms for query evaluation.  As we did in class, your analysis will be in terms of:

|R| the number of disk blocks used to store R,
|S| the number of disk blocks used to store S,
F the number of blocks of main memory buffer,
nR the number of records in R,
nS the number of records in S. 

Do not simply substitute values for the parameters in the equations given in class or in the text.  Those equations are for the average or typical case.  In several parts of these problems you have information on relation sizes and on some primary keys; make your analysis as precise as possible.

Note that we speak of relations and tuples when presenting the queries - the relational database level of abstraction.  These are the same as files and records, respectively, viewed at the file level of abstraction.


Problem 1:
Part a.   The following is a common algorithm for implementing the set difference R − S for relations R and S, which are union-compatible:

   1. Sort R using the combination of all fields;
   2. Sort S using the combination of all fields;
   3. Scan the sorted R and S in parallel; write to the result those tuples of R that do not appear in S.

For each stage of the algorithm write its worst-case disk I/O cost and the minimum number of blocks of buffer required for arbitrary R and S.  The worst-case cost may be expressed as a function of the number of buffer blocks F.
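The parallel scan in step 3 can be sketched as follows. This is an illustrative in-memory model, assuming R and S are already sorted lists of tuples; the real algorithm streams sorted runs from disk block by block.

```python
def sorted_difference(R, S):
    """Emit the tuples of R that do not appear in S (R - S),
    given that both inputs are sorted on all fields."""
    result = []
    i, j = 0, 0
    while i < len(R):
        if j >= len(S) or R[i] < S[j]:
            result.append(R[i])   # R[i] can no longer appear in S
            i += 1
        elif R[i] == S[j]:
            i += 1                # matched in S: drop this R tuple
        else:                     # S[j] < R[i]: advance the S cursor
            j += 1
    return result
```

Because both inputs are sorted, each cursor only moves forward, so step 3 reads each relation once.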

Part b.  Now suppose that S has a secondary (unclustered) hash index on attribute A, which is a candidate key for S. Propose an algorithm for R − S that uses the hash index, and analyze its disk I/O cost and buffer use. Does it matter whether attribute A is also a candidate key for R?
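One possible shape for such an algorithm is sketched below. The hash index on S.A is simulated by a Python dict from A-value to the full S tuple (since A is a candidate key for S, each A-value maps to at most one tuple); the function name and tuple layout are illustrative, not part of the problem.

```python
def diff_with_index(R, S_index, a_pos=0):
    """R - S using a (simulated) hash index on S's candidate key A,
    where attribute A is at position a_pos in each tuple."""
    result = []
    for r in R:                       # one scan of R
        s = S_index.get(r[a_pos])     # probe the index on A
        if s != r:                    # no S tuple with this A-value,
            result.append(r)          # or the full tuples differ
    return result
```

Note that matching on A alone is not enough: the probe must fetch the S tuple and compare all fields, since two union-compatible tuples can agree on A but differ elsewhere.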

Part c.  Suppose |R|=1000,  |S|=20,000, nR=10,000, nS=200,000 and F=150. Which algorithm is more efficient? What change in parameter values would change which algorithm wins?


Problem 2:
Consider the evaluation of the query (R ⋈ S ⋈ T) for relations R, S, and T, where ⋈ denotes natural join.  Relations R, S, and T share a single attribute, A, and no pair of relations shares any other attribute.  R contains 50,000 tuples and has 5 tuples per block; S contains 100,000 tuples and has 10 tuples per block; T contains 1,000 tuples and has 25 tuples per block.  R has a primary (clustered) hash index on R.A.  S has a primary (clustered) B+ tree index on S.A, which is the primary key for S.  T is stored as a heap and has a secondary (unclustered) hash index on T.A.  Forty-five (45) buffer blocks are available in main memory for the evaluation of the query.
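The block counts implied by these figures follow directly from the tuple counts and blocking factors, and can be checked with a line of arithmetic:

```python
# Blocks used = number of tuples / tuples per block (from the statement)
blocks_R = 50_000 // 5     # R occupies 10,000 blocks
blocks_S = 100_000 // 10   # S occupies 10,000 blocks
blocks_T = 1_000 // 25     # T occupies 40 blocks, vs. 45 buffer blocks
```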

Part a:  Note that T can fit in the main memory buffer with 5 blocks to spare.  Given this, block nested loop becomes a more attractive algorithm.  If block nested loop is used for both joins, what order of joins minimizes disk I/O?  Why?  For each join, which is the inner and which the outer relation?  Calculate the disk I/O cost of the two block nested loop joins for the order you have chosen.
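For reference, the block nested loop access pattern can be sketched as below. This is an in-memory illustration (blocks as lists of tuples, hypothetical names) meant only to show why the choice of outer relation and the buffer size drive the I/O cost; it is not an I/O-accurate implementation.

```python
def block_nested_loop_join(outer_blocks, inner_blocks, buf_blocks,
                           a_out=0, a_in=0):
    """Join on the shared attribute A (at a_out / a_in in each tuple).
    For each buffer-load of outer blocks, scan all of the inner once.
    Two buffer blocks are reserved: one for inner input, one for output."""
    chunk = buf_blocks - 2
    result = []
    for start in range(0, len(outer_blocks), chunk):
        load = outer_blocks[start:start + chunk]       # fill the buffer
        outer_tuples = [t for blk in load for t in blk]
        for blk in inner_blocks:                       # one inner pass
            for s in blk:                              # per buffer-load
                for r in outer_tuples:
                    if r[a_out] == s[a_in]:
                        result.append(r + s)           # A kept twice here;
    return result                                      # natural join would merge it
```

The inner relation is read once per buffer-load of the outer, which is why a relation that fits entirely in the buffer (like T here) is read only once regardless of which side it is on.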

Part b:  Now consider evaluating the query by using hash join to calculate R ⋈ S first.

  1. Is either of the indexes on R and S useful?  Explain.
  2. After using the hash join, what join algorithm would you use for the second join?  Explain.
  3. Calculate the disk I/O cost for the method of evaluating (R ⋈ S ⋈ T) that results from your answers to 1 and 2.
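As a reminder of the technique, a one-pass hash join on A can be sketched as follows; this in-memory version applies when the build input fits in the buffer, while the on-disk (Grace) variant adds a partitioning pass over both relations first. Names and tuple layout are illustrative.

```python
def hash_join(R, S, a_r=0, a_s=0):
    """Hash join of R and S on the shared attribute A
    (at position a_r in R tuples and a_s in S tuples)."""
    table = {}
    for s in S:                                  # build phase on S
        table.setdefault(s[a_s], []).append(s)
    result = []
    for r in R:                                  # probe phase on R
        for s in table.get(r[a_r], []):
            result.append(r + s)                 # A appears twice here;
    return result                                # natural join would merge it
```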

Part c:  Have you used pipelining in Parts a and b?  If not, could you? Explain.