COS 425, Fall 2006: Problem Set 6

COS 425, Fall 2006 - Problem Set 6

Due at 1:30pm, Wednesday Nov. 29, 2006.

Reminders:

During the week of November 27: see Professor LaPaugh for a verbal progress report on your project. See the Project Page .
Wednesday December 6 through Friday December 8: Take-home Exam 2. (Open-book with the same rules as for Exam 1.)

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties

*Holiday special* No penalty if submitted by 5pm on Friday 12/1/06.
*But note* No credit if submitted later than the 5pm on Friday 12/1/06.

Problem 1. In Database Management Systems by Ramakrishnan and Gehrke, Chapter 14:
Part a: Exercise 14.4, pg. 476, Part 4.
Part b: Exercise 14.4, pg. 476, but modify the storage of R so that it supports an Alternative 1, clustered B+ tree index of order 50 on attribute a of R. Also assume that the B+ tree contains 1000 distinct key values. What is the cost of joining R and S using an index nested loop join?

Do not simply substitute values for the parameters (M, N, etc. ) in the equations given in class or in the text. These equations are for the average or typical case. You have information on sizes and the primary key of S. Make your analysis as precise as possible. Note: in class we have used the term "data record" or "record" to refer to a "tuple" as used by Ramakrishnan and Gehrke.

Problem 2.
Part a. In Database Management Systems by Ramakrishnan and Gehrke, the following algorithm is given for implementing R-S for relations R and S, which are union-compatible:

Sort R using the combination of all fields;
Sort S using the combination of all fields;
Scan the sorted R and S in parallel; write to the result those tuples of R that do not appear in S.

For each stage of the algorithm write its disk I/O cost and the number of buffers required. Your costs should be in terms of parameters:

M, the number of disk pages of R
N, the number of disk pages of S
p, the number of records per page of R and S
F, the number of buffer pages available in memory

Part b. Now suppose that S has an Alternative 2 hash index on attribute a, which is a candidate key for S. Propose an algorithm for R-S that uses the hash index and analyze its disk I/O cost and buffer use. Use the parameters given in Part a. Does it matter whether attribute a is a candidate key for R?

Part c. Suppose p=10, M=1000 N=20,000 and F=150. Which algorithm is more efficient? What change in parameter values would change which algorithm wins?

Problem 3. Assume that you have relation R with 500,000 tuples and 10 tuples per page. The following indexes exist:

an Alternative 1, clustered B+ tree index of order 50 and height 3 on attribute x of R containing 10,000 distinct key values;
an Alternative 2 (unclustered) B+ tree index of order 50 and height 2 on attribute y of R, where attribute y is a candidate key of R ranging uniformly between 0 and 999,999;
an Alternative 2 (unclustered) hash index on attribute z of R containing 20,000 distinct key values.

We wish to select all tuples of R with (R.x = a) AND (500 < R.y < 600) AND (R.z = b), where a and b are constants.

Part a. List the different algorithms that make use of one or more of the indexes for evaluating the selection query. Describe briefly how each index is used in each algorithm.
Part b. Without doing a detailed analysis, choose the algorithm from your answer to Part a that you believe is most efficient in terms of disk I/O cost. Justify your answer.
Part c. Do a detailed disk I/O cost analysis of the algorithm you have chosen in Part b.

Problem 4. Assume that you have relations R and S of Problem 1 above, with the same characteristics and the index of Part b. Assume that you also have a relation T with the same properties as relation R, but with no indexes built for it and that attribute c of T ranges over 3000 values. Assume that you have 52 buffers as in Problem 1. Consider the evaluation of the query

((R ◊◊_R.a=S.bS ) ◊◊_S.b=T.c T )

using the left-deep evaluation tree:

Part a. Calculate the disk I/O cost of evaluating the query using the sort-merge join algorithm for each of the joins.

Part b. Can pipelining be used to avoid writing to disk the intermediate result (R ◊◊_R.a=T.cT )? If so, describe how to manage the buffers during pipelining.
If not, why not?