COS 597A, Fall 2011: Problem Set 5

COS 597A, Fall 2011 - Problem Set 5

Due at 3:00pm, Wednesday November 30, 2011

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties

5% of the earned score if submitted after class but by 5:45pm the day due.
20% of the earned score if submitted by 5pm on Friday 12/2/11.
40% of the earned score if submitted by 5pm on Monday 12/5/11.
No credit if submitted later than the 40% penalty deadline.

Problem 1:
Analyze the number of page reads and page writes required to delete a value from a B+ tree leaf in the case that the deletion requires merges up the tree until the root is eliminated and the B+ tree height is reduced by 1. Include the page reads necessary to find the leaf; assume the height of the B+ tree is 3. Your analysis should give a precise number of page reads and page writes, e.g "5". Don't forget that in a B+ tree, a linked list of the leaves is maintained. Use the pseudo-code in Figure 10.15 of Ramakrishnan & Gehrke as the specification of the algorithm. Wherever the algorithm is too vague and the details will affect your answer, make an assumption and state it. Make sure the source of each read or write in your analysis is clear.

Problem 2:
The following is a common algorithm for implementing R-S for relations R and S, which are union-compatible:

   1. Sort R using the combination of all fields;
   2. Sort S using the combination of all fields;
   3. Scan the sorted R and S in parallel; write to the result those tuples of R that do not appear in S.

Let M be the number of pages for R and N the number of pages for S.   If the number of buffer pages available to the algorithm is at least max(sqrt(M), sqrt(N))+1, this algorithm costs 5(M+N) disk page reads/writes.

Part a. Suppose that S has a secondary (unclustered) hash index on attribute A, which is a candidate key for S. Propose an algorithm for R-S that uses the hash index and analyze its disk I/O cost and buffer use. Express the worst-case cost as a function of some or all of the parameters M, N, n_R, n_S, and F, where M is the number of pages for R, N is the number of pages for S, n_R is the number of tuples in R,n_S is the number of tuples in S, and F is the number of pages in the buffer. You may add additional parameters if you need them. Does it matter whether attribute A is a candidate key for R?

Part b. Suppose M=1000, N=20,000, n_R=10,000, n_S=200,000 and F=150. Which algorithm is more efficient, the algorithm that sorts R and S and scans in parallel or your algorithm of Part a?   Note that ceiling(sqrt(20,000)+1) = 143 < F. What change in parameter values would change which algorithm wins?

Problem 3:
Consider the evaluation of query (R ◊◊ S ◊◊ T ) for relations R, S, and T, where ◊◊ denotes natural join. Relations R, S and T share a single attribute, A, and no pair of relations shares any other attributes. R contains 50,000 tuples and has 5 tuples per page; S contains 100,000 tuples and has 10 tuples per page; T contains 1000 tuples and has 25 tuples per page. R has a primary (clustered) hash index on R.A. S has a primary (clustered) B+ tree index on S.A, which is the primary key for S. T is stored as a heap and has a secondary (unclustered) hash index on T.A. Forty-five (45) buffer pages are available in main memory for the evaluation of the query.

Part a: Note that T can fit in the main memory buffer with 5 pages to spare. Given this, the block nested loop algorithm becomes a more desirable algorithm. If block nested loop is used for both joins, what is the best order to do the joins in terms of minimizing disk I/O? Why? For each join calculation, which is the inner and which the outer relation? Calculate the disk I/O cost of the two block nested loop joins for the order you have chosen.

Part b: Now consider evaluating the query by using hash join to calculate R ◊◊ S first.

Are either of the indexes on R and S useful? Explain.
After using the hash join, what join algorithm would you use for the second join? Explain.
Calculate the disk I/O cost for the method of evaluating (R ◊◊ S ◊◊ T ) that results from your answers to i and ii.

Part c: Have you used pipelining in Parts a and b? If not, could you? Explain.