COS 597A, Fall 2011 - Problem Set 5
Due at 3:00pm, Wednesday November 30, 2011
Collaboration Policy
You may discuss problems with other students in the class. However,
each student must write up his or her own solution to each problem
independently. That is, while you may formulate the solutions to
problems in collaboration with classmates, you must be able to
articulate the solutions on your own.
Late Penalties
- 5% of the earned score if submitted after class but by 5:45pm on the due date.
- 20% of the earned score if submitted by 5pm on Friday 12/2/11.
- 40% of the earned score if submitted by 5pm on Monday 12/5/11.
- No credit if submitted later than the 40% penalty deadline.
Problem 1:
Analyze the number of page reads and page writes required to
delete a value from a B+ tree leaf in the case that the deletion
requires merges up the tree until the root is eliminated and the B+
tree height is reduced by 1. Include the page reads necessary to
find the leaf; assume the height of the B+ tree is 3. Your
analysis should give a precise number
of page reads and page writes, e.g., "5". Don't forget that
in a B+ tree, a linked list of the leaves is maintained. Use the pseudo-code in
Figure 10.15 of Ramakrishnan &
Gehrke as the
specification of the algorithm. Wherever the algorithm is
too vague and the details would affect your answer, make an
assumption and state it. Make sure the source of each read or write
in your analysis is clear.
Problem 2:
The following is
a common algorithm for implementing the set difference R-S for
relations R and S, which are union-compatible:
1. Sort R using the combination of all fields;
2. Sort S using the combination of all fields;
3. Scan the sorted R and S in parallel; write to the
result those tuples of R that do not appear in S.
Let M be the number of pages for R
and N the
number of pages for S. If the number of buffer pages
available to the algorithm is at least max(sqrt(M), sqrt(N))+1, this
algorithm costs 5(M+N) disk page reads/writes.
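A minimal in-memory sketch of step 3's parallel scan (assuming both inputs are already sorted on all fields; the function name and list-based representation are illustrative only, not part of the problem):

```python
def sorted_difference(r_sorted, s_sorted):
    """Parallel scan of two sorted tuple lists; emit tuples of R absent from S."""
    result = []
    j = 0
    for t in r_sorted:
        # Advance the S cursor past tuples that sort before t.
        while j < len(s_sorted) and s_sorted[j] < t:
            j += 1
        if j == len(s_sorted) or s_sorted[j] != t:
            result.append(t)
    return result

r = sorted([(1, 'a'), (2, 'b'), (3, 'c')])
s = sorted([(2, 'b'), (4, 'd')])
print(sorted_difference(r, s))  # [(1, 'a'), (3, 'c')]
```

On disk, this scan reads each sorted relation exactly once, which is where the final M+N term of the 5(M+N) cost comes from.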
Part a. Suppose that
S has a secondary (unclustered) hash index on attribute A, which is a
candidate key for S. Propose an algorithm for R-S that uses the hash
index and analyze its disk I/O cost and buffer use. Express the worst-case cost as
a function of some or all of the parameters M, N, nR, nS, and F, where M is the number of pages for
R, N is the
number of pages for S, nR is the number of
tuples in R, nS is the number of tuples in S,
and F is the number of pages in the buffer. You may add additional
parameters if you need
them. Does it matter
whether attribute A is a candidate key for R?
Part b. Suppose
M=1000, N=20,000, nR=10,000, nS=200,000
and F=150. Which algorithm is more
efficient, the algorithm that sorts R and S and scans in parallel or
your algorithm of Part a? Note that ceiling(sqrt(20,000)+1)
= 143 < F. What change in parameter values would change which
algorithm
wins?
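A quick check of the figures quoted above, using only values stated in the problem:

```python
import math

M, N = 1000, 20_000   # pages of R and S, from Part b
F = 150               # buffer pages

# Buffer requirement stated for the sort-based algorithm: max(sqrt(M), sqrt(N)) + 1
assert math.ceil(max(math.sqrt(M), math.sqrt(N)) + 1) == 143 < F

# So the sort-based algorithm applies, at its stated cost of 5(M+N) page I/Os:
print(5 * (M + N))  # 105000
```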
Problem 3:
Consider the evaluation of the query (R ⋈ S ⋈ T) for
relations R, S, and T, where ⋈ denotes natural join. Relations R, S,
and T share a single attribute, A, and no pair of relations shares any
other attributes. R contains 50,000 tuples and has 5 tuples per
page; S contains 100,000 tuples and has 10 tuples per page; T contains
1000 tuples and has 25 tuples per page. R has a primary (clustered)
hash index on R.A. S has a primary (clustered) B+ tree index on
S.A, which is the primary key for S. T is stored as a heap and has a
secondary (unclustered) hash index on T.A. Forty-five (45) buffer
pages are available in main memory for the evaluation of the query.
Part a:
Note that T can fit in the main memory buffer with 5 pages to
spare. Given this, block nested loop becomes a particularly
attractive join algorithm. If block nested loop is used for both
joins, what is the best order in which to perform the joins so as to
minimize disk I/O? Why? For each join calculation, which is the inner
and which the outer relation? Calculate the disk I/O cost of the two
block nested loop joins for the order you have chosen.
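As a reminder (this is the standard textbook formula, not part of the problem statement), block nested loop join with B buffer pages reads the outer relation once and scans the inner relation once per (B-2)-page block of the outer, since one page buffers the inner scan and one the output. A small helper to evaluate it:

```python
import math

def bnl_cost(outer_pages, inner_pages, buffer_pages):
    """Disk I/O of block nested loop join: read the outer once, and
    scan the inner once per (buffer_pages - 2)-page block of the outer."""
    blocks = math.ceil(outer_pages / (buffer_pages - 2))
    return outer_pages + blocks * inner_pages

# Illustration: a 40-page outer relation with 45 buffer pages fits in a
# single block, so a 10,000-page inner relation is scanned exactly once.
print(bnl_cost(40, 10_000, 45))  # 10040
```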
Part b:
Now consider evaluating the query by using hash join to compute
R ⋈ S first.
i. Is either of the indexes on R and S useful? Explain.
ii. After the hash join, what join algorithm would you use for
the second join? Explain.
iii. Calculate the disk I/O cost for the method of evaluating
(R ⋈ S ⋈ T) that results from your answers to i and ii.
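For reference (again a textbook formula, not given in the problem), a two-phase hash join of relations of M and N pages costs about 3(M+N) page I/Os when one partitioning pass suffices, which requires roughly more buffer pages than the square root of the smaller relation's page count: each relation is read and written once while partitioning, then read once while probing. A sketch, with illustrative numbers only:

```python
import math

def hash_join_cost(m_pages, n_pages, buffer_pages):
    """Approximate I/O cost of a two-phase hash join (partition + probe).
    Assumes a single partitioning pass, which needs roughly
    buffer_pages > sqrt(pages of the smaller relation)."""
    assert buffer_pages > math.sqrt(min(m_pages, n_pages)), \
        "partitions too large; recursive partitioning would be needed"
    return 3 * (m_pages + n_pages)

# Illustrative numbers only: 1,000-page and 500-page relations, 40 buffers.
print(hash_join_cost(1000, 500, 40))  # 4500
```

Whether this one-pass assumption holds for the relations in this problem with 45 buffer pages is part of what Part b asks you to work out.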
Part c:
Have you used pipelining in Parts a and
b? If not, could you? Explain.