COS 425, Fall 2006 - Problem Set 5

Due at 1:30pm, Monday Nov. 20, 2006.


Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties


Problem 1  When discussing I/O costs in class, we used parameters D, the average time to read or write a page from disk, and B, the number of data pages in a file. Parameter D reflects seek time, rotational latency and transfer time. Suppose we introduce additional parameters T, which is only the transfer time of a data page, and P, the number of data pages per cylinder of the disk.

Part a: Using parameters D, B, T, and P, give an equation for the I/O cost of a range search on a sorted file stored contiguously on disk. (I have used the term "sequentially" in class.  "Sequentially" is somewhat ambiguous - does it refer to search key order or disk sector order.  By "contiguously" I mean the file pages are stored in order on the disk so that arm movement is minimized:  consecutively on the same  track, then on the same cylinder, then on adjacent cylinders. There is no rotational latency in reading from one track to the next within a cylinder. )

Part b: Now consider the index structure shown in Figure 10.2 on page 340 of Database Management Systems by Ramakrishnan and Gehrke. Suppose that the index file is sorted by key values (Ki's) and stored contiguously on disk. Let I be the number of data pages in the index file. Assume that the data file has B pages and that these are not stored contiguously on disk. What is the equation for the I/O cost of range search for this index file and data file configuration? When is this a savings over your answer of part a?

Problem 2   In class we executed inserts and delete on an order one B+ tree, finishing with tree shown in this pdf file. 

Part a:  Delete the entry with B+ tree key value 29 from the B+ tree.  Describe each step of the deletion.
Part b:  Delete the entry with B+ tree key value 27 from the B+ tree resulting from Part a.  Describe each step of the deletion.

Problem 3   In this problem we consider B+ trees on a file that has large records: one record requires a page to store. In contrast, the order of the B+ tree is d=16 and each interior node of the B+ tree is stored in one page; therefore up to 32 B+ tree keys and 33 pointers fit in one page.  You may assume the B+ tree  key is a candidate key.

Part a: First consider a B+ tree with data entries that are the actual data records, i.e. each leaf of the B+ tree holds exactly one data record (Alternative 1 in the terminology of Database Management Systems by Ramakrishnan and Gehrke). Note that because each data record is so large, leaves can only be full, not between half full and full.  In all other ways, this B+ tree follows all the B+ tree rules. Document in detail the precise  number of disk page reads and writes necessary for the insertion of a record in the best case. Assume the height of the B+ tree is h1.

Part b: Now consider a B+ tree with data entries that are pointers to the actual records, i.e. each leaf of the B+ tree holds a set of (B+ tree key value, record id) pairs (Alternative 2 in the terminology of Database Management Systems by Ramakrishnan and Gehrke). Assume 32 such pairs can fit, and leaves are kept at least half full (as we are accustomed to).

i. What can you say about the height, h2 of this B+ tree in comparison to h1?  Justify any conclusions you make.
ii. What is the best-case cost of an insert with this B+ tree? As in Part a, document in detail the precise number of disk page reads and writes necessary for the insertion of a record in the best case.

Part c: Discuss the pros and cons of each of the alternatives (represented by parts A and B) for building a B+ tree for this file.

Problem 4   As discussed in class and in Section 10.7 of  Database Management Systems by Ramakrishnan and Gehrke, for B+ trees, there are several ways to deal with records that have the same B+ tree  key value (duplicate keys).   The figure below shows an B+ tree of order 2  for the instance in Figure 10.30 on page 368 of  Database Management Systems by Ramakrishnan and Gehrke.  This B+ tree uses duplicate keys in the interior nodes and has leaves containing pointers of the form <page-id,slot> to the actual data pages.  (See Exercise 10.10, part 2 for an explanation of the <page-id, slot> notation. )

Consider the following different scheme for dealing with duplicate key values:  Interior nodes satisfy B+ tree properties, and there are no duplicate keys in interior nodes.  Each data entry  in a leaf is of the form (B+ tree key value, list of pointers to all records with that key value indexed by the tree).  This means the entries for key values are of variable length in the leaves.  More than one key value can reside in a leaf, but the entire list of pointers for a key value must be in the same leaf.  The only use of overflow pages is if the list of pointers for a key value cannot fit in one leaf;  then the leaf consists of the key value and as much of the list as will fit and the rest of the list is in one or more overflow pages.

Part a.   If a leaf can fit 2d pairs of the form (key value, record pointer), what is a reasonable assumption about the size of the largest list of pointers that can fit in the leaf using the form (key value, list of pointers to all records with that key value indexed by the tree)?
Part b.  The scheme outlined above cannot guarantee that each leaf is at least half full.  What is the problem?
Part c.  Give a tree for the instance in Figure 10.30 that follows this new scheme.
Part d.  How might you fix the problem of Part b without relegating all duplicates to overflow pages?






Problem 5  Let 0, 1, 2, 3,  64, 128, 192,  4,  8,  16,  32 be a sequence of hash key values to be inserted in a dynamic hashing index.  You may view each value in the sequence to be the result of applying some base hash function h to the actual hash field value of a record.

Part a Use extendible hashing starting with 4 empty buckets and h0(x)= h(x) mod 4.   Assume each bucket can hold four data entries.  Show the initial (empty) configuration of the directory and buckets. Insert each value of the sequence in order.  Show the configuration each time the directory changes, including pointer changes.  Show the final configuration.  How many pages are used?  Use the diagramming style of Ramakrishnan and Gehrke, e.g. Figures 11.2 - 11.6 (which is identical to the diagramming style in the class slides).

Part b Use linear hashing starting at level=0 with 4 empty buckets and h0(x)= h(x) mod 4.   Again assume each bucket can hold four data entries.  Show the initial (empty) configuration of the buckets and the initial value of  next. Insert each value of the sequence in order.  Show the configuration each time next changes, including the values of next and level.  Show the final configuration.  How many pages are used?  The diagramming style in the class slides and the diagramming style of Ramakrishnan and Gehrke ( Figures 11.8-11.13) differ slightly.  Use whichever you prefer.