STRING SORTS STUDY GUIDE
Key indexed counting.
Allows you to sort N keys that are integers between 0 and R-1 in time proportional to N + R.
Beats the linearithmic lower bound for compare-based sorting by avoiding binary compares entirely.
This is a completely different philosophy for how things should be sorted. This is the most important concept for this lecture.
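The idea above can be sketched in Java roughly as follows. This is a minimal illustration, not the lecture's exact code: the class name KeyIndexedCounting and the choice to pass the integer keys in a separate array are assumptions made for this example. Each item's key must lie between 0 and R-1.

```java
class KeyIndexedCounting {
    // Returns a new array with items ordered by key; items with equal keys
    // keep their original relative order (the sort is stable).
    static String[] sort(String[] items, int[] keys, int R) {
        int N = items.length;
        int[] count = new int[R + 1];
        String[] aux = new String[N];

        // Step 1: count how often each key occurs (offset by one).
        for (int i = 0; i < N; i++) count[keys[i] + 1]++;

        // Step 2: turn counts into starting indices via cumulative sums.
        for (int r = 0; r < R; r++) count[r + 1] += count[r];

        // Step 3: distribute each item to its slot, left to right.
        for (int i = 0; i < N; i++) aux[count[keys[i]]++] = items[i];

        return aux;
    }
}
```

The three loops touch N, R, and N entries respectively, and no two keys are ever compared against each other, which is where the N + R running time comes from.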
- String - sequence of characters from a finite ordered alphabet.
In Java, our alphabet is the set of all 16-bit integers (representing Unicode characters).
- Radix - just another word for 'base' as in the base of a number system.
For example, the radix for words written in lowercase English letters is 26. For numbers written in Arabic numerals it is 10.
- Radix sort - a sort that works with one character at a time (by grouping objects that have the same digit in the same position).
- Note: I will use 'character' and 'digit' interchangeably in this study guide.
Manually performing LSD and MSD. Should be doable in your sleep.
- Requires fixed length strings (can pad end of strings with 'smaller-than-everything' character).
- Requires a number of calls to charAt() proportional to WN. Why?
- Requires time proportional to W(N + R). Why?
- Why do we consider these run times to be linear, despite the fact that they involve products WN and WR?
- Requires extra space proportional to N + R. Why?
- What sorting technique is used as a subroutine in LSD? Would a standard technique (e.g. quicksort) work? Does the sort need to be stable?
- For a fixed alphabet and key size, what are the best case and worst case inputs for LSD?
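When working through the LSD questions above, it may help to have a concrete version in front of you. The following is a minimal sketch, assuming every string has exactly W characters and every character has a value below 256 (an extended-ASCII assumption made for this example). The class name LSDSketch is hypothetical.

```java
class LSDSketch {
    // Sorts strings that all have length W, one character position at a time,
    // starting from the rightmost (least significant) position.
    static void sort(String[] a, int W) {
        int N = a.length;
        int R = 256;                       // assumed alphabet size
        String[] aux = new String[N];

        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];

            // Count frequencies of the d-th character.
            for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;

            // Convert frequencies to starting indices.
            for (int r = 0; r < R; r++) count[r + 1] += count[r];

            // Distribute; scanning left to right keeps equal keys in order.
            for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];

            // Copy back so the next pass sees the partially sorted order.
            for (int i = 0; i < N; i++) a[i] = aux[i];
        }
    }
}
```

Each of the W passes does the same amount of work regardless of the input, which is worth keeping in mind for the best case and worst case question.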
- What sorting technique is used as a subroutine in MSD? Would a standard technique (e.g. quicksort) work? Does the sort need to be stable?
- How much memory does MSD use? Why is MSD so memory hungry? What sort of inputs result in the greatest memory usage?
- Why is it so important to switch to insertion sort (or another sort) for small subarrays? Why did we not have to do this in LSD?
- For a fixed alphabet and key size, what are the best and worst case inputs for MSD?
- What is the role of our special charAt(String, int) method?
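For the MSD questions above, here is a rough sketch in the style of the textbook's version. It assumes character values below 256, and the class name MSDSketch and the cutoff value of 15 are choices made for this example rather than anything fixed by the lecture. The special charAt() returns -1 past the end of a string, so shorter strings sort before longer ones that share their prefix.

```java
class MSDSketch {
    private static final int R = 256;      // assumed alphabet size
    private static final int CUTOFF = 15;  // assumed small-subarray cutoff

    // Returns the d-th character, or -1 if the string has ended.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        String[] aux = new String[a.length];
        sort(a, aux, 0, a.length - 1, 0);
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        // Small subarrays: insertion sort avoids allocating a count array.
        if (hi <= lo + CUTOFF) { insertion(a, lo, hi, d); return; }

        // A fresh count array of size R + 2 is created on every call.
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];

        // Recursively sort the subarray for each character value.
        for (int r = 0; r < R; r++)
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }

    // Insertion sort on a[lo..hi], comparing only from position d onward.
    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    private static boolean less(String v, String w, int d) {
        return v.substring(d).compareTo(w.substring(d)) < 0;
    }
}
```

Note that every recursive call that gets past the cutoff allocates its own count array, and that the final loop runs over all R character values even when the subarray is tiny.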
3-way String Quicksort.
- Spring 2012, #6
- Fall 2012, #7
- Textbook 5.1.8, 5.1.10
- Fall 2012, #14
- How could we avoid the performance hit from our special charAt() function?
- What makes MSD cache unfriendly?
The addBlock() operation is used to add M Strings to an existing sorted data set of N Strings, where M << N. A
data set of size N is considered sorted if it can be iterated through in sorted order in N time.
COS226 student Frankie Halfbean makes two choices. First, he selects a sorted array as the data structure. Secondly, he selects insertion sort as the core algorithm, explaining that insertion sort is very fast for almost sorted arrays. To add a new block of M Strings, the algorithm simply creates an array of length N+M, copies over the old N values into the new array, copies over the new M values to the end of the array, and finally insertion sort is used to bring everything into order. The old array is left available for garbage collection.
(a) What is the worst case order of growth of the run time as a function of N and M?
(b) Design a scheme that has a better order of growth for the run time in the worst case.
(a) Since M is much less than N, the absolute worst case is that each of the
M items is moved all the way to the front of the original array. In this
case, the run time is proportional to MN.
(b) Use a string sort algorithm such as LSD, MSD, or 3-way String Quicksort. The run time for LSD and MSD sort is proportional to W(M + N + R), where W is the length of the Strings and R is the size of the alphabet. For fixed W and R, this is linear in M + N.