Our objective is to produce color images from the digitized Prokudin-Gorskii glass plate images. We want to stack three color channel images (blue, green, and red) and align them to form a single RGB color image.
Given a digitized glass plate image, we split it into equal thirds. The first section is the blue channel image (B), the second is the green channel image (G), and the third is the red channel image (R). To align G to B, we exhaustively search over a window of possible displacements, and save the displacement vector that yields the best score, computed from some image matching metric to compare B and the shifted (via the displacement vector) G. We then shift G by the best displacement vector, and stack it on B. We perform the same process again to align R to B.
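The single-scale search described above can be sketched as follows. This is a minimal illustration, not the exact implementation; `align_exhaustive`, the SSD scoring, and the circular `np.roll` shift (a stand-in for a proper border-aware shift) are all my assumptions.

```python
import numpy as np

def align_exhaustive(channel, reference, window=15):
    """Search all displacements in [-window, window]^2 and return the
    shift of `channel` that minimizes SSD against `reference`."""
    best_score = np.inf
    best_shift = (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            score = np.sum((shifted - reference) ** 2)  # SSD metric
            if score < best_score:
                best_score, best_shift = score, (dy, dx)
    return best_shift
```

The same routine aligns G to B and R to B; only the inputs change.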
To handle alignment for larger images, where an exhaustive search over displacements becomes slow, we implement a faster search procedure: image pyramids. Given two images, the alignment process recursively resizes them to multiple scales (by factors of 2) and aligns them from the coarsest scale (smallest image size) to the finest. This allows gradual estimation of the best displacement with a much smaller search window at each scale, substantially reducing run time.
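The coarse-to-fine recursion can be sketched like this. It is a simplified version under my own assumptions: 2x subsampling stands in for a proper anti-aliased resize, and the brute-force SSD helper and all names (`align_pyramid`, `_ssd_search`, `min_size`) are hypothetical.

```python
import numpy as np

def _ssd_search(channel, reference, window):
    """Brute-force SSD search over shifts in [-window, window]^2."""
    best, best_shift = np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            score = np.sum(
                (np.roll(channel, (dy, dx), axis=(0, 1)) - reference) ** 2)
            if score < best:
                best, best_shift = score, (dy, dx)
    return best_shift

def align_pyramid(channel, reference, window=3, min_size=32):
    """Estimate the displacement at half resolution first, then refine
    the upscaled estimate at the current scale with a small window."""
    if min(channel.shape) <= min_size:
        return _ssd_search(channel, reference, window)
    # Recurse on a half-resolution pair (naive 2x subsampling).
    dy, dx = align_pyramid(channel[::2, ::2], reference[::2, ::2],
                           window, min_size)
    dy, dx = 2 * dy, 2 * dx  # scale the coarse estimate back up
    # Refine around the coarse estimate.
    rolled = np.roll(channel, (dy, dx), axis=(0, 1))
    rdy, rdx = _ssd_search(rolled, reference, window)
    return dy + rdy, dx + rdx
```

Because each level only refines the previous estimate, the per-level search window stays tiny (here 7x7) regardless of image size.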
Prior to alignment, we crop 10% of the image width/height away from each side to remove the noisy borders.
Features (Bells & Whistles #1)
In the naive implementation, aligning the glass plate images using the SSD of raw pixel intensities as the image matching metric (SSD seems to outperform NCC here) does quite well for most test images. However, there are occasional misalignments, e.g.
Note that the man's outfit yields high raw pixel intensity values in B, but not in R. Hence, when aligning R to B based on the SSD of raw pixel intensities, a good displacement vector can still receive a very poor score.
To fix this problem and capture this variation, we measure image similarity with gradients instead, turning to edge detection. I used Dollár and Zitnick's Fast Edge Detection Using Structured Forests (pdf). Capable of multi-scale edge detection, this edge detector offers superior run-time complexity while maintaining state-of-the-art results. After changing the image matching metric to take the SSD over the edge response maps of the glass plate images, alignment accuracy improves, e.g.
Measuring image similarity with edge detection alone works well overall, but occasionally we run into minor alignment inaccuracies that did not occur with raw pixel intensity, e.g.
Interestingly, these cases performed better under raw pixel intensity SSD than under edge response SSD, indicating that the edge detection method can lose image-specific information. To fix this, I changed the image matching metric to use raw pixel SSD when the two channels are similar in overall intensity (i.e., when the absolute relative difference between their mean pixel intensities falls below a small threshold), and edge response SSD otherwise.
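The metric selection rule can be sketched as below. The function name and the threshold value are illustrative assumptions, not the values used in the actual implementation.

```python
import numpy as np

def choose_metric(a, b, threshold=0.05):
    """Pick raw-pixel SSD when the two channels have similar overall
    brightness, otherwise fall back to SSD over edge-response maps.
    `threshold` is an illustrative value, not the one from the writeup."""
    mean_a, mean_b = a.mean(), b.mean()
    # Absolute relative difference over the mean pixel intensities.
    rel_diff = abs(mean_a - mean_b) / ((mean_a + mean_b) / 2)
    return "raw_ssd" if rel_diff < threshold else "edge_ssd"
```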
Cropping (Bells & Whistles #2)
After alignment, and prior to combining the shifted glass plate images into a single color image, we compute cropping parameters for each individual glass plate image. After generating the color image, we combine these parameters (taking the tightest fit) to crop the final image. Our objective is to filter out as much boundary noise as possible while keeping as much of the original image as possible.
Approach: Given a single channel image, we first run edge detection to obtain an edge response map. We then horizontally average the edge response map to obtain a vector whose length is the image's height. We compute a threshold equal to 2 standard deviations above the mean over all values in this vector. We perform a sequential search among the first 8% of values in the vector from right to left, to find the index of the first value encountered to be higher than the threshold. Similarly, a sequential search from left to right is performed over the last 8% of the vector. Conceptually, this gives us an approximate location of the semantic boundary of the image. The indices are then saved as the top and bottom cropping bounds respectively. A similar process is done with the vector generated by vertically averaging the edge response map, to obtain the left and right cropping bounds.
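A sketch of the top/bottom bound search described above, assuming the edge response map is already computed; the function name and the edge-map construction in the caller are hypothetical, and the left/right bounds would come from the column average analogously.

```python
import numpy as np

def crop_bounds(edge_map, frac=0.08, k=2.0):
    """Return (top, bottom) crop rows: horizontally average the edge
    response map, threshold at mean + k*std, and scan the outer `frac`
    of rows for the first above-threshold value."""
    profile = edge_map.mean(axis=1)            # one value per row
    thresh = profile.mean() + k * profile.std()
    n = len(profile)
    margin = int(frac * n)
    top = 0
    # First 8% of rows, scanned right to left (inside out).
    for i in range(margin - 1, -1, -1):
        if profile[i] > thresh:
            top = i
            break
    bottom = n - 1
    # Last 8% of rows, scanned left to right (inside out).
    for i in range(n - margin, n):
        if profile[i] > thresh:
            bottom = i
            break
    return top, bottom
```

A strong horizontal edge near the plate border (the frame line) produces a spike in the row profile, which is what the threshold picks up.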
Sample cropping results
Automatic white balancing & contrasting (Bells and Whistles #3 & #4)
AWB Approach: We assume that the brightest pixel is white - we then re-scale the pixel intensities for each color channel accordingly. (I did not simulate a neutral illuminant by shifting the average color to gray, since that method made many images appear too "red")
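This white-patch assumption can be sketched per channel as follows; the function name is hypothetical and the per-channel max is my reading of "brightest pixel is white".

```python
import numpy as np

def white_balance(img):
    """White-patch assumption: treat the brightest value in each
    channel as white and rescale that channel to span [0, 1]."""
    out = img.astype(float).copy()
    for c in range(out.shape[2]):
        peak = out[..., c].max()
        if peak > 0:
            out[..., c] /= peak
    return np.clip(out, 0.0, 1.0)
```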
Contrast Approach: Rescale all pixel intensities by YUV luminance value such that the near-brightest pixel (99th percentile of luminance) maps to 1 and the near-darkest pixel (1st percentile) maps to 0.
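A sketch of this luminance-based stretch, under my own assumptions: BT.601 luma weights for Y, a per-pixel gain applied to all three channels, and a hypothetical function name.

```python
import numpy as np

def stretch_contrast(img):
    """Rescale via luminance so the 99th-percentile luminance maps to 1
    and the 1st percentile maps to 0 (sketch of the approach above)."""
    # BT.601 luma weights (assumed; img is H x W x 3, RGB in [0, 1]).
    y = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    lo, hi = np.percentile(y, 1), np.percentile(y, 99)
    y_new = np.clip((y - lo) / (hi - lo), 0.0, 1.0)
    # Scale each pixel's RGB by the luminance gain.
    gain = np.where(y > 1e-8, y_new / np.maximum(y, 1e-8), 0.0)
    return np.clip(img * gain[..., None], 0.0, 1.0)
```

Scaling RGB by a single luminance gain keeps hue roughly constant, unlike stretching each channel independently.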
Colorization (small images): ~1-2 seconds
Colorization (large images): ~60 seconds
Colorization + Cropping (small images): ~1-2 seconds
Colorization + Cropping (large images): ~80-100 seconds
Colorization + Cropping + AWB + Contrast (small images): ~1-3 seconds
Colorization + Cropping + AWB + Contrast (large images): ~80-100 seconds