COS 429 - Computer Vision

Fall 2017



Assignment 3: Tracking

Due Sunday, Nov. 19


In this assignment you will be building a face tracker based on the Lucas-Kanade algorithm. For background, you should read the Lucas-Kanade paper and look at the notes for lectures 13 and 14.


Part 1. Preliminaries (20 points)


Do this:


1.1 Motion

We will be working with two motion models:

  1. Translation by $(u,v)$, where the coordinate $(x,y)$ is updated to a new location $(x',y')$ according to $$ x'=x+u $$ $$ y'=y+v $$
  2. Translation by $(u,v)$ and scale by $s$, where the scale $s$ is defined by the change of the object width from $w$ to $w'$ as: $$ s = \frac{w'-w}{w} $$ Thus $s=0$ means no scale, $s=1$ means the object doubles in size, $s=-0.5$ means it shrinks in half, and $s=-1$ means it shrinks to a singularity.

    To define the motion of a point $(x,y)$ due to scale, one needs to define the "center" of motion. We will refer to this arbitrary point as $(x_0,y_0)$. Thus scaling the point $(x,y)$ means $$ x' = x+(x-x_0)\cdot s $$ $$ y' = y+(y-y_0)\cdot s $$

    After scaling, we apply the translation also, producing the motion equations: $$ x' = x+u+(x-x_0)\cdot s $$ $$ y' = y+v+(y-y_0)\cdot s $$

In the code, the directory uvs/ contains functions that manipulate $(u,v,s)$ motion models. The data structure they use is simply a row vector of 5 elements: [u v s x0 y0]. For example, the function

    wc = uvsCWarp(mot, c)

takes a motion model mot = [u v s x0 y0] and coordinate c = [x,y] and produces the new warped point wc = [x',y'].

Note that most of the functions are written in a vectorized form: they accept multiple rows of motions (and/or points), hence you will see code such as mot(:,U).
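For example, warping a single point with a scale-and-shift motion should reproduce the equations above (the numbers here are illustrative, not from the starter code):

    % Illustrative check of uvsCWarp against the motion equations.
    mot = [3 0 1 10 10];     % [u v s x0 y0]: shift (3,0), double size about (10,10)
    c   = [12 10];           % point [x y]
    wc  = uvsCWarp(mot, c);  % expect x' = 12+3+(12-10)*1 = 17, y' = 10, i.e. [17 10]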


Do this and turn in:


1.2 Rectangles

In frame-to-frame alignment, our goal is to estimate these parameters, $(u,v)$ or $(u,v,s)$, from a local part of a given pair of images. For simplicity, we will limit ourselves to rectangular areas of the image. Our rectangles are defined as rect = [xmin xmax ymin ymax] and the directory rect/ contains functions to manipulate rectangles. (Technical note: the sides of a rect can take on non-integer values, where a rectangle may include only part of some pixels. Pixels are centered on the integer grid, thus a rectangle that contains exactly pixel [1,3] would be rect = [0.5 1.5 2.5 3.5].)
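For instance, under this convention the width, height, and pixel area of a rect follow directly from its sides; the snippet below just illustrates the convention and is not a starter-code function:

    rect = [0.5 1.5 2.5 3.5];   % contains exactly pixel [1,3]
    w = rect(2) - rect(1);      % width  = 1 pixel
    h = rect(4) - rect(3);      % height = 1 pixel
    npix = w * h;               % rectangle covers exactly 1 pixel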

Given two rectangles (of the same aspect ratio), we can compute the $(u,v,s)$ motion between them. This is useful for defining an initial motion for tracking when you have two rectangles that bound the object (e.g. produced by running your face detector from assignment 2!).
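Conceptually, the computation looks like the sketch below (rect2uvs is a hypothetical name, not a starter-code function; the translation is the displacement of the rectangle centers, and the scale follows the definition of $s$ above):

    function mot = rect2uvs(r1, r2)
        % Hypothetical sketch: (u,v,s) motion taking rectangle r1 to r2,
        % with the center of motion placed at the center of r1.
        % Rectangles are [xmin xmax ymin ymax].
        c1 = [(r1(1)+r1(2))/2, (r1(3)+r1(4))/2];  % center of r1
        c2 = [(r2(1)+r2(2))/2, (r2(3)+r2(4))/2];  % center of r2
        s  = (r2(2)-r2(1)) / (r1(2)-r1(1)) - 1;   % s = (w'-w)/w
        mot = [c2-c1, s, c1];                     % [u v s x0 y0]
    end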


Do this and turn in:


1.3 Image sequence

Included with the starter code (in the data/ subdirectory) are the 'girl' and 'david' video sequences of faces from http://vision.ucsd.edu/~bbabenko/project_miltrack.html .

Each frame of the video is just an image (here it is actually stored as a set of PNG files). We could represent the image as just a Matlab matrix, as you have done in the previous assignments, but we will be cutting and warping images, and it is useful for an image to carry a coordinate system that is not necessarily rooted at the top-left corner. Here we will use a "coordinate image" struct (often called coimage or coi in the code). It is just a wrapper around the matrix with some extra fields, for example:

    coi = 
        im: [240x320 double]
        origin: [30 20]
        label: 'img00005'
        level: 1 

The level field will be useful later to keep track of which pyramid-level this image represents.

The directory coi/ contains functions to create and manipulate these images. See the constructor function coimage.m for more information.
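If you just want to see how the fields fit together, a coimage is essentially the struct below (illustrative only; in practice use the coimage.m constructor, whose argument list may differ):

    im  = im2double(imread('img00001.png'));
    coi = struct('im', im, ...              % the pixel data
                 'origin', [1 1], ...       % coordinate-system origin (see coimage.m for the convention)
                 'label', 'img00001', ...   % a name, handy for display and debugging
                 'level', 1);               % pyramid level this image lives on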

The class FaceSequence (coi/FaceSequence.m) is there to simplify access to the tracking sequences described above. To initialize:

    fs = FaceSequence('path/to/data/girl');

Then, to access e.g. the 5th image and read it into a coimage structure, you can do:

    coi = fs.readImage(5); 

(NOTE: this is probably the image img00004.png - don't be confused.) If you want to access a sequence of images, say every 3rd image starting from the 2nd, do:

    fs.next=2; fs.step=3;
    coi1 = fs.readNextImage(); % the 2nd
    coi2 = fs.readNextImage(); % the 5th
    ...

Additionally, fs contains the "ground truth" rectangles stored with the clip in

     rect = fs.gt_rect(1,:); % rectangle for 1st image

Beware: only some frames have valid ground-truth rectangles; for the other frames, gt_rect holds rect = [0 0 0 0].
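Putting these pieces together, a typical read loop might look like this (illustrative; the path and frame range are placeholders):

    fs = FaceSequence('path/to/data/girl');
    for i = 1:10                      % e.g. the first 10 frames
        coi  = fs.readImage(i);
        rect = fs.gt_rect(i,:);
        if all(rect == 0)             % no valid ground truth for this frame
            continue;
        end
        % ... use coi and rect ...
    end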

Do this and turn in:


Part 2: LK at a single level (20 points)

2.1 (u,v) motion

Review the tracking lectures. The Lucas-Kanade algorithm repeatedly warps the "current" image backward according to the current motion estimate so that it matches the "previous" image inside the area defined by the "previous" rectangle.

Let $N$ be the number of pixels inside the rectangle. Recall that we need to compute a motion update $x = (u,v)$ (for translation-only motion) that satisfies: $$ Ax=b $$ where $A$ is an $N \times 2$ matrix, each of whose rows is the image gradient $(dx, dy)$ at some pixel of the previous image, and $b$ is an $N \times 1$ column vector of errors (image intensity differences) between the previous and current image. To solve this with least squares, we compute $$ x = (A^T A)^{-1} A^T b. $$

The new combined motion (the current motion plus the update) should be

    new_mot_u = mot_u + x_u;
    new_mot_v = mot_v + x_v;

Notice that $A$, and hence $(A^T A)^{-1}$, does not change between iterations: we only need to re-warp the current image according to the updated motion.
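Written out, the core of one iteration for the translation-only case might look like the sketch below (this is just the math above, not the exact starter-code structure; match the sign convention for b to the lecture notes):

    % Sketch of one (u,v) LK iteration.
    % Ix, Iy: N x 1 gradients of the previous image inside the rectangle.
    % It:     N x 1 intensity differences between the previous image and
    %         the current image warped by the current motion.
    A = [Ix(:) Iy(:)];          % N x 2
    b = It(:);                  % N x 1
    x = (A' * A) \ (A' * b);    % least-squares update [u; v]
    mot(1:2) = mot(1:2) + x';   % mot = [u v s x0 y0], as in Part 1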

The function LKonCoImage implements the LK algorithm on a single level:

    [mot, err, imot] = LKonCoImage(prevcoi, curcoi, prect, init_mot, params)

A default set of params can be generated with LKinitParams();

Do this and turn in:


2.2 (u,v,s) motion

Now we will implement motion with scale. The formulas are very similar: $x = (u, v, s)$ is the mot_update we want to solve for, and each row of $A$ gains a third entry: $$ A_i = (dx_i \; dy_i \; ww_i) $$ where $ww = dx \cdot (x-x_0) + dy \cdot (y-y_0)$ for each pixel inside the rectangle. (Thus, $A^T A$ will be a $3 \times 3$ matrix.)

Finally, the motion update is a bit more complex, because the update warp is composed with the existing motion (apply the update first, then the existing motion); for the scale this means $(1+s_{\text{new}}) = (1+s)(1+x_s)$:

    new_mot_u = mot_u + x_u + x_u * mot_s;
    new_mot_v = mot_v + x_v + x_v * mot_s;
    new_mot_s = mot_s + x_s + mot_s * x_s;
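In code, the extra column of $A$ and the combined update might look like the following sketch (xs, ys denote the pixel coordinates inside the rectangle; Ix, Iy, It are as in the translation case):

    % Sketch of one (u,v,s) LK iteration.
    ww = Ix(:) .* (xs(:) - x0) + Iy(:) .* (ys(:) - y0);
    A  = [Ix(:) Iy(:) ww];                     % N x 3
    x  = (A' * A) \ (A' * It(:));              % update [u; v; s]
    mot(1) = mot(1) + x(1) + x(1) * mot(3);    % new u
    mot(2) = mot(2) + x(2) + x(2) * mot(3);    % new v
    mot(3) = mot(3) + x(3) + mot(3) * x(3);    % new s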

Do this and turn in:


Part 3: Multi-level LK (35 points)

The function LKonPyramid.m implements the multi-resolution LK. It should call LKonCoImage() for each level of a Gaussian image pyramid. The image pyramid is built using

    pyr = coPyramid(coi, levs);

Each time we go "up" a level, the image is subsampled by a factor of 2. The origin of the coordinate system is kept at the same pixel, but the width and height of any object are halved. A feature at coordinate x_L2 on level 2 would have coordinate x_L1 = 2 * x_L2 on level 1. The function rectChangeLevel() implements level conversions for rectangles.
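For a single coordinate, the conversion is just a power of two (a sketch that ignores any half-pixel subtleties of the pixel-centered convention, which rectChangeLevel handles for you):

    % Move a coordinate from level L_from to level L_to:
    % each level up halves coordinates, each level down doubles them.
    x_to = x_from * 2^(L_from - L_to);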

The first task is to figure out which levels of the pyramid we want to work on. The function defineActiveLevels should return a vector [L_start : L_end], inclusive, of the levels we can work on. For now, L_start should be the lowest level of the given pyramid (since we are willing to work on very big images), but the top level needs to be computed so that the number of pixels inside the rectangle does not fall below the parameter min_pix(1), as sketched below.
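Going up one level divides the rectangle's pixel area by 4, so one possible way to pick the top level is to keep climbing while the area stays at or above min_pix(1). A sketch of that logic (the variable names here are placeholders, not the required implementation):

    % Sketch: choose the highest usable level for rectangle prect.
    area  = (prect(2)-prect(1)) * (prect(4)-prect(3));  % pixels at L_start
    L_end = L_start;
    while L_end < L_top && area/4 >= params.min_pix(1)  % L_top: top of pyramid
        area  = area / 4;      % one level up: half width, half height
        L_end = L_end + 1;
    end
    levels = L_start:L_end;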

We also need to update the motion as we move from level to level. This is done by calling the function uvsChangeLevel() (in the uvs/ subdirectory).


Do this and turn in:


Part 4: Incremental tracking for a sequence (15 points)

The function LKonSequence runs LKonPyramid on a sequence in an incremental way: 1→2, 2→3, 3→4, ...
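Conceptually, the loop looks like the sketch below (the LKonPyramid argument list is assumed to mirror LKonCoImage, and uvsRectWarp is a hypothetical helper that moves a rectangle by a motion; adapt to the actual starter-code interfaces):

    % Sketch of incremental tracking: estimate frame-to-frame motion and
    % carry the rectangle (and motion) forward as initialization.
    prev = fs.readNextImage();
    for i = 2:nframes
        cur = fs.readNextImage();
        [mot, err] = LKonPyramid(prev, cur, prect, init_mot, params);
        prect    = uvsRectWarp(mot, prect);  % hypothetical: move the rectangle by mot
        init_mot = mot;                      % one reasonable initialization for the next pair
        prev     = cur;
    end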


Do this and turn in:


Part 5: Face tracking in your own video! (10 points)

Try running your finished algorithm on your own video sequence, using initial rectangles produced by your face detection algorithm from assignment 2!

If you have a different partner than in assignment 2 you may use either partner's detector.

This is a more open-ended task to help you practice for working on your final projects, where you will not have instructions to guide you. Use your best judgement as to how to solve the problems described below, keeping in mind the goal of demonstrating your LK implementation using a dataset captured "in the wild."

Do this and turn in:


Submitting

Please submit all of your .m files, and the requested images showing the results of your experiments (e.g., video frames with correctly or incorrectly tracked rectangles). Please include one README.pdf file containing a description of all experiments and submitted images, the answers to all questions, and any relevant implementation notes. Please be sure that you do not submit a copy of the data.

The Dropbox link to submit your assignment is here.


Last update 23-Jan-2018 10:17:01