In this work we address joint object category and instance recognition in the context of rapid advances in RGB-D (color plus depth) cameras [16, 3]. We study the recognition problem by collecting a large RGB-D dataset consisting of 31 everyday object categories, 159 object instances, and about 100,000 views of objects captured with both RGB color and depth. Motivated by local distance learning, where elementary distances (over features such as SIFT and spin images) can be combined at the per-view level, we define a view-to-object-instance distance in which per-view distances are weighted and merged. We show that this per-instance distance, obtained by jointly learning the per-view weights, leads to superior classification performance on object category recognition. More importantly, the per-instance distance admits a sparse solution (via the Group Lasso) in which a small subset of representative views of an object is identified and used, and the rest discarded. This not only reduces computational cost but also further improves recognition accuracy. We also empirically compare and validate the use of visual (i.e., RGB) cues, shape (i.e., depth) cues, and their combinations.
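The core idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's method: the feature vectors, the fixed weight vector `w`, and the view/feature dimensions are all made up, and the Group-Lasso optimization that would actually learn the sparse weights is replaced here by a hand-set sparse `w` to show its effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one object instance with V stored views, each view
# described by a D-dimensional feature vector (standing in for SIFT /
# spin-image descriptors extracted from RGB and depth).
V, D = 5, 8
instance_views = rng.normal(size=(V, D))   # stored views of the instance
query_view = rng.normal(size=D)            # a new view to be recognized

# Elementary per-view distances between the query and each stored view.
per_view_dist = np.linalg.norm(instance_views - query_view, axis=1)

# Non-negative per-view weights. In the paper these are learned jointly;
# a Group-Lasso penalty drives entire views' weights to zero, so only a
# sparse subset of representative views survives. Here w is hand-set to
# mimic that sparsity (views 0, 2, 4 discarded).
w = np.array([0.0, 0.7, 0.0, 0.3, 0.0])

# View-to-object-instance distance: weighted merge over the kept views.
instance_dist = np.dot(w, per_view_dist) / w.sum()

kept = np.flatnonzero(w)
print("representative views:", kept)
print("view-to-instance distance:", instance_dist)
```

At test time, only the views with non-zero weight need to be stored and compared against, which is where the computational savings come from.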