require recognition of highly localized attributes of objects while being invariant to their pose and location in the image
Part-based models
construct representations by localizing parts and extracting features conditioned on their detected locations
Holistic models
onstruct a representation of the entire image directly. These include classical image representations
1. Bilinear CNNs for Fine-grained Visual Recognition
Abstract
These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner.
Key insight
several widely-used texture representations can be written as a pooled outer product of two suitably designed features