Abstract:
Fine-grained image classification (FGIC) is challenging because visual differences between subcategories are small, while variations within a single class are large. In this paper, we propose a fusion approach to FGIC that combines global texture with local patch-based information. The first stream extracts deep features from a variety of fixed-size, non-overlapping patches and encodes them through sequential modeling with a long short-term memory (LSTM) network. The second stream computes image-level textures at multiple scales using local binary patterns (LBP). The complementary strengths of both streams are combined into an efficient feature vector for classification. The method is evaluated on six datasets (covering, for example, human faces and food dishes) with four backbone CNNs, and it achieves higher classification accuracy than existing methods by notable margins.
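The two-stream design can be illustrated with a minimal NumPy sketch of the patch split, the multi-scale texture stream, and a fusion step. It rests on several assumptions that are not taken from the paper. The input is grayscale. The LBP is a basic 8-neighbour operator sampled at integer offsets for radii 1, 2 and 3, rather than interpolated circular sampling. Fusion is L2-normalised concatenation. A random placeholder vector stands in for the CNN + LSTM patch encoding. All function names are hypothetical.

```python
import numpy as np

def split_patches(img, patch):
    """Split an H x W image into non-overlapping patch x patch tiles (row-major).
    Border pixels that do not fill a whole tile are dropped (an assumption)."""
    h, w = img.shape[:2]
    rows, cols = h // patch, w // patch
    return [img[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            for r in range(rows) for c in range(cols)]

def lbp_codes(gray, radius=1):
    """Basic 8-neighbour LBP at an integer radius (no interpolation).
    Returns one 8-bit code per interior pixel."""
    g = gray.astype(np.float64)
    h, w = g.shape
    center = g[radius:h - radius, radius:w - radius]
    # Neighbour offsets, clockwise from the top-left corner.
    offsets = [(-radius, -radius), (-radius, 0), (-radius, radius), (0, radius),
               (radius, radius), (radius, 0), (radius, -radius), (0, -radius)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[radius + dy:h - radius + dy, radius + dx:w - radius + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbor >= center).astype(np.int64) << bit
    return codes

def multiscale_lbp_hist(gray, radii=(1, 2, 3)):
    """Concatenate one normalised 256-bin LBP histogram per radius."""
    feats = []
    for r in radii:
        hist = np.bincount(lbp_codes(gray, r).ravel(), minlength=256).astype(np.float64)
        feats.append(hist / hist.sum())
    return np.concatenate(feats)

def fuse(patch_feature, texture_feature):
    """Hypothetical fusion rule: L2-normalise each stream, then concatenate."""
    a = patch_feature / (np.linalg.norm(patch_feature) + 1e-12)
    b = texture_feature / (np.linalg.norm(texture_feature) + 1e-12)
    return np.concatenate([a, b])

# Usage on a synthetic 64 x 64 image.
rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(64, 64))
tiles = split_patches(gray, 16)          # 4 x 4 grid of 16 x 16 tiles
texture = multiscale_lbp_hist(gray)      # 3 radii x 256 bins = 768 values
patch_stream = rng.standard_normal(512)  # placeholder for the CNN + LSTM encoding
fused = fuse(patch_stream, texture)      # 512 + 768 = 1280 values
```

Concatenation keeps the two representations side by side for a downstream classifier. The paper's actual LBP settings and fusion rule may differ.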