## Abstract

Recently, image representation built upon Convolutional Neural Network (CNN) has been shown to provide effective descriptors for image search, outperforming pre-CNN features as short-vector representations. Yet such models are not compatible with geometry-aware re-ranking methods and are still outperformed, on some particular object retrieval benchmarks, by traditional image search systems relying on precise descriptor matching, geometric re-ranking, or query expansion. This work revisits both retrieval stages, namely initial search and re-ranking, by employing the same primitive information derived from the CNN. We build compact feature vectors that encode several image regions without the need to feed multiple inputs to the network. Furthermore, we extend integral images to handle max-pooling on convolutional layer activations, allowing us to efficiently localize matching objects. The resulting bounding box is finally used for image re-ranking. As a result, this paper significantly improves existing CNN-based recognition pipelines: we report for the first time results competing with traditional methods on the challenging Oxford5k and Paris6k datasets.

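## Introduction

1. Extracts a compact image representation from convolutional-layer activations. It encodes multiple image regions without feeding multiple inputs through the network, and the implementation is similar to Fast-RCNN/Faster-RCNN;
2. Extends the generalized-mean approach (from a 2009 paper) so that max-pooling can be computed with integral images, which allows object localization directly on the 2D feature maps (see Figure 1 below);
3. Applies the localization algorithm in the image re-ranking stage, yielding a simple but efficient query expansion method.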

## Background

$f_{\Omega}=[f_{\Omega,1}, \dots, f_{\Omega,i}, \dots, f_{\Omega,K}]^{T}, \quad \text{with}\ f_{\Omega,i}=\max_{p\in \Omega}X_{i}(p)$
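The formula can be read as a channel-wise spatial maximum. The snippet below is a minimal sketch, assuming the activations are stored as a NumPy array `X` of shape `(K, H, W)`, where `X[i]` plays the role of $X_{i}$ and the whole `H x W` grid plays the role of $\Omega$. The function name `mac` and the toy data are illustrative only.

```python
import numpy as np

def mac(X):
    """Channel-wise spatial max. X: (K, H, W) activations -> vector of length K."""
    return X.reshape(X.shape[0], -1).max(axis=1)

# Toy activations: K=3 channels on a 4x5 grid (ReLU output is non-negative).
rng = np.random.default_rng(0)
X = np.maximum(rng.standard_normal((3, 4, 5)), 0.0)

f_omega = mac(X)  # one value per channel, i.e. f_{Omega,1..K}
```

Each entry of `f_omega` keeps only the strongest response of its channel, so all spatial layout is discarded at this stage.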

## Region Encoding

### Region Feature Vector

$f_{R}=[f_{R,1}, \dots, f_{R,i}, \dots, f_{R,K}]^{T}$

### R-MAC

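R-MAC (regional maximum activation of convolutions) is computed as follows:

1. Sample sub-regions at multiple scales;
2. Compute the feature vector of each sub-region;
3. Post-process each vector separately (L2 normalization + PCA-whitening + L2 normalization);
4. Sum the sub-region feature vectors into a single feature vector;
5. Finally, apply $$L2$$ normalization.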
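The five steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `sample_regions` uses a plain uniform grid of square regions with side `2*min(H, W)/(l+1)` at scale `l` rather than an exact overlap rule, and `pca_mean` / `pca_proj` are hypothetical, pre-learned whitening parameters (the demo uses an identity placeholder).

```python
import numpy as np

def l2n(v, eps=1e-12):
    """L2-normalize a vector (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)

def sample_regions(H, W, scales=(1, 2, 3)):
    """Step 1 (simplified): square regions spread uniformly at each scale."""
    regions = []
    for l in scales:
        side = max(1, int(2 * min(H, W) / (l + 1)))
        n_y = max(l, int(np.ceil(H / side)))
        n_x = max(l, int(np.ceil(W / side)))
        ys = np.unique(np.linspace(0, H - side, n_y).round().astype(int))
        xs = np.unique(np.linspace(0, W - side, n_x).round().astype(int))
        regions += [(int(y), int(x), side) for y in ys for x in xs]
    return regions

def rmac(X, pca_mean, pca_proj, scales=(1, 2, 3)):
    """X: (K, H, W) activations; pca_proj: (D, K) whitening matrix (assumed given)."""
    K, H, W = X.shape
    agg = np.zeros(pca_proj.shape[0])
    for y, x, s in sample_regions(H, W, scales):
        f = X[:, y:y + s, x:x + s].reshape(K, -1).max(axis=1)  # step 2: regional max
        f = l2n(f)                                             # step 3: L2
        f = l2n(pca_proj @ (f - pca_mean))                     # step 3: whitening + L2
        agg += f                                               # step 4: sum
    return l2n(agg)                                            # step 5: final L2

# Demo with an identity "whitening" placeholder.
K, H, W = 8, 6, 8
X = np.maximum(np.random.default_rng(0).standard_normal((K, H, W)), 0.0)
v = rmac(X, pca_mean=np.zeros(K), pca_proj=np.eye(K))
```

Because every region vector is normalized before the sum, each region contributes with equal weight regardless of how strong its raw activations are.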

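## Object Localization

### Approximate Integral Max-Pooling

$\tilde{f}_{R,i}=\left(\sum_{p\in R}X_{i}(p)^{\alpha}\right)^{\frac{1}{\alpha}}\approx \max_{p\in R}X_{i}(p)=f_{R,i}$

1. The larger $$\alpha$$ is, the lower the error;
2. The larger the image region is, the higher the error.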
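The approximation can be sketched with one integral image per channel: the sum of $X_{i}(p)^{\alpha}$ over any rectangle then costs four lookups. The code below is a minimal sketch with assumed names (`integral_images`, `approx_max_pool`) and an assumed default of `alpha=10`; the toy activations are kept in `[0.5, 1]` to avoid floating-point underflow in the demo.

```python
import numpy as np

def integral_images(X, alpha=10.0):
    """Integral image of X**alpha per channel, zero-padded on top and left.
    X: (K, H, W) non-negative activations -> array of shape (K, H+1, W+1)."""
    P = np.power(X, alpha).cumsum(axis=1).cumsum(axis=2)
    return np.pad(P, ((0, 0), (1, 0), (1, 0)))

def approx_max_pool(I, y0, x0, y1, x1, alpha=10.0):
    """Approximate max over rows y0..y1-1 and cols x0..x1-1 using four lookups."""
    s = I[:, y1, x1] - I[:, y0, x1] - I[:, y1, x0] + I[:, y0, x0]
    return np.power(np.maximum(s, 0.0), 1.0 / alpha)

# Deterministic toy activations.
K, H, W = 4, 8, 10
X = np.linspace(0.5, 1.0, K * H * W).reshape(K, H, W)

y0, x0, y1, x1 = 2, 3, 6, 9            # a 4x6 region, n = 24 positions
true_max = X[:, y0:y1, x0:x1].max(axis=(1, 2))
approx_10 = approx_max_pool(integral_images(X, 10.0), y0, x0, y1, x1, 10.0)
approx_5 = approx_max_pool(integral_images(X, 5.0), y0, x0, y1, x1, 5.0)

# The estimate never falls below the true max and exceeds it by at most a
# factor n**(1/alpha), which matches the two observations in the list above.
n = (y1 - y0) * (x1 - x0)
upper_bound = true_max * n ** (1.0 / 10.0)
```

The bound `n**(1/alpha)` makes both trends explicit: it shrinks toward 1 as `alpha` grows and it grows with the number of positions in the region.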

### Window Detection

$\hat{R}=\arg\max_{R\subseteq \Omega}\frac{\tilde{f}_{R}^{T}q}{\left\|\tilde{f}_{R}\right\| \left\|q\right\|}$
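To make the objective concrete, the sketch below scores every axis-aligned window by the cosine similarity between its approximately max-pooled vector and the query `q`, and keeps the best one. It is a brute-force illustration of the formula only, not the efficient search used in practice; the function name `localize`, the `step` parameter, and the toy feature map are assumptions.

```python
import numpy as np

def localize(X, q, alpha=10.0, step=1):
    """Exhaustively search windows (y0, x0, y1, x1) maximizing cosine(f_R, q).
    X: (K, H, W) non-negative activations, q: query vector of length K."""
    K, H, W = X.shape
    # Integral image of X**alpha, zero-padded on top and left.
    I = np.pad(np.power(X, alpha).cumsum(axis=1).cumsum(axis=2),
               ((0, 0), (1, 0), (1, 0)))
    qn = q / (np.linalg.norm(q) + 1e-12)
    best_score, best_box = -np.inf, None
    for y0 in range(0, H, step):
        for y1 in range(y0 + 1, H + 1, step):
            for x0 in range(0, W, step):
                for x1 in range(x0 + 1, W + 1, step):
                    s = I[:, y1, x1] - I[:, y0, x1] - I[:, y1, x0] + I[:, y0, x0]
                    f = np.power(np.maximum(s, 0.0), 1.0 / alpha)
                    score = float(f @ qn / (np.linalg.norm(f) + 1e-12))
                    if score > best_score:
                        best_score, best_box = score, (y0, x0, y1, x1)
    return best_box, best_score

# Toy map: weak background, a distractor on channel 1, and an "object"
# that fires on channels 0 and 2 at two neighbouring positions.
X = np.full((4, 6, 6), 0.1)
X[1, 0, 0] = 1.0   # distractor
X[0, 3, 4] = 1.0   # object, channel 0
X[2, 4, 3] = 1.0   # object, channel 2
q = np.array([1.0, 0.0, 1.0, 0.0])

box, score = localize(X, q)  # tightest window covering both object responses
```

With `step=1` the search visits on the order of `H^2 * W^2` windows, which is only practical on small feature maps; a coarser `step` trades localization accuracy for speed.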

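## Experiments

### Setup

- Datasets: Oxford5k (5,063 images) and Paris6k (6,412 images), plus 100k distractor images;
- Evaluation metric: mean Average Precision (mAP);
- PCA: learned on Paris6k when testing on Oxford5k, and vice versa;
- Convolutional features:
  1. AlexNet, pre-trained on ImageNet; the output of the last pooling layer is extracted, with 256 feature channels;
  2. VGG16, pre-trained on ImageNet; the output of the last pooling layer is extracted, with 512 feature channels.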

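## Summary

R-MAC extracts and fuses sub-regions on a fixed grid, so it inevitably aggregates a large amount of background area, which adds background noise. In addition, AML (approximate max-pooling localization) localizes objects through an optimization based on the branch-and-bound algorithm, and it falls short of deep-learning-based object detectors in both detection accuracy and detection efficiency. These are the directions that later papers set out to improve.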