[Paper Notes] Learning from Synthetic Data for Crowd Counting in the Wild


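Dataset: crowd counting datasets are hard to annotate => use GTA-5's map generator to design scenes, place character models by hand, and annotate automatically => we build a large-scale, diverse synthetic dataset - GCC
Model: drawing on experience from image segmentation, the authors propose the SFCN model => pretrain it on GCC first, then fine-tune it on real datasets => reaches SOTA on 4 real datasets, including UCF-QNRF
Model: apply domain adaptation to crowd counting, sparing people the pain of manual annotation => works well and beats the baseline, but falls short of SFCN's SOTA level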

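A synthetic dataset built with a video game as the simulator wins on two counts: the scenes are controllable and the annotation is automatic. The same train-in-a-simulator idea is also common in reinforcement learning settings such as autonomous driving.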

Below, SFCN refers to the rightmost model, the one that uses ResNet101 as its backbone.

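“k(3,3)-c256-s1-d2” represents the convolutional operation with kernel size of 3 × 3, 256 output channels, stride size of 1 and dilation rate of 2. I like this notation and will use it everywhere from now on.
In ResNet101, the stride of conv4_x is set to 1, so the feature map that ResNet outputs is 1/8 the size of the input image. The final Regression Layer uses a single upsample layer to scale it up 8x, back to the original size.
Dilated convolution layers (see the note on Dilation Convolution): they compress the feature map while keeping the receptive field unchanged, so the feature map gets thinner with as little information loss as possible.
Spatial Encoder: models spatial relationships with convolutions.
Regression Layer: squashes the features directly into a single layer at 1/8 of the original image size, then upsamples 8x. Game over.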
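
The notation above is regular enough to parse mechanically. The following is a minimal sketch, not code from the paper: `ConvSpec`, `parse_conv_spec` and `output_size` are hypothetical names introduced here, and only square kernels are handled. It also applies the standard convolution output-size formula to show why a stride-1, dilation-2, 3 × 3 layer can keep the feature map size when padded by 2.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical container for one 'k(3,3)-c256-s1-d2' style layer spec."""
    kernel: int     # side of the (square) kernel
    channels: int   # number of output channels
    stride: int
    dilation: int

    @property
    def effective_kernel(self) -> int:
        # A dilated kernel spans dilation * (kernel - 1) + 1 input pixels.
        return self.dilation * (self.kernel - 1) + 1

    def output_size(self, size: int, padding: int) -> int:
        # Standard convolution output-size formula along one axis.
        return (size + 2 * padding - self.effective_kernel) // self.stride + 1


_PATTERN = re.compile(r"k\((\d+),(\d+)\)-c(\d+)-s(\d+)-d(\d+)")


def parse_conv_spec(text: str) -> ConvSpec:
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid conv spec: {text!r}")
    kh, kw, channels, stride, dilation = map(int, match.groups())
    if kh != kw:
        raise ValueError("this sketch only handles square kernels")
    return ConvSpec(kh, channels, stride, dilation)


spec = parse_conv_spec("k(3,3)-c256-s1-d2")
print(spec)                             # kernel=3, channels=256, stride=1, dilation=2
print(spec.effective_kernel)            # 5
print(spec.output_size(96, padding=2))  # 96: spatial size is preserved
```

With dilation 2 the 3 × 3 kernel covers a 5 × 5 window, so each output pixel sees more context while the layer adds no extra downsampling.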

Ideas

Crowd counting is a pixel-wise task, so many semantic segmentation methods could be a source of inspiration.
A fully convolutional architecture is great: it accepts images of arbitrary size and outputs a result of the same size.
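
To make "pixel-wise" concrete, here is a minimal sketch that assumes the usual density-map formulation of crowd counting, which these notes do not spell out: every annotated head contributes a small normalized Gaussian blob, and the count is the sum over all pixels. The function names, kernel size and sigma are illustrative choices, not values from the paper.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    weights = [
        [math.exp(-((y - half) ** 2 + (x - half) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def density_map(height, width, heads, size=5, sigma=1.0):
    """Place one normalized kernel per head position given as (row, col)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for hy, hx in heads:
        for ky in range(size):
            for kx in range(size):
                y, x = hy + ky - half, hx + kx - half
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[ky][kx]
    return dmap


# Three heads, all far enough from the border that no mass is clipped.
heads = [(4, 4), (10, 12), (11, 13)]
dmap = density_map(16, 20, heads)
count = sum(map(sum, dmap))
print(round(count, 6))
```

Because the target is one value per pixel, the problem has the same input and output shape as semantic segmentation, which is why segmentation architectures transfer so naturally.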

What if we let ResNet produce a feature map of the same size as the input instead of 1/8? Slower, but more accurate?
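
A rough back-of-the-envelope for the "slower" part, as a sketch under assumptions: the per-stage downsampling factors are the standard ResNet ones up to conv4_x, the 768 × 1024 input is an arbitrary example, and `full_res` (every stride set to 1) is a hypothetical configuration rather than anything evaluated in the paper.

```python
from math import prod

# Downsampling factor of each ResNet-101 stage up to conv4_x (standard values).
DEFAULT_STRIDES = {"conv1": 2, "maxpool": 2, "conv2_x": 1, "conv3_x": 2, "conv4_x": 2}


def total_stride(strides):
    """Overall downsampling factor of the backbone."""
    return prod(strides.values())


def positions(height, width, stride):
    """Number of spatial positions in the output feature map."""
    return (height // stride) * (width // stride)


sfcn = {**DEFAULT_STRIDES, "conv4_x": 1}          # the change described in these notes
full_res = {name: 1 for name in DEFAULT_STRIDES}  # hypothetical: no downsampling at all

print(total_stride(DEFAULT_STRIDES))  # 16
print(total_stride(sfcn))             # 8
print(total_stride(full_res))         # 1

h, w = 768, 1024
ratio = positions(h, w, total_stride(full_res)) // positions(h, w, total_stride(sfcn))
print(ratio)                          # 64
```

A full-resolution feature map has 64 times as many positions as the 1/8 one, so the layers operating at that resolution would cost roughly 64 times more compute and activation memory. Removing strides also shrinks the receptive field unless dilation is increased to compensate, so higher accuracy is not guaranteed.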