Spatiotemporal Features -- Learning Spatiotemporal Features with 3D Convolutional Networks

Learning Spatiotemporal Features with 3D Convolutional Networks ICCV 2015
http://vlg.cs.dartmouth.edu/c3d/
https://github.com/facebook/C3D

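This paper uses 3D CNNs to analyze video sequences. The learned spatiotemporal features are called C3D, and the main goal is to find the best 3D filter structure for a 3D CNN.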

Analyzing video data is important work, but it is also a hard problem.
We believe an effective video descriptor needs the following four properties: 1) generic, 2) compact, 3) simple, 4) efficient.

Our C3D is versatile.

3 Learning Features with 3D ConvNets
3.1. 3D convolution and pooling
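We believe 3D CNNs are well suited to learning spatiotemporal features: compared with a 2D CNN, a 3D ConvNet can model temporal information through 3D convolution and 3D pooling.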
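To make the role of the temporal axis concrete, here is a minimal sketch of a single-channel 3D convolution in pure Python. It is an illustration only, not the C3D implementation: the helper names (`conv3d_valid`, `shape`), the toy clip, and the temporal-difference kernel are all made up for this example, and no padding is used.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution (cross-correlation), no padding, stride 1.

    clip:   nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - d + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(d):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(volume):
    """Return (time, height, width) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Toy clip (hypothetical data): 5 frames of 4 x 4 pixels, every pixel of frame t equals t.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(5)]

# A 3 x 3 x d kernel with temporal depth d = 3 that subtracts the first frame
# of each window from the third one, averaged over the 3 x 3 spatial window.
kernel = [
    [[-1.0 / 9] * 3 for _ in range(3)],
    [[0.0] * 3 for _ in range(3)],
    [[1.0 / 9] * 3 for _ in range(3)],
]

response = conv3d_valid(clip, kernel)
print(shape(response))  # the output still has a time axis
```

The output keeps a time dimension, and each value measures how much the pixels changed across three consecutive frames. A kernel with temporal depth 1 sees one frame at a time and cannot express this kind of response.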

Our approach is to first search for the best 3D ConvNet architecture on a small dataset, and then verify it on a large dataset.

Because training deep networks on large-scale video datasets is very time-consuming, we first experiment with UCF101, a medium-scale dataset, to search for the best architecture.

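Common network settings: the network takes a short video clip as input and outputs a label over 101 different actions.
Some settings of the network: all UCF101 frames are resized to 128 × 171. Videos are split into non-overlapped 16-frame clips which are then used as input to the networks. The input size is 3 × 16 × 128 × 171; we also use crops of size 3 × 16 × 112 × 112 as input. The network has 5 convolution layers and 5 pooling layers, 2 fully-connected layers and a softmax loss layer to predict action labels. The numbers of filters in the convolution layers are 64, 128, 256, 256, 256. All convolution kernels have size 3 × 3 × d, where d is the kernel temporal depth.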
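The input pipeline described above can be sketched as index arithmetic. This is a rough sketch under stated assumptions, not the authors' code: the function names are hypothetical, trailing frames that do not fill a whole clip are assumed to be dropped, and the crop position is assumed to be chosen at random.

```python
import random

FRAME_H, FRAME_W = 128, 171   # resized frame size from the text
CLIP_LEN = 16                 # frames per clip
CROP = 112                    # spatial crop size


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame index pairs for non-overlapping clips.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_box(rng, height=FRAME_H, width=FRAME_W, crop=CROP):
    """Pick a (top, left, bottom, right) box for one crop x crop window.

    Assumption: the crop position is sampled uniformly at random.
    """
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


clips = split_into_clips(50)
print(clips)  # [(0, 16), (16, 32), (32, 48)]

rng = random.Random(0)
top, left, bottom, right = random_crop_box(rng)
full_shape = (3, CLIP_LEN, FRAME_H, FRAME_W)
crop_shape = (3, CLIP_LEN, bottom - top, right - left)
print(full_shape, crop_shape)
```

With these sizes a crop can start at any row from 0 to 16 and any column from 0 to 59, so each clip yields many slightly shifted 3 × 16 × 112 × 112 inputs.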

According to the findings in 2D ConvNet [37], small receptive fields of 3 × 3 convolution kernels with deeper architectures yield best results. Hence, for our architecture search study we fix the spatial receptive field to 3 × 3 and vary only the temporal depth of the 3D convolution kernels.
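Since only the temporal depth d varies, its effect on model size is easy to estimate. The sketch below counts convolution parameters for the filter counts listed earlier. The function name `conv_params`, the one-bias-per-filter convention, and the particular depths tried (1, 3, 5, 7) are assumptions for illustration; fully-connected layers are ignored.

```python
FILTERS = [64, 128, 256, 256, 256]  # filters per convolution layer, from the text
IN_CHANNELS = 3                     # RGB input


def conv_params(depths, spatial=3, filters=FILTERS, in_channels=IN_CHANNELS):
    """Count weights and biases of the convolution layers.

    depths: temporal depth d of each layer's 3 x 3 x d kernels, one entry per layer.
    Assumption: each filter has exactly one bias term.
    """
    total = 0
    c_in = in_channels
    for d, c_out in zip(depths, filters):
        total += c_out * (c_in * d * spatial * spatial + 1)  # +1 for the bias
        c_in = c_out
    return total


# Same temporal depth in every layer (illustrative values of d).
for d in (1, 3, 5, 7):
    print(d, conv_params([d] * 5))
```

Because `depths` is a list, the same helper also covers settings where the depth changes from layer to layer, for example `conv_params([3, 3, 5, 5, 7])`.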

Varying network architectures:

The experiments show that d = 3 works best.

3.3. Spatiotemporal feature learning
With the best kernel settled, we next design a better network; its size is limited by the available hardware.
We call the features extracted with this network C3D.
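These notes do not say how per-clip features are combined into one descriptor per video. One simple option, assumed here purely for illustration, is to average the clip features and L2-normalize the result. The function name `video_descriptor` and the toy 2-dimensional vectors are hypothetical stand-ins for real network activations.

```python
import math


def video_descriptor(clip_features):
    """Average per-clip feature vectors and L2-normalize the result.

    Assumption: averaging followed by L2 normalization is the pooling step;
    the text above does not specify one.
    """
    if not clip_features:
        raise ValueError("need at least one clip feature vector")
    dim = len(clip_features[0])
    mean = [sum(f[i] for f in clip_features) / len(clip_features) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean


# Toy 2-dimensional "clip features" standing in for real activations.
features = [[3.0, 0.0], [3.0, 8.0]]
descriptor = video_descriptor(features)
print(descriptor)
```

The descriptor has the same length as a single clip feature and unit norm, so videos of different durations become directly comparable.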

DeepVideo and C3D use short clips while Convolution pooling [29] uses much longer clips.


Scene recognition accuracy

C3D is much faster than real-time, processing at 313 fps