Learning Spatiotemporal Features with 3D Convolutional Networks ICCV 2015
http://vlg.cs.dartmouth.edu/c3d/
https://github.com/facebook/C3D

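This paper uses 3D CNNs to analyze video sequences. The learned spatiotemporal features are called C3D, and the main goal is to find the best 3D filter structure for a 3D CNN.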

Video analysis is an important task, but also a hard one.
We believe an effective video descriptor needs four properties: 1) generic, 2) compact, 3) simple, 4) efficient.

Our C3D features are versatile.

3 Learning Features with 3D ConvNets
3.1. 3D convolution and pooling
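We believe 3D CNNs are well suited to spatiotemporal feature learning. Compared with a 2D CNN, a 3D ConvNet can model temporal information through 3D convolution and 3D pooling.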
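To make the difference concrete, here is a minimal pure-Python sketch of a single-channel 3D convolution (valid padding, stride 1). The helper names `conv3d_valid` and `shape3d` and the toy clip are illustrative assumptions, not code from the C3D repository. A kernel of temporal depth 3 slides along the time axis and returns a volume that still has a time dimension, while a kernel that spans every frame at once (the way a 2D convolution treats stacked frames as channels) collapses time into a single 2D map.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (cross-correlation), valid padding, stride 1.

    volume: nested list indexed [t][y][x]; kernel: nested list indexed [dt][dy][dx].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - d + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0
                for dt in range(d):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape3d(a):
    """Shape of a nested list indexed [t][y][x]."""
    return (len(a), len(a[0]), len(a[0][0]))


# Toy clip: 8 frames of 6 x 6 pixels, every pixel in frame t has value t.
clip = [[[t for _ in range(6)] for _ in range(6)] for t in range(8)]

# 3 x 3 kernel with temporal depth 3 that subtracts frame t from frame t + 2.
kernel = [[[w] * 3 for _ in range(3)] for w in (-1, 0, 1)]
out3d = conv3d_valid(clip, kernel)
print(shape3d(out3d))  # time axis is kept: 8 - 3 + 1 = 6 output frames

# A kernel spanning all 8 frames behaves like a 2D convolution over
# frames-as-channels: the time axis collapses to length 1.
flat_kernel = [[[1] * 3 for _ in range(3)] for _ in range(8)]
out2d = conv3d_valid(clip, flat_kernel)[0]
print(len(out2d), len(out2d[0]))
```

In the toy clip each frame is brighter than the previous one by 1, so the temporal-difference kernel responds with the same value at every output position; the full-depth kernel only sees the sum over all frames and loses the ordering in time.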

Our approach is to first search for the best 3D ConvNet architecture on a smaller dataset, and then verify it on a large dataset.

Because training deep networks on large-scale video datasets is very time-consuming, we first experiment with UCF101, a medium-scale dataset, to search for the best architecture.

Common network settings: the network takes a short video clip as input and predicts one of 101 different actions.
All UCF101 frames are resized to 128 × 171. Videos are split into non-overlapped 16-frame clips which are then used as input to the networks, so the input size is 3 × 16 × 128 × 171. We also use crops of size 3 × 16 × 112 × 112 as input. The network has 5 convolution layers and 5 pooling layers, 2 fully-connected layers and a softmax loss layer to predict action labels. The convolution layers have 64, 128, 256, 256 and 256 filters respectively. All convolution kernels have size 3 × 3 × d, where d is the kernel temporal depth.
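The settings above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, a trailing partial clip is simply dropped, crops are placed at a random position, and the shape trace assumes padded convolutions that keep the clip size, a first pooling layer that leaves time untouched (1 × 2 × 2) and four further pooling layers that halve every axis (2 × 2 × 2). None of those pooling details are stated in the notes above.

```python
import random

CLIP_LEN = 16                      # frames per clip (from the text)
FRAME_H, FRAME_W = 128, 171        # resized frame size (from the text)
CROP = 112                         # crop size (from the text)
FILTERS = [64, 128, 256, 256, 256]  # filters per convolution layer (from the text)


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame ranges of non-overlapping clips.

    Assumption: frames left over after the last full clip are dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_origin(rng, frame_h=FRAME_H, frame_w=FRAME_W, crop=CROP):
    """Pick the top-left corner of a crop x crop window inside the frame."""
    top = rng.randint(0, frame_h - crop)    # randint is inclusive on both ends
    left = rng.randint(0, frame_w - crop)
    return top, left


def trace_shapes(input_shape=(3, CLIP_LEN, CROP, CROP)):
    """Follow (channels, T, H, W) through 5 conv + pool stages.

    Assumptions: convolutions are padded so they keep T, H, W; pooling
    uses stride equal to its kernel and floors odd sizes.
    """
    _, t, h, w = input_shape
    pools = [(1, 2, 2)] + [(2, 2, 2)] * 4
    shapes = []
    for channels, (pt, ph, pw) in zip(FILTERS, pools):
        t, h, w = t // pt, h // ph, w // pw
        shapes.append((channels, t, h, w))
    return shapes


print(split_into_clips(40))              # two full clips, 8 frames dropped
print(random_crop_origin(random.Random(0)))
for stage, shape in enumerate(trace_shapes(), start=1):
    print(f"after conv{stage} + pool{stage}: {shape}")
```

Under these assumptions the 16-frame time axis shrinks to a single step after the fifth pooling stage, which is why a fixed 16-frame clip is a natural input length for five pooling layers.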

According to the findings in 2D ConvNet [37], small receptive fields of 3 × 3 convolution kernels with deeper architectures yield best results. Hence, for our architecture search study we fix the spatial receptive field to 3 × 3 and vary only the temporal depth of the 3D convolution kernels.
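Fixing the spatial size at 3 × 3 means the temporal depth d is the only knob that changes the kernel size, so the convolution weight count grows linearly with d. The sketch below counts weights and biases for the five convolution layers with the filter counts given above; the function names and the particular depths in the loop are illustrative assumptions, and the fully-connected layers are ignored.

```python
FILTERS = [64, 128, 256, 256, 256]  # filters per convolution layer (from the text)
IN_CHANNELS = 3                     # RGB input


def conv_params(in_ch, out_ch, depth, k=3):
    """Weights plus one bias per filter for a k x k x depth 3D convolution."""
    return out_ch * (in_ch * depth * k * k + 1)


def total_conv_params(depth):
    """Sum over the five convolution layers, all using the same temporal depth."""
    total, in_ch = 0, IN_CHANNELS
    for out_ch in FILTERS:
        total += conv_params(in_ch, out_ch, depth)
        in_ch = out_ch
    return total


for d in (1, 3, 5, 7):  # illustrative depths
    print(f"d = {d}: {total_conv_params(d):,} convolution parameters")
```

With d = 1 the kernels are effectively 2D (each frame is filtered on its own), so the gap between d = 1 and larger depths is the parameter cost of letting the filters look across frames.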

Varying network architectures:

The experiments show that a temporal depth of d = 3 works best.

3.3. Spatiotemporal feature learning
With the best kernel in hand, we next design a better network; its size is constrained by hardware capability.
We call the features extracted with this network C3D.

DeepVideo and C3D use short clips while Convolution pooling [29] uses much longer clips.

Scene recognition accuracy

C3D is much faster than real-time, processing at 313 fps
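As a rough sanity check on what 313 fps means in practice, the arithmetic can be written out as below. The 16-frame clip length comes from the settings above; the 25 fps source frame rate is an assumption made for illustration, and the function names are hypothetical.

```python
C3D_FPS = 313    # reported processing speed (from the text)
CLIP_LEN = 16    # frames per clip (from the text)
VIDEO_FPS = 25   # assumed frame rate of the source video


def clips_per_second(fps=C3D_FPS, clip_len=CLIP_LEN):
    """How many 16-frame clips are processed each second."""
    return fps / clip_len


def realtime_factor(video_fps=VIDEO_FPS, fps=C3D_FPS):
    """How many times faster than playback speed the processing runs."""
    return fps / video_fps


def processing_seconds(num_frames, fps=C3D_FPS):
    """Time needed to process a video with the given number of frames."""
    return num_frames / fps


print(f"{clips_per_second():.2f} clips per second")
print(f"{realtime_factor():.2f}x real time at {VIDEO_FPS} fps")
print(f"{processing_seconds(60 * VIDEO_FPS):.2f} s for one minute of video")
```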
