Anisotropic Convolutional Networks for 3D Semantic Scene Completion

"Anisotropic" means direction-dependent: properties that differ along different axes.
CVPR 2020

Abstract

As a voxel-wise labeling task, semantic scene completion (SSC) tries to simultaneously infer the occupancy and semantic labels for a scene from a single depth and/or RGB image. The key challenge for SSC is how to effectively exploit the 3D context to model objects and stuff with severe variations in shape, layout and visibility. To handle such variations, the authors propose a novel module called anisotropic convolution, which offers a flexibility and modeling power that competing methods such as standard 3D convolution and its variants lack. In contrast to standard 3D convolution, which is limited to a fixed 3D receptive field, this module models the dimensional anisotropy voxel-wisely. The basic idea is to enable an anisotropic 3D receptive field by decomposing a 3D convolution into three consecutive 1D convolutions, with the kernel size of each 1D convolution adaptively determined on the fly. By stacking multiple such anisotropic convolution modules, the voxel-wise modeling capability is further enhanced while the number of model parameters stays under control. Extensive experiments on two SSC benchmarks, NYU-Depth-v2 and NYUCAD, show the superior performance of the proposed method.
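The decomposition idea can be made concrete with a small NumPy sketch. This is my own simplified illustration, not the paper's implementation: it replaces one 3D convolution with three consecutive 1D convolutions (one per axis), and on each axis it softly mixes candidate kernels of size 1, 3 and 5 via softmax weights. For simplicity the weights here are global scalars per axis, whereas the paper predicts them per voxel.

```python
import numpy as np

def conv1d_along_axis(vol, kernel, axis):
    """Convolve a 3D volume with a 1D kernel along `axis` (zero padding, stride 1)."""
    pad = len(kernel) // 2
    pad_width = [(0, 0)] * 3
    pad_width[axis] = (pad, pad)
    padded = np.pad(vol, pad_width)
    out = np.zeros(vol.shape)
    for i, w in enumerate(kernel):
        # Shifted slice of the padded volume, weighted by the i-th kernel tap.
        out += w * np.take(padded, np.arange(i, i + vol.shape[axis]), axis=axis)
    return out

def anisotropic_conv(vol, axis_logits):
    """Three consecutive adaptive 1D convolutions (one per axis).

    For each axis, candidate averaging kernels of size 1/3/5 are combined
    with softmax weights derived from that axis's logits. Global scalar
    weights are an assumption of this sketch; the paper's modulation is
    predicted per voxel by a (1 x 1 x 1) convolution.
    """
    kernels = [np.array([1.0]), np.full(3, 1 / 3), np.full(5, 1 / 5)]
    out = vol
    for axis, logits in enumerate(axis_logits):
        w = np.exp(logits - np.max(logits))
        w = w / w.sum()  # softmax over the candidate kernel sizes
        out = sum(wi * conv1d_along_axis(out, k, axis)
                  for wi, k in zip(w, kernels))
    return out
```

Because the three 1D passes are independent, each axis can end up with a different effective kernel size, which is exactly the "anisotropic receptive field" the abstract describes.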

modulation factors

I am mainly interested in the modulation factors inside the paper's AIC (anisotropic convolution) module.
The anisotropic convolution is illustrated in the figure below.
[Figure: the AIC module]
The diagram is easy to follow; the modulation factors sit in its top-left corner.
The authors describe the role of this module as follows:

To enable the model to determine the optimal combination of the candidate kernels and consequently adaptively controlling the context to model different voxels, we introduce a modulation module in the AIC module.

The authors introduce this module so that the network can adaptively control how much each voxel contributes to the semantics. To me, this is essentially a form of self-attention.
[Figure: the modulation module (Eq. 2)]
In Eq. 2, g measures the importance of different voxels, and g is normalized with a softmax.
The authors state that gu(·, ·) "is realized by a 1-layer 3D convolution with kernel (1 × 1 × 1)". This differs from how attention is usually implemented on point clouds: compared with common point-cloud self-attention, it omits the transposed matrix multiplication (the query-key dot product).
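A (1 × 1 × 1) convolution is just a per-voxel linear map over the feature channels, so the modulation can be sketched as follows. This is my own reading of the mechanism, with shapes and names chosen for illustration: `weight`/`bias` stand in for the learned 1×1×1 conv, which produces K logits per voxel (one per candidate kernel) that a softmax turns into modulation factors.

```python
import numpy as np

def softmax_over_kernels(logits):
    """Softmax along axis 0 (the candidate-kernel axis), independently per voxel."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def modulate(responses, features, weight, bias):
    """Per-voxel soft kernel selection (a sketch of the modulation module).

    responses: (K, D, H, W) outputs of the K candidate 1D convolutions
    features:  (C, D, H, W) input feature map
    weight:    (K, C)  -- a 1x1x1 conv is exactly this per-voxel channel mix
    bias:      (K,)
    """
    # 1x1x1 convolution: mix channels at each voxel to obtain K logits per voxel.
    logits = np.einsum('kc,cdhw->kdhw', weight, features) + bias[:, None, None, None]
    g = softmax_over_kernels(logits)    # modulation factors, sum to 1 at each voxel
    return (g * responses).sum(axis=0)  # per-voxel weighted mix of the responses
```

Note there is no query-key product anywhere: the weights come from a single linear map of the local feature, which is why this is lighter than typical point-cloud self-attention.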
In the experiments, the authors also ask "Is it necessary to use modulation factors?" and answer in the affirmative.
Their analysis shows that:

  1. the selected kernel sizes are basically consistent with the object sizes;
  2. the modulation values for different voxels vary a lot within one scene;
  3. the modulation values among the three separable dimensions have significant variation.
