【Question Title】: Using Turi Create Object Detection with CUDA 8.0 on AWS SageMaker Notebook
【Posted】: 2019-11-12 05:41:45
【Question】:

As the title says, I am trying to use Turi Create on an AWS SageMaker Notebook instance with Python 3.6 (the conda_amazonei_mxnet_p36 environment). Although CUDA 10.0 is installed by default, CUDA 8.0 is also preinstalled and can be selected with the following commands in the notebook:

!sudo rm /usr/local/cuda
!sudo ln -s /usr/local/cuda-8.0 /usr/local/cuda

I have verified this installation with `nvcc --version` and with:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery
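Beyond eyeballing the shell output, a notebook cell could check the active toolkit programmatically. Below is a minimal sketch that parses the kind of line `nvcc --version` prints (the sample string is illustrative, taken from a typical CUDA 8.0 install; the function name is mine):

```python
import re

def parse_nvcc_release(nvcc_output):
    """Extract the CUDA release number (e.g. '8.0') from `nvcc --version` output."""
    m = re.search(r"release (\d+\.\d+)", nvcc_output)
    return m.group(1) if m else None

# Typical final line of `nvcc --version` for a CUDA 8.0 toolkit:
sample = "Cuda compilation tools, release 8.0, V8.0.61"
assert parse_nvcc_release(sample) == "8.0"
```

In the notebook, the real output could be captured with `subprocess.check_output(["nvcc", "--version"])` and passed to this helper to confirm the symlink swap took effect.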

Next, I install Turi Create and the mxnet build matching CUDA 8.0 in my notebook:

!pip install turicreate==5.4
!pip uninstall -y mxnet
!pip install mxnet-cu80==1.1.0
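The mxnet wheel name must match the active CUDA toolkit. A small sketch of that pairing, covering only the two combinations tried in this question (the helper name is hypothetical):

```python
# CUDA version -> matching mxnet wheel pin, as used in this question.
MXNET_WHEELS = {
    "8.0": "mxnet-cu80==1.1.0",
    "10.0": "mxnet-cu100==1.4.0.post0",
}

def mxnet_wheel_for(cuda_version):
    """Return the pip requirement string for the given CUDA toolkit version."""
    try:
        return MXNET_WHEELS[cuda_version]
    except KeyError:
        raise ValueError("no mxnet wheel pinned for CUDA %s" % cuda_version)

assert mxnet_wheel_for("8.0") == "mxnet-cu80==1.1.0"
```

Installing a wheel built for a different CUDA version than the one `/usr/local/cuda` points at is a common source of hard-to-diagnose failures.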

Then I prepare the images and try to create the model:

import turicreate as tc

tc.config.set_num_gpus(-1)
images = tc.image_analysis.load_images('images', ignore_failure=True);
data = images.join(annotations_);
train_data, test_data = data.random_split(0.8)
model = tc.object_detector.create(train_data, max_iterations=50)

Running `tc.object_detector.create` produces the following output:

Using 'image' as feature column
Using 'annotaion' as annotations column
Downloading https://docs-assets.developer.apple.com/turicreate/models/darknet.params
Download completed: /var/tmp/model_cache/darknet.params
Setting 'batch_size' to 32
Using GPUs to create model (Tesla K80, Tesla K80, Tesla K80, Tesla K80, Tesla K80, Tesla K80, Tesla K80, Tesla K80)
Using default 16 lambda workers.
To maximize the degree of parallelism, add the following code to the beginning of the program:
"turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)"
Note that increasing the degree of parallelism also increases the memory footprint.
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
_ctypes/callbacks.c in 'calling callback function'()

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/kvstore.py in updater_handle(key, lhs_handle, rhs_handle, _)
     81         lhs = _ndarray_cls(NDArrayHandle(lhs_handle))
     82         rhs = _ndarray_cls(NDArrayHandle(rhs_handle))
---> 83         updater(key, lhs, rhs)
     84     return updater_handle
     85 

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py in __call__(self, index, grad, weight)
   1528                 self.sync_state_context(self.states[index], weight.context)
   1529             self.states_synced[index] = True
-> 1530         self.optimizer.update_multi_precision(index, weight, grad, self.states[index])
   1531 
   1532     def sync_state_context(self, state, context):

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py in update_multi_precision(self, index, weight, grad, state)
    553         use_multi_precision = self.multi_precision and weight.dtype == numpy.float16
    554         self._update_impl(index, weight, grad, state,
--> 555                           multi_precision=use_multi_precision)
    556 
    557 @register

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py in _update_impl(self, index, weight, grad, state, multi_precision)
    535             if state is not None:
    536                 sgd_mom_update(weight, grad, state, out=weight,
--> 537                                lazy_update=self.lazy_update, lr=lr, wd=wd, **kwargs)
    538             else:
    539                 sgd_update(weight, grad, out=weight, lazy_update=self.lazy_update,

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/register.py in sgd_mom_update(weight, grad, mom, lr, momentum, wd, rescale_grad, clip_gradient, out, name, **kwargs)

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py in _imperative_invoke(handle, ndargs, keys, vals, out)
     90         c_str_array(keys),
     91         c_str_array([str(s) for s in vals]),
---> 92         ctypes.byref(out_stypes)))
     93 
     94     if original_output is not None:

~/anaconda3/envs/amazonei_mxnet_p36/lib/python3.6/site-packages/mxnet/base.py in check_call(ret)
    144     """
    145     if ret != 0:
--> 146         raise MXNetError(py_str(_LIB.MXGetLastError()))
    147 
    148 

MXNetError: Cannot find argument 'lazy_update', Possible Arguments:
----------------
lr : float, required
    Learning rate
momentum : float, optional, default=0
    The decay rate of momentum estimates at each epoch.
wd : float, optional, default=0
    Weight decay augments the objective function with a regularization term that penalizes large weights. The penalty scales with the square of the magnitude of each weight.
rescale_grad : float, optional, default=1
    Rescale gradient to grad = rescale_grad*grad.
clip_gradient : float, optional, default=-1
    Clip gradient to the range of [-clip_gradient, clip_gradient] If clip_gradient <= 0, gradient clipping is turned off. grad = max(min(grad, clip_gradient), -clip_gradient).
, in operator sgd_mom_update(name="", wd="0.0005", momentum="0.9", clip_gradient="0.025", rescale_grad="1.0", lr="0.001", lazy_update="True")

Interestingly, if I use CUDA 10.0 with Turi Create 5.6:

!pip install turicreate==5.6
!pip uninstall -y mxnet
!pip install mxnet-cu100==1.4.0.post0

the notebook still fails, but if I then immediately uninstall turicreate and mxnet-cu100 and retry the CUDA 8.0 steps above, it works fine.

The last time it worked, I tried `pip freeze > requirements.txt`, and then after restarting the instance tried `pip install -r requirements.txt`, but I still hit the same error as above (unless I go through the CUDA 10.0 attempt first). What is going on here? Any suggestions are appreciated.

【Question Comments】:

    Tags: python-3.x amazon-web-services turi-create


    【Answer 1】:

    Your update from mxnet 1.1.0 to 1.4.0 is the right fix. The error is unrelated to the CUDA version; it comes from MXNet itself.

    The https://github.com/apache/incubator-mxnet source for mxnet 1.1.0 has no `lazy_update` argument on the `sgd_mom_update` function.

    You can see this by comparing the `sgd_mom_update` call in the optimizer code at mxnet release tag 1.4.0

    https://github.com/apache/incubator-mxnet/blob/a03d59ed867ba334d78d61246a1090cd1868f5da/python/mxnet/optimizer/optimizer.py#L536

    with the optimizer code at mxnet release tag 1.1.0

    https://github.com/apache/incubator-mxnet/blob/07a83a0325a3d782513a04f47d711710972cb144/python/mxnet/optimizer.py#L517

    The change is included in mxnet>=1.3.0, which is why your test on mxnet-cu100==1.4.0.post0 succeeded.
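    Since the failure boils down to a minimum-version requirement, a guard at the top of the notebook could fail fast with a clear message instead of deep inside a KVStore callback. A minimal sketch, using only stdlib string handling (the function names are mine, and `mxnet.__version__` would supply the real input):

```python
def version_tuple(version):
    """Parse a release string like '1.4.0.post0' into a comparable tuple,
    stopping at the first non-numeric component (e.g. 'post0')."""
    parts = []
    for piece in version.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break
    return tuple(parts)

def supports_lazy_update(mxnet_version):
    """True if sgd_mom_update accepts lazy_update (added in mxnet 1.3.0)."""
    return version_tuple(mxnet_version) >= (1, 3, 0)

assert not supports_lazy_update("1.1.0")       # the failing install
assert supports_lazy_update("1.4.0.post0")     # the working install
```

    In practice one would call `supports_lazy_update(mxnet.__version__)` right after the imports and raise a descriptive error if it returns False.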

    【Discussion】:
