Tensorflow gpu uses gpu-ram, but not compute units? | CUPTI_ERROR_INSUFFICIENT_PRIVILEGES

【Question title】: Tensorflow gpu uses gpu-ram, but not compute units? | CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
【Posted】: 2020-07-30 04:14:53
【Question】:

I am trying to use tensorflow-gpu 2.1.0, installed via pip.

The problem: Task Manager on Windows 10 shows GPU utilization at almost zero, between 2% and 5%, while the RAM is almost 100% used. What could be the reason that Task Manager shows the GPU (GTX 1660 Ti) as unused?

With nvidia-smi I get a different picture:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87       Driver Version: 445.87       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 166... WDDM  | 00000000:10:00.0  On |                  N/A |
| 79%   64C    P2   109W / 130W |   5964MiB /  6144MiB |     89%      Default |
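To compare these numbers with Task Manager over time, nvidia-smi can also be polled in its CSV query mode instead of reading the table by eye. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_row` and `query_gpu` are made up for illustration:

```python
import subprocess

# Fields requested from nvidia-smi's CSV query mode.
FIELDS = ("utilization.gpu", "memory.used", "memory.total")


def parse_gpu_row(row: str) -> dict:
    """Parse one 'csv,noheader,nounits' row such as '89, 5964, 6144'."""
    values = [int(v.strip()) for v in row.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {row!r}")
    return dict(zip(FIELDS, values))


def query_gpu(index: int = 0) -> dict:
    """Run nvidia-smi once and return utilization (%) and memory (MiB)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_row(out.strip().splitlines()[0])


if __name__ == "__main__":
    # Values copied from the nvidia-smi table above; no GPU needed for this line.
    print(parse_gpu_row("89, 5964, 6144"))
```

Calling `query_gpu()` in a loop while training runs gives a utilization trace that can be set against what Task Manager displays.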

I am using CUDA 10.1.

The TensorFlow warnings are:

2020-04-16 21:07:55.541837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-16 21:07:58.416796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-16 21:07:58.450054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:10:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.845GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-04-16 21:07:58.450406: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-16 21:07:58.455452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-16 21:07:58.459642: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-16 21:07:58.461515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-16 21:07:58.466455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-16 21:07:58.469085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-16 21:07:58.479479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 21:07:58.480206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-16 21:07:58.480629: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-16 21:07:58.482300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:10:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.845GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-04-16 21:07:58.482677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-16 21:07:58.482875: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-16 21:07:58.483040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-16 21:07:58.483203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-16 21:07:58.483355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-16 21:07:58.483529: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-16 21:07:58.483712: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 21:07:58.484448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-16 21:07:59.249742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-16 21:07:59.250043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-04-16 21:07:59.250203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-04-16 21:07:59.251187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:10:00.0, compute capability: 7.5)
Found 5338 images belonging to 4 classes.
Found 3554 images belonging to 4 classes.
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
2020-04-16 21:08:11.464027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-16 21:08:12.081246: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 21:08:13.727563: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 21:08:15.806688: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-04-16 21:08:15.806850: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 1 GPUs
2020-04-16 21:08:15.808769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll
2020-04-16 21:08:15.909368: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-04-16 21:08:15.910677: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-04-16 21:08:16.092575: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-04-16 21:08:16.092946: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88]  GpuTracer has collected 0 callback api events and 0 activity events.
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.338369). Check your callbacks.

I would like to highlight the error: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
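These CUPTI messages come from TensorFlow's profiler rather than from the training step itself. In this script the profiler is most likely started by the `TensorBoard` callback, whose `profile_batch` argument defaults to profiling the second batch in TF 2.1; that link is a reading of the log, not something the log states outright. A minimal sketch that builds the callback arguments with profiling switched off (`tensorboard_kwargs` is a hypothetical helper):

```python
def tensorboard_kwargs(enable_profiler: bool = False, log_dir: str = "logs") -> dict:
    """Keyword arguments for tf.keras.callbacks.TensorBoard.

    profile_batch=0 tells the callback not to start a profiler session;
    profile_batch=2 is the TF 2.1 default (profile the second batch).
    """
    return {
        "log_dir": log_dir,
        "histogram_freq": 1,  # same value the script passes to TensorBoard
        "profile_batch": 2 if enable_profiler else 0,
    }


print(tensorboard_kwargs())
```

Usage would look like `shower = TensorBoard(**tensorboard_kwargs())`. This only avoids the profiler errors; it does not change how the GPU is used for training.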

The code so far is:

import argparse

from datetime import datetime
import itertools
from six.moves import range

import io
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics

import tensorflow as tf
from tensorflow.keras import applications
from tensorflow.keras.callbacks import TensorBoard, ReduceLROnPlateau, ModelCheckpoint, EarlyStopping, LambdaCallback
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def plot_confusion_matrix(cm, class_names):
    """
    Returns a matplotlib figure containing the plotted confusion matrix.

    Args:
    cm (array, shape = [n, n]): a confusion matrix of integer classes
    class_names (array, shape = [n]): String names of the integer classes
    """

    figure = plt.figure(figsize=(8, 8))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title("Confusion matrix")
    plt.colorbar()
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45)
    plt.yticks(tick_marks, class_names)

    # Normalize the confusion matrix.
    cm = np.around(cm.astype('float') / cm.sum(axis=1)[:, np.newaxis], decimals=2)

    # Use white text if squares are dark; otherwise black.
    threshold = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color = "white" if cm[i, j] > threshold else "black"
        plt.text(j, i, cm[i, j], horizontalalignment="center", color=color)

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return figure


def create_resnet50(img_h: int, img_w: int, num_classes: int):
    # Build an untrained ResNet50 backbone (no top) and add a small classification head
    base_model = applications.resnet50.ResNet50(weights=None, include_top=False, input_shape=(img_h, img_w, 3))

    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dropout(rate=0.3)(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    mdl = Model(inputs=base_model.input, outputs=predictions)
    return mdl


def plot_to_image(figure):
    """Converts the matplotlib plot specified by 'figure' to a PNG image and
    returns it. The supplied figure is closed and inaccessible after this call."""
    # Save the plot to a PNG in memory.
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    # Closing the figure prevents it from being displayed directly inside
    # the notebook.
    plt.close(figure)
    buf.seek(0)
    # Convert PNG buffer to TF image
    image = tf.image.decode_png(buf.getvalue(), channels=4)
    # Add the batch dimension
    image = tf.expand_dims(image, 0)
    return image


def log_confusion_matrix(epoch, logs):
    # Use the model to predict the values from the validation dataset.
    # create list of 256 images, labels
    itx = 256 // bch_size
    test_images, test_labels_raw = [], []
    for i in range(itx):

        tmp_img, tmp_lbs = next(val_gen)
        test_images.extend(tmp_img)
        test_labels_raw.extend(tmp_lbs)

    test_pred_raw = model.predict(np.array(test_images))
    test_pred = np.argmax(test_pred_raw, axis=1)
    test_labels = np.argmax(test_labels_raw, axis=1)

    # Calculate the confusion matrix.
    cm = sklearn.metrics.confusion_matrix(test_labels, test_pred)
    # Log the confusion matrix as an image summary.
    # class_indices maps class name -> index, so the keys are the string names.
    figure = plot_confusion_matrix(cm, class_names=list(val_gen.class_indices.keys()))
    cm_image = plot_to_image(figure)

    # Log the confusion matrix as an image summary.
    with file_writer_cm.as_default():
        tf.summary.image("Confusion Matrix", cm_image, step=epoch)


def run(train_generator, test_generator, epcs: int, mdl: Model, opt):
    # train the model
    mdl.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy', 'mse', ])

    stopper = EarlyStopping(monitor='val_loss', patience=min(epcs / 16, 10), mode='auto',
                            restore_best_weights=True)

    checker = ModelCheckpoint(monitor='val_loss', filepath='weights.{epoch:03d}.hdf5',
                              save_best_only=True, save_freq='epoch')

    shower = TensorBoard(histogram_freq=1)

    reducer = ReduceLROnPlateau(factor=0.6, patience=10, min_delta=1e-4, cooldown=10)

    cm_callback = LambdaCallback(on_epoch_end=log_confusion_matrix)

    # Fit the model passed in as `mdl`, not the module-level `model`.
    history = mdl.fit(train_generator, epochs=epcs, verbose=0,
                      validation_data=test_generator,
                      callbacks=[stopper, checker, shower, reducer, cm_callback]
                      )

    return history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('train_path', type=str, help='Path to the train main folder of files.')
    parser.add_argument('test_path', type=str, help='Path to the test main folder of files.')
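    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly.
    parser.add_argument('new_model', type=lambda s: s.lower() in ('true', '1', 'yes'),
                        help='Create new model (true/false), or load from file.')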
    parser.add_argument('-m', '--model_path', type=str, help='path to model.')
    args = parser.parse_args()

    train_p = args.train_path
    test_p = args.test_path
    is_new = args.new_model
    model_path = args.model_path

    img_height, img_width = 214, 214

    file_writer_cm = tf.summary.create_file_writer('logs/cm')

    model = create_resnet50(img_height, img_width, num_classes=4) if is_new else load_model(model_path)
    adam = Adam(learning_rate=0.0001)

    train_datagen = ImageDataGenerator(
        rescale=1. / 255,
        horizontal_flip=True,
        vertical_flip=True,
        rotation_range=90,
        width_shift_range=0.1,
        height_shift_range=0.1,
        zoom_range=0.2
    )

    validation_datagen = ImageDataGenerator(
        rescale=1. / 255  # same scaling as the training generator
    )
    bch_size = 16

    train_gen = train_datagen.flow_from_directory(directory=train_p, target_size=(img_height, img_width),
                                                  batch_size=bch_size)
    val_gen = validation_datagen.flow_from_directory(directory=test_p, target_size=(img_height, img_width),
                                                     batch_size=bch_size)

    h = run(train_gen, val_gen, 100, model, adam)

    m_name = 'Model_resnet50_epoch{}_score{:3.2f}.hdf5'.format(100, min(h.history['val_loss']))
    model.save(m_name)
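A general note on boolean command-line arguments such as `new_model`: when argparse is given `type=bool`, it calls `bool()` on the raw string, so any non-empty value, including `False`, comes out as `True`. A minimal sketch of a stricter converter (`str2bool` is a hypothetical helper, not part of argparse):

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common true/false spellings to a bool, rejecting anything else."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


naive = argparse.ArgumentParser()
naive.add_argument("new_model", type=bool)

strict = argparse.ArgumentParser()
strict.add_argument("new_model", type=str2bool)

naive_result = naive.parse_args(["False"]).new_model    # bool("False") is True
strict_result = strict.parse_args(["False"]).new_model  # parsed as False
print(naive_result, strict_result)
```

With the strict version, an unrecognised value makes argparse exit with a usage error instead of silently selecting a branch.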

Thank you very much in advance. I really appreciate it!

【Comments】:

Tags: python tensorflow gpu

【Answer 1】:

This has been discussed in many places; in general, the problem is permissions on your GPU. On Ubuntu, you need to create a file /etc/modprobe.d/nvidia-kernel-common.conf containing

options nvidia "NVreg_RestrictProfilingToAdminUsers=0"

A reboot is required for this to take effect. Before rebooting, you should also run

sudo update-initramfs -u

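to make sure the setting above is actually applied. (Many places mention creating the file but not update-initramfs; it did not work for me until I ran that command and rebooted again.)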
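After the reboot, one way to check whether the option was picked up is to read the driver's parameter file. A minimal sketch, assuming a Linux system where the NVIDIA driver exposes `/proc/driver/nvidia/params` with an `RmProfilingAdminOnly` line (0 meaning non-admin users may profile); `profiling_restricted` is a hypothetical helper:

```python
from pathlib import Path
from typing import Optional

PARAMS_FILE = Path("/proc/driver/nvidia/params")  # assumed location on Linux


def profiling_restricted(params_text: str) -> Optional[bool]:
    """Return True if profiling is admin-only, False if open to all users,
    or None when the RmProfilingAdminOnly line is missing."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "RmProfilingAdminOnly":
            return value.strip() != "0"
    return None


if __name__ == "__main__":
    if PARAMS_FILE.exists():
        print(profiling_restricted(PARAMS_FILE.read_text()))
    else:
        print("NVIDIA driver parameter file not found")
```

If this still reports `True` after the reboot, the modprobe option was probably not included in the initramfs, which matches the experience described above.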

【Comments】: