CUDA 2D matrix location/indexing behaves differently for dataframe.to_numpy() and a NumPy 2D array

【Question title】: CUDA 2D matrix location/indexing behaves differently for dataframe.to_numpy() and a NumPy 2D array
【Posted】: 2021-10-25 11:07:45
【Question】:

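I tried the code below. The result differs depending on whether I create the NumPy 2D array myself or obtain it from dataframe.to_numpy(). Can anyone explain why?

That is, the result differs between a = input_matrix.to_numpy() and a = np.array([[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]]).

a = input_matrix.to_numpy() returns the following. I even tried transposing a after to_numpy() (with a = a.T), but the output stayed the same. Can anyone suggest a way to successfully transpose the matrix obtained from to_numpy()?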

input array is
[[  1. 100. 200. 300.]
 [  1. 100. 200. 300.]
 [  1. 100. 200. 300.]]
returned array is
[  1.   1.   1. 100. 200. 300.]

whereas a = np.array([[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]]) returns the following:

input array is
[[  1. 100. 200. 300.]
 [  1. 100. 200. 300.]
 [  1. 100. 200. 300.]]
returned array is
[  1. 100. 200. 300. 200. 100.]

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import pandas as pd
import os
import numpy as np

_path = r"D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64"

if os.system("cl.exe"):
    os.environ['PATH'] += ';' + _path
if os.system("cl.exe"):
    raise RuntimeError("cl.exe still not found, path probably incorrect")

input_matrix = pd.DataFrame(data={'a': [1, 1, 1], 'b': [100, 100, 100], 'c': [200, 200, 200], 'd': [300, 300, 300]})
a = input_matrix.to_numpy()
# a = np.array([[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]])
a = a.astype(np.float32)
print('input array is')
print(a)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

a_out = np.zeros(6)
a_out = a_out.astype(np.float32)
a_out_gpu = cuda.mem_alloc(a_out.nbytes)
cuda.memcpy_htod(a_out_gpu, a_out)

mod = SourceModule("""
  __global__ void matrix_location_trial(float *in_matrix, float *out_matrix)
  {
     out_matrix[0] = in_matrix[0];
     out_matrix[1] = in_matrix[1];
     out_matrix[2] = in_matrix[2];
     out_matrix[3] = in_matrix[3];
     out_matrix[4] = in_matrix[6];
     out_matrix[5] = in_matrix[9];
  }
  """)
      
func = mod.get_function("matrix_location_trial")
func(a_gpu, a_out_gpu, block=(1,1,1))

returned_array = np.empty_like(a_out)
cuda.memcpy_dtoh(returned_array, a_out_gpu)
print('returned array is')
print(returned_array)

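【Comments】:

With .to_numpy() your input array has 3 rows x 4 columns, whereas the one created manually with NumPy has 4 rows x 4 columns. If you pass different inputs in the two cases, the outputs will differ.
Sorry for the mistake here, but after correcting it to "a = np.array([[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]])" the returned result is still different.
Transposing in NumPy does not change the underlying storage order, which is why transposing does not change anything.
So is there another way of transposing that does change the underlying storage order?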
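
The point about transposition can be checked on the CPU without a GPU. The following is a minimal sketch that uses np.asfortranarray as a stand-in for the to_numpy() result (an assumption made to avoid the pandas dependency); it shows that a.T is only a view onto the same buffer, while np.ascontiguousarray produces a copy whose bytes are actually rearranged.

```python
import numpy as np

# Column-major (Fortran-ordered) 3x4 array, standing in for the to_numpy() result.
a = np.asfortranarray(
    [[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]],
    dtype=np.float32,
)

# Transposing returns a view: same buffer, with shape and strides swapped.
t = a.T
transpose_shares_buffer = np.shares_memory(a, t)

# Requesting a row-major copy is what actually rearranges the data in memory.
c = np.ascontiguousarray(a)
copy_shares_buffer = np.shares_memory(a, c)

print(transpose_shares_buffer, copy_shares_buffer)
print(a.strides, t.strides, c.strides)
```

Because the transposed view points at the same bytes, copying it to the device would be expected to send the same data as before, which is consistent with the unchanged output reported in the question.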

Tags: python pandas cuda pycuda

【Solution 1】:

The storage order differs between the two cases:

DataFrame

input_matrix = pd.DataFrame(
    data={
        'a': [1, 1, 1], 
        'b': [100, 100, 100], 
        'c': [200, 200, 200], 
        'd': [300, 300, 300]
    }
)
a = input_matrix.to_numpy().astype(np.float32)
print(a.flags)

Output:

C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

NumPy array

a = np.array(
    [[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]],
    dtype=np.float32
)
print(a.flags)

Output:

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

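The difference between the two cases lies in the values of the C_CONTIGUOUS and F_CONTIGUOUS flags.

Arrays created with np.array() are C_CONTIGUOUS by default, whereas NumPy arrays created by other means are not guaranteed to be, as with input_matrix.to_numpy() in this example.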
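
Since the layout of an array from an arbitrary source cannot be assumed, one option is to normalize it before any host-to-device copy. The sketch below uses a hypothetical helper, as_row_major_float32 (not part of NumPy or PyCUDA), built on np.ascontiguousarray, which copies only when the input is not already row-major with the requested dtype.

```python
import numpy as np


def as_row_major_float32(array_like):
    """Hypothetical helper: return the input as a C-contiguous float32 array."""
    return np.ascontiguousarray(array_like, dtype=np.float32)


# A column-major array, standing in for data that did not come from np.array().
f_ordered = np.asfortranarray(np.arange(12).reshape(3, 4))
fixed = as_row_major_float32(f_ordered)

print(f_ordered.flags['C_CONTIGUOUS'], fixed.flags['C_CONTIGUOUS'])
```

The values and shape are unchanged by the helper; only the in-memory arrangement and dtype differ.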

To fix this, make the array C_CONTIGUOUS again before copying the data to GPU memory, like this:

a = input_matrix.to_numpy()
a = a.astype(np.float32)

# Change order to C_CONTIGUOUS
a = a.copy(order="C")

print('input array is')
print(a)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

After adding that line, I get the following output in both cases:

[[  1. 100. 200.]
[300. 200. 100.]]

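The difference between the C_CONTIGUOUS and F_CONTIGUOUS flags relates to how the array is laid out in memory. C stores data in row-major order, whereas Fortran stores it in column-major order.

NumPy supports storing your data both ways. The NumPy documentation on array memory layout covers this in more detail.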
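
The effect of the two layouts on the kernel in the question can be reproduced on the CPU. This is a rough sketch, assuming np.asfortranarray as a stand-in for the to_numpy() result; it reads the same flat offsets the kernel reads (0, 1, 2, 3, 6, 9) from each layout, using ravel(order="K") to list the elements in the order they sit in memory.

```python
import numpy as np

rows = [[1, 100, 200, 300], [1, 100, 200, 300], [1, 100, 200, 300]]
kernel_indices = [0, 1, 2, 3, 6, 9]  # flat offsets read by the kernel in the question

c_order = np.array(rows, dtype=np.float32)           # row-major
f_order = np.asfortranarray(rows, dtype=np.float32)  # column-major stand-in

# ravel(order="K") returns the elements in the order they occur in memory.
from_c = c_order.ravel(order="K")[kernel_indices]
from_f = f_order.ravel(order="K")[kernel_indices]

print(from_c)  # values the kernel sees for the row-major array
print(from_f)  # values the kernel sees for the column-major array
```

The row-major case yields 1, 100, 200, 300, 200, 100 and the column-major case yields 1, 1, 1, 100, 200, 300, matching the two outputs reported in the question.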
