Why is indexing a numpy string array slower than indexing a numpy object array? Answers

【Question title】: Why numpy array of strings indexing is slower than numpy array of object indexing?
【Posted】: 2021-08-10 14:09:49
【Question】:

Sample code

import numpy as np
import time


class A:
    def __init__(self, n):
        self.n = n

    def str_n(self):
        return str(self.n)


idx = np.asarray(list(range(30000)))
l_a = []
for i in range(400000):
    l_a.append(A(i))

l_a_arr = np.asarray(l_a)
l_a_str_arr = np.asarray([i.str_n() for i in l_a])


s_time = time.time()
l_a_idx_str_arr = l_a_str_arr[idx].tolist()
cost_time = time.time() - s_time
print("String array cost time is ", cost_time)

s_time = time.time()
l_a_idx_arr = l_a_arr[idx].tolist()
cost_time = time.time() - s_time
print("Class array cost time is ", cost_time)

Log:

String array cost time is 0.0014674663543701172
Class array cost time is 0.0003917217254638672

Update
Repeated 1000 times, and also measured with tolist() removed

import numpy as np
import time


class A:
    def __init__(self, n):
        self.inner_n = n + 111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

    def str_n(self):
        return str(self.inner_n)


idx = np.asarray(list(range(30000)))
l_a = []
for i in range(400000):
    l_a.append(A(i))

l_a_arr = np.asarray(l_a)
l_a_str_arr = np.asarray([i.str_n() for i in l_a])

avg_time = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_str_arr = l_a_str_arr[idx].tolist()
    cost_time = time.time() - s_time
    avg_time.append(cost_time)
print("String array cost time with tolist is ", np.average(avg_time))

avg_time1 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_arr = l_a_arr[idx].tolist()
    cost_time = time.time() - s_time
    avg_time1.append(cost_time)
print("Class array cost time with tolist is ", np.average(avg_time1))

avg_time2 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_str_arr = l_a_str_arr[idx]
    cost_time = time.time() - s_time
    avg_time2.append(cost_time)
print("String array cost time is ", np.average(avg_time2))

avg_time3 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_arr = l_a_arr[idx]
    cost_time = time.time() - s_time
    avg_time3.append(cost_time)
print("Class array cost time is ", np.average(avg_time3))

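Log:

String array, average cost time over 1000 runs with tolist: 0.0037294850349426267
Class array, average cost time over 1000 runs with tolist: 0.00030662870407104493
String array, average cost time over 1000 runs: 0.0014972503185272216
Class array, average cost time over 1000 runs: 0.0001489844322204589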

The string array holds only part of what the object array holds, so why does indexing it take more time?

【Comments on the question】:

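Please remove the .tolist() call and run the benchmark again.
You need to repeat the statement you are timing many times to get an accurate estimate of its execution time.
@CaptainTrojan I removed tolist(), and the result is still the same.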
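
The advice to repeat the timed statement can be followed with the standard-library timeit module instead of hand-written time.time() loops. This is only a minimal sketch: the array sizes are smaller than in the question, the objects are plain object() instances, and best_time is a hypothetical helper name, none of which comes from the original post.

```python
import timeit

import numpy as np

# Assumed sizes, chosen only to keep the sketch quick to run.
idx = np.arange(30_000)
obj_arr = np.empty(100_000, dtype=object)
obj_arr[:] = [object() for _ in range(100_000)]
str_arr = np.array([str(i) for i in range(100_000)])


def best_time(func, repeat=5, number=50):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


t_str = best_time(lambda: str_arr[idx])
t_obj = best_time(lambda: obj_arr[idx])
print("string array indexing:", t_str)
print("object array indexing:", t_obj)
```

Taking the minimum over several repeats reduces the influence of unrelated system noise on the measurement.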

Tags: python arrays list numpy memory

【Solution 1】:

An object dtype array is like a list: it stores references to objects. Indexing it is almost as fast as indexing a list.
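
A small sketch of what "stores references" means in practice. The class Item is a hypothetical stand-in for the question's class A, and the sizes are arbitrary.

```python
import numpy as np


class Item:  # hypothetical stand-in for the question's class A
    def __init__(self, n):
        self.n = n


items = [Item(i) for i in range(5)]
obj_arr = np.empty(len(items), dtype=object)
obj_arr[:] = items  # each slot now holds a reference to one Item

picked = obj_arr[np.array([0, 3])]  # fancy indexing copies references only

print(obj_arr.dtype)          # object
print(obj_arr.itemsize)       # size of one pointer on this platform
print(picked[1] is items[3])  # True: the very same Python object
```

Because only pointers are copied, the cost of indexing does not depend on how large each referenced object is.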

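A string dtype array stores its strings as raw bytes in fixed-width slots, just as a numeric array stores numbers. Indexing individual elements is slower because each one has to be converted from numpy bytes into a Python string ("unboxing").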
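
The following sketch illustrates the storage difference on a tiny array with made-up values. A unicode dtype uses 4 bytes per character and every slot is as wide as the longest string, so with long strings like those in the second benchmark each element is far wider than a pointer. That probably explains why plain fancy indexing is slower even without tolist().

```python
import numpy as np

str_arr = np.array(["7", "42", "1234567890"])

print(str_arr.dtype.kind)  # U (fixed-width unicode)
print(str_arr.itemsize)    # 40: 10 characters * 4 bytes, for every element

a = str_arr[2]
b = str_arr[2]
print(type(a))  # <class 'numpy.str_'>
print(a is b)   # False: each access builds a fresh object from the bytes

as_list = str_arr.tolist()
print(type(as_list[0]))  # <class 'str'>: tolist() unboxes every element
```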

It is best to work with arrays as a whole rather than iterating over them element by element.
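
As a rough illustration of that advice, the sketch below counts matches in a string array in two ways. The array contents and the function names elementwise and whole_array are assumptions made for this example, not part of the original answer.

```python
import timeit

import numpy as np

str_arr = np.array([str(i) for i in range(100_000)])


def elementwise():
    # Iterating unboxes every element into a separate scalar object.
    return sum(1 for s in str_arr if s == "4242")


def whole_array():
    # One vectorized comparison over the whole buffer, no per-element objects.
    return int(np.count_nonzero(str_arr == "4242"))


assert elementwise() == whole_array()
print("element by element:", timeit.timeit(elementwise, number=5))
print("whole array:       ", timeit.timeit(whole_array, number=5))
```

Both functions return the same count. Only the second keeps the work inside numpy's compiled loops.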

【Discussion】: