介绍及解决方案代码
np.einsum,确实很难被击败,但是在极少数情况下,你仍然可以击败它,如果你可以将matrix-multiplication 带入计算。经过几次试验,您似乎可以引入matrix-multiplication with np.dot 以超越np.einsum('ia,aj,ka->ijk', A, B, C) 的性能。
基本思想是我们将“all einsum”操作分解为np.einsum和np.dot的组合,如下所示:
-
A:[i,a] 和 B:[a,j] 的求和是用 np.einsum 完成的,以便我们得到 3D array:[i,j,a]。
- 然后将这个 3D 数组重新整形为
2D array:[i*j,a],并将第三个数组 C[k,a] 转置为 [a,k],目的是在这两者之间执行 matrix-multiplication,得到 [i*j,k] 作为矩阵积,因为我们在那里丢失了索引[a]。
- 产品被重新整形为
3D array:[i,j,k],用于最终输出。
这是目前讨论的第一个版本的实现 -
import numpy as np
def tensor_prod_v1(A,B,C): # First version of proposed method
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a A[i,a] B[a,j] to get a 3D array with indices as (i,j,a)
AB = np.einsum('ia,aj->ija', A, B)
# Calculate entire summation losing a-ith index & reshaping to desired shape
return np.dot(AB.reshape(m*n,d),C.T).reshape(m,n,p)
由于我们要对所有三个输入数组的 a-th 索引求和,因此可以使用三种不同的方法对第 a 个索引求和。前面列出的代码用于(A,B)。因此,我们还可以使用(A,C) 和(B,C) 为我们提供另外两个变体,如下所示:
def tensor_prod_v2(A,B,C):
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a A[i,a] C[k,a] to get a 3D array with indices as (i,k,a)
AC = np.einsum('ia,ja->ija', A, C)
# Calculate entire summation losing a-ith index & reshaping to desired shape
return np.dot(AC.reshape(m*p,d),B).reshape(m,p,n).transpose(0,2,1)
def tensor_prod_v3(A,B,C):
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a B[a,j] C[k,a] to get a 3D array with indices as (a,j,k)
BC = np.einsum('ai,ja->aij', B, C)
# Calculate entire summation losing a-ith index & reshaping to desired shape
return np.dot(A,BC.reshape(d,n*p)).reshape(m,n,p)
根据输入数组的形状,不同的方法会产生不同的加速比,但我们希望所有方法都比all-einsum 方法更好。性能数字在下一节中列出。
运行时测试
这可能是最重要的部分,因为我们尝试使用所提议方法的三种变体来研究加速数字
问题中最初提出的all-einsum 方法。
数据集 #1(等形数组):
In [494]: L1 = 200
...: L2 = 200
...: L3 = 200
...: al = 200
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [495]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 470 ms per loop
1 loops, best of 3: 391 ms per loop
1 loops, best of 3: 446 ms per loop
1 loops, best of 3: 3.59 s per loop
数据集 #2(更大的 A):
In [497]: L1 = 1000
...: L2 = 100
...: L3 = 100
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [498]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 442 ms per loop
1 loops, best of 3: 355 ms per loop
1 loops, best of 3: 303 ms per loop
1 loops, best of 3: 2.42 s per loop
数据集 #3(更大的 B):
In [500]: L1 = 100
...: L2 = 1000
...: L3 = 100
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [501]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 474 ms per loop
1 loops, best of 3: 247 ms per loop
1 loops, best of 3: 439 ms per loop
1 loops, best of 3: 2.26 s per loop
数据集 #4(更大的 C):
In [503]: L1 = 100
...: L2 = 100
...: L3 = 1000
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
In [504]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 250 ms per loop
1 loops, best of 3: 358 ms per loop
1 loops, best of 3: 362 ms per loop
1 loops, best of 3: 2.46 s per loop
数据集 #5(更大的第一个维度长度):
In [506]: L1 = 100
...: L2 = 100
...: L3 = 100
...: al = 1000
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [507]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 373 ms per loop
1 loops, best of 3: 269 ms per loop
1 loops, best of 3: 299 ms per loop
1 loops, best of 3: 2.38 s per loop
结论:我们看到 8x-10x 的加速比问题中列出的 all-einsum 方法的建议方法不同。