二维矩阵/数组与一维数组/向量之间的相关性:
我们可以调整corr2_coeff_rowwise 用于2D 数组/矩阵和一维数组/向量之间的相关性,就像这样 -
def corr2_coeff_2d_1d(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1,keepdims=1)
B_mB = B - B.mean()
# Sum of squares across rows
ssA = np.einsum('ij,ij->i',A_mA,A_mA)
ssB = B_mB.dot(B_mB)
# Finally get corr coeff
return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
要打乱每一行并对所有行执行此操作,我们可以使用np.random.shuffle。现在,这个 shuffle 函数沿第一个轴工作。所以,为了解决我们的问题,我们需要输入转置版本。另外,请注意,这种改组将在原地完成。因此,如果在其他地方需要原始数据框,请在处理前进行复制。因此,解决方案是 -
因此,让我们用它来解决我们的问题 -
# Extract underlying arry data for faster NumPy processing in loop later on
a = df.values
s_ar = s.values
# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]
for i in range(N):
# Get shuffled indices per row with `rand+argsort/argpartition` trick from -
# https://stackoverflow.com/a/45438143/
idx = np.random.rand(*a.shape).argsort(1)
# Shuffle array data with NumPy's advanced indexing
shuffled_a = a[r, idx]
# Compute correlation
corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
优化版本#1
现在,我们可以预先计算涉及在迭代之间保持不变的系列的部分。因此,进一步优化的版本将如下所示 -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]
B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)
A = a
A_mean = A.mean(1,keepdims=1)
for i in range(N):
# Get shuffled indices per row with `rand+argsort/argpartition` trick from -
# https://stackoverflow.com/a/45438143/
idx = np.random.rand(*a.shape).argsort(1)
# Shuffle array data with NumPy's advanced indexing
shuffled_a = a[r, idx]
# Compute correlation
A = shuffled_a
A_mA = A - A_mean
ssA = np.einsum('ij,ij->i',A_mA,A_mA)
corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
基准测试
使用实际用例形状/大小设置输入
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1.原始方法
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
2。建议的方法
预处理部分(仅在开始循环之前完成一次,因此不包括计时)-
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
部分建议的解决方案循环运行 -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
因此,我们在这里看到了大约 3x 的加速!