Shuffling the rows of a large pandas DataFrame and correlating them with a Series: answers

【Question title】: Shuffling rows of large pandas DataFrame and correlation with a series
【Posted】: 2018-05-28 14:01:28
【Question description】:

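I need to shuffle each row of a large pandas DataFrame independently several times (a typical shape is (10000, 1000)) and then estimate the correlation of each row with a given Series.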

The most efficient (= fastest) way I have found while staying within pandas is the following:

for i in range(N):  # the larger N is, the better
    # df is my large DataFrame, with 10K rows and 1K columns
    df_sh = df.apply(numpy.random.permutation, axis=1)

    # s is the provided Series (s.shape == (1000,))
    corr = df_sh.corrwith(s, axis=1)

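The two tasks take roughly the same time (about 30 seconds each). I tried converting my DataFrame to a numpy.array, running a for loop over the array and, for each row, first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, this sped up my two tasks by only a factor of 2. Are there other viable options to speed the tasks up further? (Note: I already run my code in parallel with Joblib, up to the maximum factor allowed by the machine I am using.)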

【Question discussion】:

Tags: performance pandas numpy permutation correlation

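【Solution 1】:

Correlation between a 2D matrix/array and a 1D array/vector:

We can adapt corr2_coeff_rowwise to compute the correlation between a 2D array/matrix and a 1D array/vector, like so: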

import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Subtract the row-wise mean of A and the mean of B from the inputs themselves
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean()

    # Sum of squares along each row of A, and for B
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally, get the correlation coefficient of each row of A with B
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
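As a sanity check, the row-wise function can be compared against np.corrcoef applied one row at a time. The snippet below is a minimal sketch on small random inputs; the array sizes and the seed are arbitrary choices for illustration and are not taken from the question.

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Center each row of A and the vector B
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean()
    # Row-wise sum of squares of A, and sum of squares of B
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)
    # Pearson correlation of every row of A with B
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
A = rng.random((5, 8))          # 5 rows, 8 columns (illustrative sizes)
B = rng.random(8)

fast = corr2_coeff_2d_1d(A, B)
# Reference: Pearson correlation computed one row at a time
reference = np.array([np.corrcoef(row, B)[0, 1] for row in A])

print(np.allclose(fast, reference))
```

The vectorized version avoids the Python-level loop over rows, which is where most of the time goes for 10K rows.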

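One option for shuffling all rows at once is np.random.shuffle. That function works along the first axis, so it would have to be fed the transposed array, and it shuffles in place, so make a copy first if the original DataFrame is needed elsewhere. However, as pointed out in the discussion below, shuffling the transpose applies the same permutation to every row rather than shuffling each row independently.

So the code below generates an independent set of shuffled indices for each row instead: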

# Extract the underlying array data for faster NumPy processing in the loop later on
a = df.values
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
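To see what the rand + argsort trick does, it helps to run it on a tiny array. Sorting the indices of independent uniform random numbers along each row yields a random ordering of the column positions, drawn separately for every row. The sketch below uses a made-up 3x4 array and an arbitrary seed, neither of which comes from the answer.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for reproducibility only

a = np.arange(12).reshape(3, 4)      # small illustrative array
r = np.arange(a.shape[0])[:, None]   # column vector of row indices, shape (3, 1)

# One independent random ordering of the column indices per row
idx = np.random.rand(*a.shape).argsort(axis=1)

# r broadcasts against idx, so element (i, j) of the result is a[i, idx[i, j]]
shuffled_a = a[r, idx]

print(idx)
print(shuffled_a)
```

Each row of shuffled_a contains exactly the values of the corresponding row of a, only reordered, and the ordering generally differs from row to row.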

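Optimized version #1

We can precompute the parts involving the Series, since they stay the same between iterations. The row means of the array are likewise unaffected by shuffling within a row, so they can be computed once as well. A further optimized version would look like this: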

a = df.values  
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1,keepdims=1)

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
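A possible further step, which is not part of the answer above, follows from the same reasoning: the per-row sum of squares ssA is also unchanged by a permutation within a row, so in principle the centering and the denominator can be moved out of the loop too. The snippet below is only a sketch of that idea on small random data (the sizes, seed, and variable names a_centered and denom are illustrative assumptions), and it checks the result against np.corrcoef.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
a = rng.random((6, 10))          # stand-in for df.values
s_ar = rng.random(10)            # stand-in for s.values
r = np.arange(a.shape[0])[:, None]

# Everything below is invariant under shuffling within a row
B_mB = s_ar - s_ar.mean()
ssB = B_mB.dot(B_mB)
a_centered = a - a.mean(axis=1, keepdims=True)
ssA = np.einsum('ij,ij->i', a_centered, a_centered)
denom = np.sqrt(ssA * ssB)

# Inside the loop, only the shuffle and one matrix-vector product remain
idx = rng.random(a.shape).argsort(axis=1)
corr = a_centered[r, idx].dot(B_mB) / denom

# Reference: shuffle the raw data with the same indices, correlate row by row
shuffled_a = a[r, idx]
reference = np.array([np.corrcoef(row, s_ar)[0, 1] for row in shuffled_a])
print(np.allclose(corr, reference))
```

Whether this gives a measurable gain on the real (10000, 1000) data would need to be timed, since the random number generation and the argsort may dominate the loop.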

Benchmarking

Set up inputs with the shape/size of the actual use case:

In [302]: df = pd.DataFrame(np.random.rand(10000,1000))

In [303]: s = pd.Series(df.iloc[0])

1. Original method

In [304]: %%timeit
     ...: df_sh = df.apply(np.random.permutation, axis=1)
     ...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop

2. Proposed method

Preprocessing part (done only once before the loop starts, so not included in the timing):

In [305]: a = df.values  
     ...: s_ar = s.values
     ...: r = np.arange(a.shape[0])[:,None]
     ...: 
     ...: B = s_ar
     ...: B_mB = B - B.mean()
     ...: ssB = B_mB.dot(B_mB)
     ...: 
     ...: A = a
     ...: A_mean = A.mean(1,keepdims=1)

The part of the proposed solution that runs inside the loop:

In [306]: %%timeit
     ...: idx = np.random.rand(*a.shape).argsort(1)
     ...: shuffled_a = a[r, idx]
     ...: 
     ...: A = shuffled_a
     ...: A_mA = A - A_mean
     ...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
     ...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop

So we see a speedup of roughly 3x here (1.99 s vs. 675 ms per iteration).

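【Discussion】:

I will test it tomorrow and let you know, but it sounds promising; I had not thought of randomizing the indices of the matrix instead of the entries.
At first I also went with np.random.shuffle(a.T), but unfortunately I realized that it does not randomize each row independently: if the i-th entry of row k moves to position j, then the i-th entry of row h moves to the same position. That is why, using your notation, I looped over a. I could not find a way to avoid such a loop.
That is why I like your original answer, i.e. randomizing the indices.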
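The difference described in these comments can be reproduced on a small array. The sketch below starts from identical rows, so a shared permutation leaves all rows equal while independent permutations do not. It also uses Generator.permuted, which was not mentioned in the thread; as far as I know it is available in NumPy 1.20 and later and shuffles each slice along the given axis independently. The array sizes and seeds are arbitrary.

```python
import numpy as np

base = np.tile(np.arange(6), (50, 1))   # 50 identical rows: 0, 1, ..., 5

# Shuffling the transpose in place reorders whole columns of the array,
# so every row receives the same permutation
same = base.copy()
np.random.seed(0)                       # arbitrary seed
np.random.shuffle(same.T)
print((same == same[0]).all())          # all rows are still identical

# Generator.permuted with axis=1 permutes each row on its own
# and returns a new array, leaving base untouched
rng = np.random.default_rng(0)          # arbitrary seed
indep = rng.permuted(base, axis=1)
print((indep == indep[0]).all())        # rows now differ from one another
```

This could serve as an alternative to the rand + argsort indexing when a recent enough NumPy is available, though its speed relative to the approach in the answer was not measured here.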