Optimizing the Cartesian product between two Pandas DataFrames: answers

【Question title】: Optimizing cartesian product between two Pandas Dataframe
【Posted】: 2020-05-08 00:14:08
【Question】:

I have two DataFrames with the same columns:

DataFrame 1:

          attr_1  attr_77 ... attr_8
userID                              
John      1.2501  2.4196  ... 1.7610
Charles   0.0000  1.0618  ... 1.4813
Genarito  2.7037  4.6707  ... 5.3583
Mark      9.2775  6.7638  ... 6.0071

DataFrame 2:

          attr_1  attr_77 ... attr_8
petID                              
Firulais  1.2501  2.4196  ... 1.7610
Connie    0.0000  1.0618  ... 1.4813
PopCorn   2.7037  4.6707  ... 5.3583

I want to build a DataFrame with the correlation and p-value of every possible combination, like this:

   userId   petID      Correlation    p-value
0  John     Firulais   0.091447       1.222927e-02
1  John     Connie     0.101687       5.313359e-03
2  John     PopCorn    0.178965       8.103919e-07
3  Charles  Firulais   -0.078460      3.167896e-02

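The problem is that the Cartesian product generates more than 3 million tuples, and it takes minutes to finish. Here is my code; I wrote two alternatives:

First of all, the initial DataFrames: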

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')

Option 1:

# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)

        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After loop, creates the resulting Dataframe
result = pd.DataFrame(data={
    'userID': users,
    'petID': pets,
    'Correlation': coff,
    'p-value': p_value
})

Option 2 ~~(slower)~~, taken from here:

# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)

    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1

    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))

    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)

    return res

# Computes Pearson Coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
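The temporary-key merge above can also be written with the built-in cross join, assuming pandas 1.2 or newer, where `merge` accepts `how="cross"`. The following is only a minimal sketch on a small hand-made sample; the suffixes `_user` and `_pet` are names chosen for illustration and are not part of the original code.

```python
import pandas as pd

# Small sample in the same shape as the question's data (IDs in the index).
df1 = pd.DataFrame(
    {"attr_1": [1.2501, 0.0], "attr_77": [2.4196, 1.0618]},
    index=pd.Index(["John", "Charles"], name="userID"),
)
df2 = pd.DataFrame(
    {"attr_1": [1.2501, 0.0, 2.7037], "attr_77": [2.4196, 1.0618, 4.6707]},
    index=pd.Index(["Firulais", "Connie", "PopCorn"], name="petID"),
)

# reset_index() turns each index into a regular column so that the IDs
# survive the merge; how="cross" pairs every left row with every right row.
cross = df1.reset_index().merge(
    df2.reset_index(), how="cross", suffixes=("_user", "_pet")
)
print(cross)
```

This removes the need to add and drop a `_tmpkey` column, but the merged frame still holds one full row per pair, so the memory cost of the cross join is unchanged.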

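Is there a faster way to solve this kind of problem with Pandas? Or do I have no choice but to parallelize the iterations?

Edit:

As the size of the DataFrames grows, the second option gives better running times, but it still takes several seconds to finish.

Thanks in advance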

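【Comments】:

Your question is well posed; the only thing missing is the two DataFrames built with pd.DataFrame, so that we can run the code right away and get the same results as you. Right now your columns are truncated, so the problem is not reproducible.
You probably want something like df1.corrwith(df2, axis = 1), but data would help.
Thanks for your interest! The source is a huge CSV containing technical information. I edited the question to add a reproducible example.
Which part of option 2 is slow: df_crossjoin or compute_pearson?
I'm not quite sure. Perhaps doing a cross join followed by a linear operation such as apply makes things (naturally) slower than going over the whole dataset just once. Since Pandas does not parallelize, the loop implementation is a viable alternative.

Tags: python python-3.x pandas

【Solution 1】:

Here is another approach that uses the same cross join but relies on the built-in pandas method DataFrame.corrwith and on scipy.stats.ttest_ind. Since we use a less "loopy" implementation, this should perform better.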

from scipy.stats import ttest_ind

mrg = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')

x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])

df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))

df['p_value'] = ttest_ind(x, y, axis=1)[1]

      userID     petID  Correlation   p_value
0       John  Firulais     1.000000  1.000000
1       John    Connie     0.641240  0.158341
2       John   PopCorn     0.661040  0.048041
3    Charles  Firulais     0.641240  0.158341
4    Charles    Connie     1.000000  1.000000
5    Charles   PopCorn     0.999660  0.020211
6   Genarito  Firulais     0.661040  0.048041
7   Genarito    Connie     0.999660  0.020211
8   Genarito   PopCorn     1.000000  1.000000
9       Mark  Firulais    -0.682794  0.006080
10      Mark    Connie    -0.998462  0.003865
11      Mark   PopCorn    -0.999569  0.070639
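Note that scipy.stats.ttest_ind runs a two-sample t-test on the means, which is a different hypothesis from the one behind the p-value returned by pearsonr, so the p_value column above is not comparable with the output of options 1 and 2. As a hedged alternative, the sketch below computes every pairwise Pearson coefficient with one matrix product and derives the two-sided p-value from the t distribution with n - 2 degrees of freedom. The function name `pairwise_pearson` and the random sample data are assumptions made for illustration; the sketch also assumes more than two attributes per row and no constant rows.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import pearsonr


def pairwise_pearson(df1, df2):
    """Pearson r and two-sided p-value for every (df1 row, df2 row) pair."""
    a = df1.to_numpy(dtype=float)
    b = df2.to_numpy(dtype=float)
    n = a.shape[1]  # number of attributes per row, assumed > 2

    # Center each row and scale it to unit length; the dot product of two
    # such rows is their Pearson correlation coefficient.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    r = np.clip(a @ b.T, -1.0, 1.0)

    # Under the null hypothesis, t = r * sqrt((n - 2) / (1 - r^2)) follows a
    # t distribution with n - 2 degrees of freedom. |r| == 1 gives t = inf
    # and therefore p = 0.
    with np.errstate(divide="ignore"):
        t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2.0 * stats.t.sf(np.abs(t), df=n - 2)

    # from_product is df1-major, which matches the row-major ravel() of r.
    index = pd.MultiIndex.from_product([df1.index, df2.index])
    return pd.DataFrame(
        {"Correlation": r.ravel(), "p-value": p.ravel()}, index=index
    ).reset_index()


# Random sample data, only to exercise the function.
rng = np.random.default_rng(0)
users = pd.DataFrame(
    rng.normal(size=(4, 6)),
    index=pd.Index(["John", "Charles", "Genarito", "Mark"], name="userID"),
)
pets = pd.DataFrame(
    rng.normal(size=(3, 6)),
    index=pd.Index(["Firulais", "Connie", "PopCorn"], name="petID"),
)
result = pairwise_pearson(users, pets)
print(result)
```

The intermediate arrays have one entry per pair rather than one full row per pair, so this avoids materializing the cross join.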

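【Comments】:

Thanks for your answer. Unfortunately it throws an error: None of [Index(['geneID', 'petID'], dtype='object')] are in the [columns]
Check the column names in the mrg DataFrame.
Sorry! My mistake. I have edited the question; I forgot to set the index. userID and petID are indexes, not columns. Don't worry, I'll try to figure out how to adapt your code to this particular case. Thanks!
Unfortunately, with a cross join between 2 DataFrames of 100 rows each, this is the slowest option: Option 1 -> 3.4234 seconds. Option 2 -> 2.3864. This option -> 9.8907. Thanks anyway!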

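【Solution 2】:

Of all the alternatives I tested, the one that gave me the best results was the following:

The product is iterated with itertools.product().
All the iterations over the two iterrows are run in parallel processes (using the map function).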
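What itertools.product yields over the two iterrows() iterators can be sketched without Cython or a process pool. Each item is a pair of (index label, row Series) tuples, which is the shape a worker function has to unpack. The name `compute_row` is an illustrative stand-in, and the built-in map is used here only so that the sketch stays single-process.

```python
import itertools

import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')


def compute_row(pair):
    # Each element of the product is ((user, Series), (pet, Series)).
    (user, user_values), (pet, pet_values) = pair
    r, p = pearsonr(user_values.to_numpy(), pet_values.to_numpy())
    return user, pet, r, p


# product() first reads both iterators completely, then yields every
# combination, with the first iterable varying slowest.
pairs = itertools.product(df1.iterrows(), df2.iterrows())
rows = list(map(compute_row, pairs))

result = pd.DataFrame(rows, columns=['userID', 'petID', 'Correlation', 'p-value'])
print(result)
```

multiprocessing.Pool.map takes the function and the iterable in the same positions as the built-in map, so the same worker can be handed to a pool, provided it is defined at module level so that it can be pickled.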

To improve performance, the function compute_row_cython is compiled with Cython, as recommended in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally I compiled it with: python setup.py build_ext --inplace

The final code then looks like this:

import itertools
import multiprocessing
from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()

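Neither Dask nor the merge function with apply gave me better results, not even when optimizing the apply with Cython. In fact, both of those alternatives gave me memory errors. When implementing the solution with Dask I had to generate several partitions, which degraded performance because it had to perform many I/O operations.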
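One way to keep memory bounded without Dask, offered here only as a sketch and not as something the answer tested, is to process the first DataFrame in row blocks and correlate each block against the whole second DataFrame. The names `correlation_chunks`, `_unit_rows` and `chunk_size` are hypothetical, the sample data is random, and only the coefficient is computed.

```python
import numpy as np
import pandas as pd


def _unit_rows(values):
    # Center each row and scale it to unit length (assumes no constant rows).
    centered = values - values.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def correlation_chunks(df1, df2, chunk_size=1000):
    """Yield Pearson correlations for blocks of df1 rows against all of df2."""
    right = _unit_rows(df2.to_numpy(dtype=float))
    for start in range(0, len(df1), chunk_size):
        block = df1.iloc[start:start + chunk_size]
        left = _unit_rows(block.to_numpy(dtype=float))
        r = left @ right.T  # shape: (rows in block, rows in df2)
        index = pd.MultiIndex.from_product([block.index, df2.index])
        yield pd.Series(r.ravel(), index=index, name="Correlation")


# Random sample data, only to exercise the generator.
rng = np.random.default_rng(1)
users = pd.DataFrame(
    rng.normal(size=(5, 8)),
    index=pd.Index([f"u{i}" for i in range(5)], name="userID"),
)
pets = pd.DataFrame(
    rng.normal(size=(3, 8)),
    index=pd.Index([f"p{i}" for i in range(3)], name="petID"),
)

for chunk in correlation_chunks(users, pets, chunk_size=2):
    # Each chunk could be filtered or appended to a file before the next
    # one is computed, so only one block of results is in memory at a time.
    print(chunk.head(2))
```

At most chunk_size times len(df2) results exist at once, so the block size can be tuned to the available memory.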

The Dask solution can be found in my other question.

【Comments】: