Comparing records of a pandas DataFrame

【Question title】: Comparing records of a pandas DataFrame
【Posted】: 2022-06-11 03:35:50
【Question】:

Given the following dataframe:

import numpy as np
import pandas as pd

# 100 rows, four columns of random integers in [10, 25)
df = pd.DataFrame(zip(*[np.random.randint(10, 25, size=100),
                        np.random.randint(10, 25, size=100),
                        np.random.randint(10, 25, size=100),
                        np.random.randint(10, 25, size=100)]),
                  columns=list('ABCD'))

I need the most efficient (fastest) way to do the following:

dd = df.to_dict(orient='index')

for k,v in dd.items():
    v['test'] = len([z['A'] for y,z in dd.items() 
                     if v['A'] > z['A']+3 
                     if v['B'] < z['B']/2])
    
pd.DataFrame.from_dict(dd,orient='index')

This code works, but it takes a very long time on a df with more than 100k rows. Is there a faster way to get the same result?

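【Comments】:

Can you explain in words what you are trying to do? Looking at the code, you are effectively doing a Cartesian comparison, so for 100k rows you will loop 10,000,000,000 (10 billion) times...
For each record in this dataframe, I want to know how many records satisfy the conditions above. So for row 1, how many records in the dataframe meet those conditions, and so on for every other row.
You want to use df.apply(). If you state your conditional logic in plain English, it will be easier to help.
Yes, please provide the expected output, because I don't "get" the logic behind the result I see after running your code.
For the first record, if the 'test' column equals 10, it means there are 10 records in the whole dataframe (1) whose A value + 3 is lower than the first record's A value, and (2) whose B value / 2 is greater than the first record's B value.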
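
To make the condition described in the last comment concrete, here is a minimal sketch on a hypothetical four-row frame (the values and the name `toy` are made up for illustration and do not come from the question). For each row it counts the rows whose A + 3 is below that row's A and whose B / 2 is above that row's B, using NumPy broadcasting instead of the dict loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so that a few rows satisfy the condition.
toy = pd.DataFrame({"A": [20, 10, 12, 24], "B": [5, 12, 30, 4]})

a = toy["A"].to_numpy()
b = toy["B"].to_numpy()

# match[i, j] is True when row j satisfies the condition relative to row i:
#   A_i > A_j + 3  and  B_i < B_j / 2
match = (a[:, None] > a[None, :] + 3) & (b[:, None] < b[None, :] / 2)

# Counting the True entries in each row of the matrix gives the 'test' column.
toy["test"] = match.sum(axis=1)
print(toy)
```

Rows 0 and 3 each get a count of 2 here, and rows 1 and 2 get 0. The n-by-n boolean matrix needs memory quadratic in the number of rows, so this form is only meant for checking the logic on small frames, not for the 100k-row case.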

Tags: python pandas dataframe dictionary aggregate

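【Solution 1】:

You are comparing every item in a column with every other element, which is expensive: the cost is quadratic in the number of rows. We can do this in pandas rather than with Python dicts, as shown below. This is not an algorithmic improvement, so it can still be slow, but it should be faster by a large constant factor.

As your question is written, if you have thousands of rows, handling duplicates is the biggest improvement you can make.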

import pandas as pd
import numpy as np

size = 10000
df = pd.DataFrame(zip(*[np.random.randint(10, 25, size=size), 
                        np.random.randint(10, 25, size=size), 
                        np.random.randint(10, 25, size=size),
                        np.random.randint(10, 25, size=size)]), 
                  columns=list('ABCD'))


cols = ['A', 'B']

def conditional(row):
    return ((row.A > df['A'] + 3) & (row.B < df['B'] / 2)).sum()

# Use drop duplicates to deduplicate computation - only once for each A, B combination
# Use assign then apply to create a new column with the result of the
# conditional.
# test_counts has columns A, B, test.
test_counts = (
    df[cols].drop_duplicates()
    .assign(test=lambda dcols: dcols.apply(conditional, axis=1))
)

# Use merge and set_index to copy the deduplicated results
# to each occurrence of that A, B combination.
# set_index is for preserving order, remembering it from before the merge.
df2 = (pd.merge(df.reset_index(), test_counts, on=cols)
   .set_index('index').sort_index())

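Note that the data in this particular answer has a limited number of possible values (as in your question), so we do not get quadratic complexity here: the duplicates cut the work down. With other data, that may change.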
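
A quick way to see why deduplication helps so much with this data: with integers drawn from 10 to 24, A and B can each take at most 15 values, so there are at most 15 * 15 = 225 distinct (A, B) pairs no matter how many rows there are. The sketch below assumes data generated the same way as in the question; the seed and the names `demo` and `n_unique` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

for size in (100, 10_000, 100_000):
    demo = pd.DataFrame({
        "A": rng.integers(10, 25, size=size),
        "B": rng.integers(10, 25, size=size),
    })
    # Number of distinct (A, B) pairs, i.e. how many times the
    # per-row conditional would be called after drop_duplicates().
    n_unique = len(demo.drop_duplicates())
    print(size, n_unique)
```

The number of calls is capped at 225 however large `size` gets. If A and B were floats or had a wide integer range, the number of distinct pairs would grow with the row count and the quadratic cost would come back.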

Edited to add

If we look closely at the conditional that we call once per row, we can make it do the same thing with less work:
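
The snippet for this edit is missing from this copy of the answer. One possible reading, offered as an assumption rather than as the author's code, is to compare each distinct (A, B) pair against the other distinct pairs, weighted by how often each occurs, instead of against every row of df. The names `pair_counts` and `conditional_weighted` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
size = 10_000
df = pd.DataFrame(rng.integers(10, 25, size=(size, 4)), columns=list("ABCD"))

cols = ["A", "B"]

# One row per distinct (A, B) pair, with the number of times it occurs in df.
pair_counts = df.groupby(cols).size().reset_index(name="n")

def conditional_weighted(row):
    # Same condition as before, but evaluated against the distinct pairs only.
    # Summing the occurrence counts of the matching pairs gives the number
    # of matching rows in the full dataframe.
    mask = (row.A > pair_counts["A"] + 3) & (row.B < pair_counts["B"] / 2)
    return pair_counts.loc[mask, "n"].sum()

# test_counts has columns A, B, test.
test_counts = pair_counts.assign(
    test=pair_counts.apply(conditional_weighted, axis=1)
).drop(columns="n")

# Copy the per-pair result back to every occurrence, preserving row order.
df2 = (pd.merge(df.reset_index(), test_counts, on=cols)
       .set_index("index").sort_index())
```

With this variant each call scans at most as many entries as there are distinct pairs, rather than all rows of df, so the total work depends on the number of distinct pairs squared plus one grouping pass over the data.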

【Comments】: