Adding a 'counts' column to a Pandas DataFrame: answers

【Question title】: Adding 'counts' column to Pandas DataFrame
【Posted】: 2020-12-03 06:41:41
【Question】:

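I have two DataFrames, which I'll call frame1 and frame2. frame1 is the smaller frame and has an id column in which every id is unique. frame2 is larger and has the same id column, but many ids are repeated. However, the number of unique ids in frame2 equals the number of rows in frame1; in other words, every id in frame2 also exists in frame1.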

What I want to do is add a "counts" column to frame1 containing the number of unique metric categories associated with each id in frame2.

Here is what the frames look like:

[example tables not included]

So, I want to add a "unique_metric_counts" column to frame1 that holds 3 for id 1, 2 for id 2, and 1 for id 3.

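I have a decent idea of how to do this. The problem is that I believe it uses a large amount of memory in my Jupyter Notebook and never finishes running, because it is a for loop over a very large frame, and inside the loop I build a temporary frame from the larger frame.

Here is my code: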

frame1['unique_metric_counts'] = None

for x in frame1['id']:
    tempframe = frame2[frame2['id'].isin([x])]
    numUnique = tempframe['metric categories'].nunique()
    frame1['unique_metric_counts'] = numUnique

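I'm fairly sure this code should work in theory, but my dataframes are absolutely massive, and I think it would be better to rely on Pandas functionality alone rather than a for loop. Any help is greatly appreciated.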

【Comments】:

Tags: python pandas dataframe

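【Solution 1】:

Here is one approach: build a lookup from the 'metric_categories' values in df2 (standing in for frame2), then map it onto the rows of df1 (standing in for frame1) whose ids appear among the lookup keys: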

import pandas as pd

# Create dataframes

ids = [1,2,3,4,5]
feat = ['aa','bb','cc','dd','ee']

df1 = pd.DataFrame({'id':ids, 'some_feature':feat})

ids = [1,1,1,2,2,2,2,3,3,3]
feat = ['aa','aa','aa','bb','bb','bb','bb','cc','cc','cc']
metrics = ['metric x', 'metric y', 'metric z','metric x', 'metric x','metric z', 'metric z','metric x', 'metric x', 'metric x']

df2 = pd.DataFrame({'id':ids, 'some_feature':feat, 'metric_categories':metrics})


# Create mapping (a dictionary whose keys are the 'id' values and whose values are the number of distinct 'metric_categories' for that id):
 
mapping = df2.groupby('id')['metric_categories'].agg(lambda x: len(set(x))).to_dict()

mapping
# >>> {1: 3, 2: 2, 3: 1}


# Map dictionary to 'id' column in df1:

df1['unique_metric_counts'] = df1['id'].map(mapping)

print(df1)

#    id     some_feature    unique_metric_counts
# 0   1           aa                   3.0
# 1   2           bb                   2.0
# 2   3           cc                   1.0
# 3   4           dd                   NaN
# 4   5           ee                   NaN
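
The lookup above can probably be built without a Python lambda. The sketch below is a possible variation rather than the answerer's code: it assumes the same example data, uses the built-in `nunique()` aggregation, and passes the resulting Series straight to `map`, which looks values up by the Series index. One difference to keep in mind is that `nunique()` ignores NaN by default, whereas `len(set(x))` would count it.

```python
import pandas as pd

# Same example data as above (the 'some_feature' column of df2 is omitted for brevity)
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'some_feature': ['aa', 'bb', 'cc', 'dd', 'ee']})
df2 = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    'metric_categories': ['metric x', 'metric y', 'metric z',
                          'metric x', 'metric x', 'metric z', 'metric z',
                          'metric x', 'metric x', 'metric x'],
})

# Series indexed by id; each value is the number of distinct categories for that id
unique_counts = df2.groupby('id')['metric_categories'].nunique()

# Series.map accepts a Series and matches on its index; ids absent from df2 become NaN
df1['unique_metric_counts'] = df1['id'].map(unique_counts)

print(df1)
```

Both versions do the grouping in a single pass over df2 instead of filtering df2 once per id, which is where the loop in the question spends its time. Note also that the loop assigns `numUnique` to the entire column on every pass, so even if it finished, every row would hold the count for the last id.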

【Discussion】:

If every id in df1 has a key in the dictionary, you won't get NaN in unique_metric_counts, and pandas will then infer the dtype as int (rather than float, as shown above).
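
When some ids really are missing from the lookup, there are at least two ways to end up with an integer column. This is a minimal sketch, assuming the same `mapping` as in the answer; the variable names `counts_zero` and `counts_nullable` are only for illustration, and which option fits depends on whether a missing id should mean zero categories or "unknown".

```python
import pandas as pd

ids = pd.Series([1, 2, 3, 4, 5])
mapping = {1: 3, 2: 2, 3: 1}  # same lookup as in the answer

# Option A: treat an id with no entry as having zero categories
counts_zero = ids.map(mapping).fillna(0).astype(int)

# Option B: keep the gaps, using pandas' nullable integer dtype (missing shows as <NA>)
counts_nullable = ids.map(mapping).astype('Int64')

print(counts_zero.tolist())
print(counts_nullable.dtype)
```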