Count values in a pandas DataFrame and use those values to create a sub-DataFrame (answers)

【Question title】: Count values in a pandas dataframe and use those values for creating a subdataframe
【Posted】: 2022-01-20 19:21:59
【Question】:

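I have a pandas DataFrame. I want to count all the values in one column to find out which of them are repeated. Then I want to extract only the repeated values and use them to create a sub-DataFrame.

Here is an example. Say this is my DataFrame: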

df =

    type        color       name
0   fruit       red         apple
1   fruit       yellow      banana
2   meat        brown       steak
3   fruit       green       apple
4   fruit       orange      orange
5   veg         orange      carrot
6   fruit       yellow      apple
7   meat        brown       steak
8   veg         orange      carrot

I want to know whether the "name" column contains any repeated values. For that I use this line of code:

df['name'].value_counts().loc[lambda x : x>1]

This is what I get:

apple   3
steak   2
carrot  2

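Then I want to create a sub-DataFrame by filtering the "name" column on "apple", "steak" and "carrot", so that I can see the values in the other columns that go with them. Presumably this can be done with a suitable function.

The desired output is: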

sub_df =

    type        color       name
0   fruit       red         apple
1   fruit       green       apple
2   fruit       yellow      apple
3   meat        steak       brown
4   meat        steak       brown
5   veg         orange      carrot
6   veg         orange      carrot

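I have tried several variants of the code without success. I think the problem is in how I use value_counts(): it gives me a pandas Series of occurrence counts, and I do not see how to access the values that were counted.

Any suggestions?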

【Comments】:

Tags: python pandas distinct-values

【Solution 1】:

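You do not need to do this in two steps. Here is how to get the final result with groupby and filter, keeping only the groups of names that contain more than one row: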

df.groupby('name').filter(lambda g: g['type'].count() > 1).sort_values('name')

Output:


    type    color   name
0   fruit   red     apple
3   fruit   green   apple
6   fruit   yellow  apple
5   veg     orange  carrot
8   veg     orange  carrot
2   meat    brown   steak
7   meat    brown   steak
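
For readers comparing approaches, here is a minimal sketch that rebuilds the question's table and checks the groupby/filter line against a different technique: a boolean mask from Series.duplicated(keep=False). The mask version is not part of this answer. It is an alternative that should select the same rows, assuming "repeated" means the name occurs more than once. The variable names by_group, by_mask and by_name are illustrative only.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [
        ["fruit", "red", "apple"],
        ["fruit", "yellow", "banana"],
        ["meat", "brown", "steak"],
        ["fruit", "green", "apple"],
        ["fruit", "orange", "orange"],
        ["veg", "orange", "carrot"],
        ["fruit", "yellow", "apple"],
        ["meat", "brown", "steak"],
        ["veg", "orange", "carrot"],
    ],
    columns=["type", "color", "name"],
)

# groupby/filter keeps every row of each group for which the predicate is True.
by_group = df.groupby("name").filter(lambda g: len(g) > 1)

# Alternative (not from the answer): duplicated(keep=False) flags every
# occurrence of a value that appears more than once in the column.
by_mask = df[df["name"].duplicated(keep=False)]

# A stable sort keeps the original row order within each name.
by_name = by_mask.sort_values("name", kind="mergesort")
print(by_name)
```

The mask avoids calling a Python function once per group, which may matter on large frames; groupby/filter is the more general tool when the condition depends on more than the group size.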

【Discussion】:

Thank you. Done in one step! :)

【Solution 2】:

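Next time, please provide better test data (data that can be copied and pasted).

I think your desired output is wrong, because it has a "steak" value in the color column.

I tried the following, which should do what you need. I assume you understand the rest of the code; the only line I added is: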

df[df["name"].isin(y.index.tolist())]

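It checks (isin) each value in the DataFrame's name column against the index values of the Series y, which are the repeated names. If you want a completely new DataFrame with its own index, append .reset_index(drop=True) to the line above; a plain .reset_index() also renumbers the rows but keeps the old index as an extra column.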

import pandas as pd

df = pd.DataFrame([
    ["fruit", "red", "apple"],
    ["fruit", "yellow", "banana"],
    ["meat", "brown", "steak"],
    ["fruit", "green", "apple"],
    ["fruit", "orange", "orange"],
    ["veg", "orange", "carrot"],
    ["fruit", "yellow", "apple"],
    ["meat", "brown", "steak"],
    ["veg", "orange", "carrot"]
],
    columns=["type", "color", "name"])

print(df)

# Count each name and keep only the names that occur more than once.
y = df['name'].value_counts().loc[lambda x: x > 1]

print(y)

# Keep the rows whose name is one of the repeated names (stored in y's index).
df_2 = df[df["name"].isin(y.index.tolist())]

print(df_2)

Output:

    type   color    name
0  fruit     red   apple
1  fruit  yellow  banana
2   meat   brown   steak
3  fruit   green   apple
4  fruit  orange  orange
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
apple     3
steak     2
carrot    2
Name: name, dtype: int64
    type   color    name
0  fruit     red   apple
2   meat   brown   steak
3  fruit   green   apple
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
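
The two points made in this answer, that the counted values sit in the index of the value_counts() result and that the filtered frame can be given a fresh index, can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the names counts, repeated, names and sub_df are made up here, and drop=True is used so that the old index is discarded rather than kept as a column.

```python
import pandas as pd

# Same data as in the question, built from a dict of columns.
df = pd.DataFrame(
    {
        "type": ["fruit", "fruit", "meat", "fruit", "fruit", "veg", "fruit", "meat", "veg"],
        "color": ["red", "yellow", "brown", "green", "orange", "orange", "yellow", "brown", "orange"],
        "name": ["apple", "banana", "steak", "apple", "orange", "carrot", "apple", "steak", "carrot"],
    }
)

# value_counts() returns a Series: the index holds the distinct names,
# the values hold how often each name occurs.
counts = df["name"].value_counts()
repeated = counts[counts > 1]

# The repeated names themselves are read from the index.
names = repeated.index.tolist()

# Filter on those names, then renumber the rows from 0.
sub_df = df[df["name"].isin(names)].reset_index(drop=True)
print(sub_df)
```

The rows keep their original relative order; add a sort_values("name") before the reset if the rows should be grouped by name as in the question's desired output.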

【Discussion】:

Thank you very much, the output is correct. Sorry about the data I provided: next time I will provide better. Thanks indeed.