Compare values in 2 columns and output the result in a third column in pandas

【Question title】: Compare values in 2 columns and output the result in a third column in pandas
【Posted】: 2016-05-31 07:19:06
【Question】:

My data looks like the table below, and I am trying to create an output column from the given values.

      a_id b_received c_consumed
  0    sam       soap        oil
  1    sam        oil        NaN
  2    sam      brush       soap
  3  harry        oil      shoes
  4  harry      shoes        oil
  5  alice       beer       eggs
  6  alice      brush      brush
  7  alice       eggs        NaN

The code to generate the dataset is

import pandas as pd

# Note: 'NaN' here is a literal string produced by split(), not a missing value
df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

I want a new column called output that looks like this

      a_id b_received c_consumed  output
  0    sam       soap        oil       1
  1    sam        oil        NaN       1
  2    sam      brush       soap       0
  3  harry        oil      shoes       1
  4  harry      shoes        oil       1
  5  alice       beer       eggs       0
  6  alice      brush      brush       1
  7  alice       eggs        NaN       1

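So the search works like this: sam received soap, oil and brush, so look in the consumed column for the products he consumed. Since he consumed soap, the output for soap is 1, but since he did not consume brush, the output for brush is 0.

Similarly, harry received oil and shoes, so look for oil and shoes in his consumed column; since oil was consumed, its output is 1.

To be clearer, the output value corresponds to the first column (received) and depends on whether that value is present in the second column (consumed) for the same person.

I tried using this code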

a = []
for i in range(len(df.b_received)):
    # Note: this compares against c_consumed across ALL rows, not just this person's rows
    if any(df.c_consumed == df.b_received[i]):
        a.append(1)
    else:
        a.append(0)

df['output'] = a

This gave me the output

       a_id b_received c_consumed  output
  0    sam       soap        oil       1
  1    sam        oil        NaN       1
  2    sam      brush       soap       1
  3  harry        oil      shoes       1
  4  harry      shoes        oil       1
  5  alice       beer       eggs       0
  6  alice      brush      brush       1
  7  alice       eggs        NaN       1

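The problem is that since sam did not consume brush, the output for that row should be 0, but it is 1 because brush was consumed by another person (alice). I need to make sure that doesn't happen. The output needs to be specific to each person's consumption.

I know this is confusing, so if I haven't made it clear, please feel free to ask and I will respond to your comments.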
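A small check, assuming the same df as above, shows where the unwanted 1 comes from. The variable names global_hit, sam_hit and consumers are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

# Global check (what the loop above does): did anyone at all consume 'brush'?
global_hit = bool((df['c_consumed'] == 'brush').any())

# Per-person check (what is wanted): did sam consume 'brush'?
sam_hit = bool((df.loc[df['a_id'] == 'sam', 'c_consumed'] == 'brush').any())

# Who actually consumed 'brush'?
consumers = df.loc[df['c_consumed'] == 'brush', 'a_id'].tolist()

print(global_hit, sam_hit, consumers)  # True False ['alice']
```

The global comparison matches alice's row, which is why sam's brush row ends up as 1.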

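【Comments】:

You should include the code you've written so far to accomplish this. Including code would also be helpful so that others can copy and paste it to create the dataframe.
Also, does it matter who consumed the object?
Okay, I've added the code to reproduce the dataset. And yes, it matters who consumed it, because in a later step I want to calculate the probability of each user consuming a received item in the future. If that weren't the case, I would have simply used a "lookup" function.

Tags: python pandas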

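【Solution 1】:

The key is pandas.Series.isin(), which checks, for each element of the calling Series, whether it is a member of the object passed to isin(). You want to check each element of b_received for membership in c_consumed, but only within each group defined by a_id. When groupby is used with apply, pandas indexes the result by the grouping variable as well as the original index. In your case you don't need the grouping variable in the index, so you can drop that level with reset_index using drop=True, which restores the original index.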
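As a quick illustration of the membership check on its own, here is a minimal sketch restricted to sam's rows. It assumes the df from the question; the names sam and mask are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

# Select one group by hand
sam = df[df['a_id'] == 'sam']

# For each item sam received, is it among the items sam consumed?
mask = sam['b_received'].isin(sam['c_consumed'])
print(mask.tolist())  # [True, True, False]
```

The groupby version does this same check once per a_id group.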

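# Within each a_id group, flag received items that also appear in that group's c_consumed
df['output'] = (df.groupby('a_id')
                  .apply(lambda x: x['b_received'].isin(x['c_consumed']).astype('i4'))
                  # drop the a_id index level so the result aligns with df's original index
                  .reset_index(level='a_id', drop=True))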

Your DataFrame is now...

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

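See the pandas documentation on split-apply-combine for a more thorough explanation.

【Comments】:

Thank you, I will look into the split-apply-combine approach; it looks like I'll have more use for it in the future.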
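If you would rather avoid apply, one possible alternative (not part of the approach above, just a sketch under the same assumptions about df) is a left merge of the (a_id, b_received) pairs against the distinct (a_id, c_consumed) pairs. The names consumed and merged are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

# Distinct (person, consumed item) pairs, renamed so they can be joined on b_received
consumed = (df[['a_id', 'c_consumed']]
            .drop_duplicates()
            .rename(columns={'c_consumed': 'b_received'}))

# A left merge keeps every row of df in order; indicator=True adds a '_merge' column
merged = df.merge(consumed, on=['a_id', 'b_received'], how='left', indicator=True)

# 'both' means the person also consumed the item they received
df['output'] = (merged['_merge'] == 'both').astype(int).to_numpy()
print(df['output'].tolist())  # [1, 1, 0, 1, 1, 0, 1, 1]
```

Deduplicating the right-hand side first matters here, since repeated pairs would otherwise multiply rows in the merge.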

【Solution 2】:

This should work, although the ideal approach is the one given by JaminSore:

df['output'] = 0

for name in df['a_id'].unique():
    # Rows and consumed items for this person only
    group = df.loc[df['a_id'] == name]
    consumed = group['c_consumed'].values
    for idx, row in group.iterrows():
        # .ix was removed in pandas 1.0; assign by label with .loc instead
        df.loc[idx, 'output'] = int(row['b_received'] in consumed)

The DataFrame is now

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1
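The same per-person lookup can also be sketched without nested loops by collecting the (person, consumed item) pairs into a set first. This is only an illustrative variant, assuming the df from the question; consumed_pairs is a made-up name:

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

# Every (person, item) pair that was actually consumed
consumed_pairs = set(zip(df['a_id'], df['c_consumed']))

# 1 if the person consumed the item they received, else 0
df['output'] = [int((person, item) in consumed_pairs)
                for person, item in zip(df['a_id'], df['b_received'])]
print(df['output'].tolist())  # [1, 1, 0, 1, 1, 0, 1, 1]
```

Set membership keeps each lookup cheap, so this makes a single pass over the rows rather than re-filtering the frame for every person.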

【Comments】: