Python Pandas: create variable for unique combinations of 2 categorical variables? Answers

【Question title】: Python Pandas: create variable for unique combinations of 2 categorical variables?
【Posted】: 2020-10-16 17:19:33
【Question description】:

Suppose I have some data:

import pandas as pd

df = pd.DataFrame({'location':['store','online','store','online','online'],
                  'item': ['apple','apple','orange','orange','orange']})
df
>>>

  location    item
0    store   apple
1   online   apple
2    store  orange
3   online  orange
4   online  orange

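You'll notice there are four possible combinations of the two variables: (store, apple), (online, apple), (store, orange), (online, orange). I want to assign a single dummy-variable column. My naive approach creates four dummy variables, whereas I want one label column: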

pd.get_dummies(df,['location','item'])
>>>

   location_online  location_store  item_apple  item_orange
0                0               1           1            0
1                1               0           1            0
2                0               1           0            1
3                1               0           0            1
4                1               0           0            1

whereas I would like it to look like this:

df 
>>>
  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3

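Note that in the first four rows dummy equals the index only because each of those rows is a distinct combination. This obviously does not generalize.

Edit 1: the above was edited in response to the comments. Edit 2: I added a 5th row to illustrate that a row can be repeated; a repeated row should get the same dummy/combination as its duplicate.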

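【Comments】:

Which dummy goes with which pair? For example, do you want online-orange paired with 3, and so on?
Why can't you use the index itself?
@student, no, for the reason I mentioned. What if there are more than 4 rows?
Never mind, I just saw the edit to the question.
@IoaTzimas, good point! I added a column with the combination name. A dictionary would work too, though.

Tags: python pandas combinations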

【Solution 1】:

Let's try agg:

df['combination'] = df[['location','item']].agg(tuple, axis=1)
df['dummy'] = df['combination'].factorize()[0]

Output:

  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3
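For reference, factorize returns a pair: the integer codes and the unique values in order of first appearance. The sketch below, which assumes the same df as in the question, shows one way that pair could be turned into the lookup dictionary mentioned in the comments. The name lookup is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

# One tuple per row, built the same way as in the answer above.
combination = df[['location', 'item']].agg(tuple, axis=1)

# factorize returns (codes, uniques); codes are numbered by first appearance.
codes, uniques = pd.factorize(combination)

# Hypothetical helper dict mapping each combination to its dummy code.
lookup = {combo: code for code, combo in enumerate(uniques)}

print(codes.tolist())
print(lookup)
```

A dictionary like this could be reused to encode new rows consistently, as long as they only contain combinations that were already seen.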

If you don't need the combination column, you can use groupby.ngroup():

df['dummy'] = df.groupby(['location','item'], sort=False).ngroup()

Output:

  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
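The sort=False argument matters for the numbering. A small sketch on the same data compares it with the default sort=True, which numbers groups by sorted key order rather than by first appearance. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

# sort=False: groups are numbered in the order they first appear in df.
first_seen = df.groupby(['location', 'item'], sort=False).ngroup()

# Default sort=True: groups are numbered by sorted (location, item) keys.
sorted_keys = df.groupby(['location', 'item']).ngroup()

print(first_seen.tolist())   # [0, 1, 2, 3, 3]
print(sorted_keys.tolist())  # [2, 0, 3, 1, 1]
```

Either numbering gives duplicate rows the same label; the two differ only in which integer each combination receives.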

【Discussion】:

@IoaTzimas I updated my answer.

【Solution 2】:

Let's create the combinations by concatenating location and item, then use factorize to encode those combinations and obtain the dummy variable:

df['combination'] = df['location'].add(', ' + df['item'])
df['dummy'] = df['combination'].factorize()[0]

  location    item     combination  dummy
0    store   apple    store, apple      0
1   online   apple   online, apple      1
2    store  orange   store, orange      2
3   online  orange  online, orange      3
4   online  orange  online, orange      3

【Discussion】:

Awesome! This seems faster than the .apply() approach. Perhaps .add() is vectorized...? (That would be great!)
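On the speed question in the comment above: .add() operates on whole columns at once, while .apply(..., axis=1) calls a Python function once per row. Whether that translates into a speedup on your data is best measured. The following is a rough sketch only; the frame big and the helper names with_add and with_apply are made up for illustration, and timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

# Hypothetical larger frame: the question's 5 rows repeated 2,000 times.
big = pd.concat([df] * 2000, ignore_index=True)

def with_add(frame):
    # Column-wise string concatenation, as in the answer above.
    return frame['location'].add(', ' + frame['item'])

def with_apply(frame):
    # Row-wise alternative: one Python call per row.
    return frame.apply(lambda row: row['location'] + ', ' + row['item'], axis=1)

# Both approaches should produce the same strings.
same = with_add(big).tolist() == with_apply(big).tolist()

t_add = timeit.timeit(lambda: with_add(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f"same result: {same}  add: {t_add:.4f}s  apply: {t_apply:.4f}s")
```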

【Solution 3】:

You can apply a lambda function across the first 2 columns; see below. d is a dictionary that holds the dummy for each pair.

d={('store', 'apple'):0, ('online', 'apple'):1, ('store', 'orange'):2, ('online', 'orange'):3}

def f(x,y):
    return d[(x,y)]

df['dummy'] = df[['location','item']].apply(lambda x: f(*x), axis=1)

>>> print(df)
  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
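Writing d by hand does not scale beyond a few pairs. The sketch below shows one possible way to derive the same dictionary from the data, assuming codes should follow the order of first appearance. It uses a plain list comprehension instead of apply, which is a swapped-in choice and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

# Unique (location, item) pairs as plain tuples, in order of first appearance.
pairs = df[['location', 'item']].drop_duplicates().itertuples(index=False, name=None)

# Same shape as the hand-written d, but built from the data.
d = {pair: code for code, pair in enumerate(pairs)}

# Look up each row's pair in d.
df['dummy'] = [d[pair] for pair in zip(df['location'], df['item'])]

print(d)
print(df)
```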

【Discussion】: