Add numeric column to pandas dataframe based on other textual column [duplicate]: answers

【Question title】: Add numeric column to pandas dataframe based on other textual column [duplicate]
【Posted】: 2018-01-06 09:52:20
【Question description】:

I have this dataframe:

import pandas as pd

df = pd.DataFrame([['137', 'earn'], ['158', 'earn'], ['144', 'ship'], ['111', 'trade'], ['132', 'trade']], columns=['value', 'topic'])
print(df)
    value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade

I would like an additional numeric column that looks like this:

    value  topic  topic_id
0   137   earn    0
1   158   earn    0
2   144   ship    1
3   111  trade    2
4   132  trade    2

So basically I want to generate a column that encodes the string column as numeric values. I implemented this solution:

import numpy as np

# Map each distinct topic (in sorted order) to an integer id.
topics_dict = {}
topics = np.unique(df['topic']).tolist()
for i in range(len(topics)):
    topics_dict[topics[i]] = i
df['topic_id'] = [topics_dict[l] for l in df['topic']]

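However, I am quite sure there is a more elegant and Pythonic way to solve this, but I couldn't find anything on Google or SO. I read about pandas' get_dummies, but that creates a separate column for each distinct value in the original column.

Thanks for any help or pointers in the right direction!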
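
For reference, a minimal sketch of what get_dummies produces on the topic column, assuming the same five rows as in the question. It yields one indicator column per distinct topic rather than the single integer column asked for:

```python
import pandas as pd

df = pd.DataFrame({"topic": ["earn", "earn", "ship", "trade", "trade"]})

# One-hot encoding: one indicator column per distinct topic value.
dummies = pd.get_dummies(df["topic"])
print(dummies.columns.tolist())  # ['earn', 'ship', 'trade']
print(dummies.shape)             # (5, 3)
```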

【Question discussion】:

Tags: python pandas

【Solution 1】:

Option 1
pd.factorize

df['topic_id'] = pd.factorize(df.topic)[0]
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
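
As a small sketch of what the `[0]` above is selecting: pd.factorize returns a pair of codes and unique labels, and the codes index into the labels. The Series name `topics` is only an illustrative stand-in for df.topic.

```python
import pandas as pd

topics = pd.Series(["earn", "earn", "ship", "trade", "trade"])

# factorize returns (codes, uniques); each code is a position in uniques.
codes, uniques = pd.factorize(topics)
print(codes.tolist())    # [0, 0, 1, 2, 2]
print(uniques.tolist())  # ['earn', 'ship', 'trade']

# Round trip: recover the original labels from the codes.
restored = uniques.take(codes)
```

Keeping `uniques` around is useful if the ids later need to be turned back into topic names.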

Option 2
np.unique

_, v = np.unique(df.topic, return_inverse=True)
df['topic_id'] = v

df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
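
One difference worth knowing: np.unique numbers the labels in sorted order, while pd.factorize numbers them in order of first appearance by default. On the question's data the two coincide. The sketch below uses hypothetical data where they do not:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which first-appearance order differs from sorted order.
topics = pd.Series(["trade", "earn", "ship", "earn"])

by_appearance = pd.factorize(topics)[0]
_, by_sorted = np.unique(topics, return_inverse=True)
factorize_sorted = pd.factorize(topics, sort=True)[0]

print(by_appearance.tolist())     # [0, 1, 2, 1]
print(by_sorted.tolist())         # [2, 0, 1, 0]
print(factorize_sorted.tolist())  # [2, 0, 1, 0]
```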

Option 3
pd.Categorical

df['topic_id'] = pd.Categorical(df.topic).codes
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
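
A short sketch of how pd.Categorical treats missing values, using a hypothetical column that is not part of the question's data. Missing entries receive the code -1 and do not appear among the categories:

```python
import pandas as pd

# Hypothetical column with a missing entry.
topics = pd.Series(["earn", None, "ship"])

cat = pd.Categorical(topics)
print(cat.codes.tolist())       # [0, -1, 1]
print(cat.categories.tolist())  # ['earn', 'ship']
```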

Option 4
DataFrameGroupBy.ngroup

df['topic_id'] = df.groupby('topic').ngroup()
df

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2
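
By default groupby sorts the group keys, so ngroup numbers topics in sorted order. As a sketch on hypothetical data, passing sort=False to groupby numbers them in order of first appearance instead:

```python
import pandas as pd

# Hypothetical data in which first-appearance order differs from sorted order.
df = pd.DataFrame({"topic": ["trade", "earn", "trade", "ship"]})

sorted_ids = df.groupby("topic").ngroup()
appearance_ids = df.groupby("topic", sort=False).ngroup()

print(sorted_ids.tolist())      # [2, 0, 2, 1]
print(appearance_ids.tolist())  # [0, 1, 0, 2]
```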

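【Discussion】:

Very useful, thanks. I can't upvote because I lack the reputation.
@T.Beck I thought you meant to accept mine first, because that is what you did :-) That's why I pointed out that when you accept one answer and then accept another, the first accept is undone. If you intended to tick that answer, that's fine. But if you intended to tick both, be aware that the first one was undone.
Just got it ;)
df.groupby('topic').ngroup() does not work on Python 3. The error is: AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'
@rnso Update to a newer version.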

【Solution 2】:

You can use

In [63]: df['topic'].astype('category').cat.codes
Out[63]:
0    0
1    0
2    1
3    2
4    2
dtype: int8
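
A sketch of how the codes might be stored as a column together with a lookup table for mapping ids back to labels. The names `topic_cat` and `id_to_topic` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"topic": ["earn", "earn", "ship", "trade", "trade"]})

# Convert once, then reuse the categorical for both codes and labels.
topic_cat = df["topic"].astype("category")
df["topic_id"] = topic_cat.cat.codes

# Lookup table from integer id back to topic label.
id_to_topic = dict(enumerate(topic_cat.cat.categories))
print(id_to_topic)  # {0: 'earn', 1: 'ship', 2: 'trade'}
```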

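【Discussion】:

I had stumbled across categories before, but never thought of simply casting to one. Nice!

【Solution 3】:

We can use the apply function to create a new column based on an existing column, as shown below.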

topic_list = list(df["topic"].unique())
df['topic_id'] = df.apply(lambda row: topic_list.index(row["topic"]), axis=1)
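
Since list.index scans the list for every row, a possible variation is to build a dictionary once and use Series.map. This is a sketch of that alternative, not the answer's own method, and `topic_to_id` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"topic": ["earn", "earn", "ship", "trade", "trade"]})

# Build the label -> id lookup once, in order of first appearance.
topic_to_id = {topic: i for i, topic in enumerate(df["topic"].unique())}

# Look each row's topic up in the dictionary.
df["topic_id"] = df["topic"].map(topic_to_id)
print(df["topic_id"].tolist())  # [0, 0, 1, 2, 2]
```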

【Discussion】:

【Solution 4】:

You can use for loops in a list comprehension to build the list of codes:

ucols = pd.unique(df.topic)
df['topic_id'] = [ j
                for i in range(len(df.topic))
                for j in range(len(ucols))
                if df.topic[i] == ucols[j]  ]
print(df)

Output:

  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

【Discussion】:

【Solution 5】:

Try this code

df['topic_id'] = pd.Series([0,0,1,2,2], index=df.index)

It works well

   value  topic
0   137   earn
1   158   earn
2   144   ship
3   111  trade
4   132  trade
  value  topic  topic_id
0   137   earn         0
1   158   earn         0
2   144   ship         1
3   111  trade         2
4   132  trade         2

【Discussion】:

Good luck if you have a million rows.
We can change the contents of [0,0,1,2,2]; it can be a random sequence or any arbitrary list.
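
To illustrate the scaling concern raised in this discussion, here is a sketch that derives the codes from the data instead of typing them by hand. The one-million-row frame `big` is synthetic and purely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: one million rows drawn from three topics.
rng = np.random.default_rng(0)
big = pd.DataFrame({"topic": rng.choice(["earn", "ship", "trade"], size=1_000_000)})

# Codes are computed from the column, so no hand-written list is needed.
big["topic_id"] = pd.factorize(big["topic"])[0]
print(big["topic_id"].nunique())
```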