Get mapping of categorical variables in pandas: answers

【Question title】: Get mapping of categorical variables in pandas
【Posted】: 2015-08-11 04:44:34
【Question description】:

I'm doing this to encode a categorical variable as numbers:

>>> import pandas as pd
>>> df = pd.DataFrame({'x': ['good', 'bad', 'good', 'great']}, dtype='category')
>>> df

       x
0   good
1    bad
2   good
3  great

How can I get the mapping between the original values and the new values?


【Solution 1】:

You can build a dictionary mapping with enumerate (much like turning a list into a dict whose keys are the list indices):

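dict(enumerate(df['x'].cat.categories))

# {0: 'bad', 1: 'good', 2: 'great'}

Alternatively, you can pair up the value and the code in each row:

dict(zip(df['x'].cat.codes, df['x']))

# {1: 'good', 0: 'bad', 2: 'great'}   (same mapping; keys appear in row order on Python 3.7+)

What happens here is more transparent, and so arguably safer. It is also much less efficient, because the arguments to zip() have length len(df), whereas df['x'].cat.categories only has as many entries as there are unique values, which is usually far fewer than len(df). Note also that this second method only covers categories that actually occur in the column.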
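
For anyone who wants to try both methods side by side, here is a minimal, self-contained sketch. It rebuilds the question's df (using astype('category') rather than the dtype argument, which is just my choice here) and also derives the reverse lookup; the names code_to_label, code_to_label_rows and label_to_code are illustrative and not from the original answer.

```python
import pandas as pd

# Rebuild the example column from the question and make it categorical.
df = pd.DataFrame({"x": ["good", "bad", "good", "great"]})
df["x"] = df["x"].astype("category")

# Method 1: enumerate the categories (one entry per unique value).
code_to_label = dict(enumerate(df["x"].cat.categories))

# Method 2: pair each row's code with its value (one pair per row).
code_to_label_rows = dict(zip(df["x"].cat.codes, df["x"]))

# Reverse direction, label -> code (illustrative helper, not in the answer).
label_to_code = {label: code for code, label in code_to_label.items()}

print(code_to_label)                        # {0: 'bad', 1: 'good', 2: 'great'}
print(label_to_code)                        # {'bad': 0, 'good': 1, 'great': 2}
print(code_to_label == code_to_label_rows)  # True
```

The two dicts only agree when every category occurs at least once in the column, which is the case for this small example.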

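Method 1 works because the categories are stored as an Index:

type(df['x'].cat.categories)

# pandas.core.indexes.base.Index

That means you can look up a value in the Index by position, much as you would in a list.

There are a couple of ways to verify that method 1 works. First, you can check that a round trip preserves the correct values: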

mapping = dict(enumerate(df['x'].cat.categories))
(df['x'] == df['x'].cat.codes.map(mapping).astype('category')).all()
# True

Or you can check that method 1 and method 2 give the same answer:

dict(enumerate(df['x'].cat.categories)) == dict(zip(df['x'].cat.codes, df['x']))

# True

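【Discussion】:

This gives me: AttributeError: 'Series' object has no attribute 'cat'
Can you be more specific? It sounds like you are trying to use cat on a column that isn't categorical. You can check the dtypes with data.info(), and you can convert almost any column to categorical with astype('category').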
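
To illustrate the comment above, here is a small sketch of that situation. The frame name data follows the comment, the column contents are borrowed from the question, and the exact wording of the error message depends on the pandas version, so the code only catches the exception type.

```python
import pandas as pd

# A plain (non-categorical) string column, as assumed in the comment above.
data = pd.DataFrame({"x": ["good", "bad", "good", "great"]})

# Accessing .cat on a non-categorical column raises AttributeError.
try:
    data["x"].cat
except AttributeError as exc:
    print("AttributeError:", exc)

# Convert the column first; after that the .cat accessor is available.
data["x"] = data["x"].astype("category")
mapping = dict(enumerate(data["x"].cat.categories))

print(data["x"].dtype)  # category
print(mapping)          # {0: 'bad', 1: 'good', 2: 'great'}
```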

【Solution 2】:

If you run this:

df["column_category"].cat.categories.get_loc("item")

it returns the code that "item" maps to (e.g. 0).

If you run this:

df["column_category"].cat.categories[0]

it returns the value that code 0 maps to (e.g. "item").
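
A quick sketch of both lookups on made-up data, in case it helps. The column name column_category and the label "item" come from the answer; the other values and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data; only "column_category" and "item" come from the answer.
df = pd.DataFrame({"column_category": ["item", "other", "item"]})
df["column_category"] = df["column_category"].astype("category")

categories = df["column_category"].cat.categories

code = categories.get_loc("item")  # label -> code
label = categories[code]           # code -> label

print(code)   # 0
print(label)  # item
```

get_loc raises a KeyError for a label that is not among the categories, so wrap it in try/except if the label may be missing.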


【Solution 3】:

Here is my solution, based on Matheus Araujo's answer.

Suppose we have a country column. First, you have to convert the column to the categorical dtype:

df.country = df.country.astype('category')

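Get the code for each value (returned as an integer Series):

df.country.cat.codes

Convert the codes back to strings:

df.country.cat.categories[df.country.cat.codes]

You can also pass a list of integers:

df.country.cat.categories[[0, 1, 2]]

or a single code:

df.country.cat.categories[0]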

df.country.cat.categories[0]
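
Putting those steps together, a minimal end-to-end sketch might look like this. The country values are made up for illustration, and I pass the codes as a NumPy array via to_numpy(), which is my choice here rather than something the answer requires.

```python
import pandas as pd

# Hypothetical country data for illustration.
df = pd.DataFrame({"country": ["Brazil", "Germany", "Brazil", "Japan"]})
df["country"] = df["country"].astype("category")

codes = df["country"].cat.codes            # integer code per row
categories = df["country"].cat.categories  # unique labels, indexed by code

# Decode the whole column, a list of codes, or a single code.
decoded = categories[codes.to_numpy()]
first_three = categories[[0, 1, 2]]
single = categories[0]

print(list(codes))        # [0, 1, 0, 2]
print(list(decoded))      # ['Brazil', 'Germany', 'Brazil', 'Japan']
print(list(first_three))  # ['Brazil', 'Germany', 'Japan']
print(single)             # Brazil
```

One caveat: missing values get the code -1, and indexing categories with -1 returns the last category, so handle NaN rows separately before decoding this way.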
