Sklearn: changing string class labels to int (answers)

【Question title】: Sklearn changing string class label to int
【Posted】: 2017-07-08 07:57:41
【Question】:

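I have a pandas DataFrame, and I am trying to change the values in a given column, which are represented as strings, into integers. For example: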

df = index    fruit   quantity   price 
         0    apple          5    0.99
         1    apple          2    0.99
         2   orange          4    0.89
         4   banana          1    1.64
       ...
     10023     kiwi         10    0.92

I would like to see:

df = index    fruit   quantity   price 
         0        1          5    0.99
         1        1          2    0.99
         2        2          4    0.89
         4        3          1    1.64
       ...
     10023        5         10    0.92

I can do this with

df["fruit"] = df["fruit"].map({"apple": 1, "orange": 2,...})

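This works well when I only have a small list of values to change, but I am looking at a column with more than 500 distinct labels. Is there a way to change these from string to int without writing the mapping by hand?

【Question comments】:

Tags: python pandas scikit-learn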

【Solution 1】:

You can use sklearn.preprocessing:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.fruit)
df['categorical_label'] = le.transform(df.fruit)

To convert the integer labels back to the original string values:

le.inverse_transform(df['categorical_label'])
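
The round trip above can be sketched end to end as follows. The sample DataFrame and the `fruit_code` column name are assumptions made for illustration; only `LabelEncoder`, `fit_transform`, `classes_`, and `inverse_transform` come from scikit-learn. Note that the codes start at 0 and follow the sorted order of the labels, not the order in which they appear.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data mirroring the layout in the question.
df = pd.DataFrame({
    "fruit": ["apple", "apple", "orange", "banana", "kiwi"],
    "quantity": [5, 2, 4, 1, 10],
    "price": [0.99, 0.99, 0.89, 1.64, 0.92],
})

le = LabelEncoder()
# fit_transform learns the set of labels and encodes the column in one step.
df["fruit_code"] = le.fit_transform(df["fruit"])

# classes_ holds the sorted unique labels; a label's position is its code.
classes = list(le.classes_)
codes = df["fruit_code"].tolist()

# inverse_transform maps the integer codes back to the original strings.
restored = list(le.inverse_transform(df["fruit_code"]))
```

Keeping the fitted `le` object around is what makes the reverse mapping possible later, which matters when there are hundreds of labels.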

【Comments】:

【Solution 2】:

Use factorize, then convert to categorical if necessary:

df.fruit = pd.factorize(df.fruit)[0]
print (df)
   fruit  quantity  price
0      0         5   0.99
1      0         2   0.99
2      1         4   0.89
3      2         1   1.64
4      3        10   0.92

df.fruit = pd.Categorical(pd.factorize(df.fruit)[0])
print (df)
  fruit  quantity  price
0     0         5   0.99
1     0         2   0.99
2     1         4   0.89
3     2         1   1.64
4     3        10   0.92

print (df.dtypes)
fruit       category
quantity       int64
price        float64
dtype: object

If needed, the codes can also start from 1:

df.fruit = pd.Categorical(pd.factorize(df.fruit)[0] + 1)
print (df)
  fruit  quantity  price
0     1         5   0.99
1     1         2   0.99
2     2         4   0.89
3     3         1   1.64
4     4        10   0.92
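
The snippets above keep only element `[0]` of the result. As a minimal sketch (the sample Series is an assumption for illustration), `pd.factorize` returns both the integer codes and the unique labels, and the second element can be used to map the codes back to strings:

```python
import pandas as pd

# Hypothetical sample column, in the same order as the question's rows.
fruit = pd.Series(["apple", "apple", "orange", "banana", "kiwi"])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(fruit)

# Shift by one if the labels should start at 1 instead of 0.
codes_from_one = codes + 1

# Indexing the uniques with the zero-based codes recovers the original labels.
restored = uniques.take(codes)
```

Storing `uniques` alongside the encoded column keeps the encoding reversible.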

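【Comments】:

Categoricals are factorized by definition; there is no reason to do this directly.
@Jeff - I don't understand. Do you mean that the output of factorize is categorical by design? print (type(pd.factorize(pd.Series(['apple','apple','orange', 'banana']))[0])) returns a numpy array, and the docs (last note) describe how to convert to categorical, which seems to happen after factorize. Or am I missing something? Thanks.
You don't need to factorize at all; just convert to category and use the codes. Those codes are the factorization, so there is no need to call factorize directly.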

【Solution 3】:

You can use the factorize method:

In [13]: df['fruit'] = pd.factorize(df['fruit'])[0].astype(np.uint16)

In [14]: df
Out[14]:
   index  fruit  quantity  price
0      0      0         5   0.99
1      1      0         2   0.99
2      2      1         4   0.89
3      4      2         1   1.64
4  10023      3        10   0.92

In [15]: df.dtypes
Out[15]:
index         int64
fruit        uint16
quantity      int64
price       float64
dtype: object

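You can also do it this way (note that the codes then follow the sorted order of the labels rather than their order of appearance):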

In [21]: df['fruit'] = df.fruit.astype('category').cat.codes

In [22]: df
Out[22]:
   index  fruit  quantity  price
0      0      0         5   0.99
1      1      0         2   0.99
2      2      3         4   0.89
3      4      1         1   1.64
4  10023      2        10   0.92

In [23]: df.dtypes
Out[23]:
index         int64
fruit          int8
quantity      int64
price       float64
dtype: object
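
The two outputs above number the fruits differently: factorize assigns codes in order of first appearance, while category codes follow the sorted order of the categories. A minimal sketch of the category route, using an assumed sample Series, also shows how to recover the labels from the codes:

```python
import pandas as pd

# Hypothetical sample column, in the same order as the question's rows.
fruit = pd.Series(["apple", "apple", "orange", "banana", "kiwi"])

cat = fruit.astype("category")
codes = cat.cat.codes            # integer codes in sorted-category order
categories = cat.cat.categories  # Index of labels; position equals code

# Build a code -> label lookup and use it to restore the strings.
mapping = dict(enumerate(categories))
restored = codes.map(mapping)
```

pandas picks the smallest signed integer dtype that fits the number of categories, so a column with more than 500 labels would get int16 rather than the int8 shown above.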

【Comments】: