Binarizing a pandas DataFrame column: answers

【Question title】: Binarizing pandas dataframe column
【Posted】: 2018-01-27 11:39:42
【Question】:

mean radius mean texture    mean perimeter  mean area   mean smoothness mean compactness    mean concavity  mean concave points mean symmetry   mean fractal dimension  ... worst texture   worst perimeter worst area  worst smoothness    worst compactness   worst concavity worst concave points    worst symmetry  worst fractal dimension classification
0   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.14710 0.2419  0.07871 ... 17.33   184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890 0
1   20.57   17.77   132.90  1326.0  0.08474 0.07864 0.0869  0.07017 0.1812  0.05667 ... 23.41   158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902 0
2   19.69   21.25   130.00  1203.0  0.10960 0.15990 0.1974  0.12790 0.2069  0.05999 ... 25.53   152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758 0
3   11.42   20.38   77.58   386.1   0.14250 0.28390 0.2414  0.10520 0.2597  0.09744 ... 26.50   98.87   567.7   0.2098  0.8663  0.6869  0.2575  0.6638  0.17300 0
4   20.29   14.34   135.10  1297.0  0.10030 0.13280 0.1980  0.10430 0.1809  0.05883 ... 16.67   152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678 0

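Suppose I have a pandas DataFrame like the one above. I want to binarize the mean radius column (change its values to 0 or 1) depending on whether a value is above 12.0.

What I tried is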

data_df.loc[data_df["mean radius"] > 12.0] = 0

But this gave me a strange result.

How can I fix this?

【Comments】:

Tags: python pandas

【Solution 1】:

If you want to change the entire column to 1s and 0s, you can modify your code slightly to:

# 0 if greater than 12, 1 otherwise
# (the column name must match exactly; "mean_radius" would create a new column)
data_df["mean radius"] = (data_df["mean radius"] <= 12.0).astype(int)
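As a minimal sketch of the comparison-plus-cast idea, the snippet below builds a toy frame from the five mean radius values shown in the question and stores the result in new columns instead of overwriting the original. The column names `radius_small` and `radius_large` and the variable `cutoff` are illustrative assumptions, not names from the original post.

```python
import pandas as pd

# Toy frame using the five "mean radius" values from the question.
data_df = pd.DataFrame({"mean radius": [17.99, 20.57, 19.69, 11.42, 20.29]})
cutoff = 12.0

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0.
# "radius_small" and "radius_large" are hypothetical column names.
data_df["radius_small"] = (data_df["mean radius"] <= cutoff).astype(int)
data_df["radius_large"] = (data_df["mean radius"] > cutoff).astype(int)

print(data_df)
```

Writing to a separate column keeps the raw measurements available, which may be useful if the cutoff changes later.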

If you only want to change the radius values greater than 12 to 0 (leaving values of 12 or less unchanged):

# only change the values > 12
# this method is discouraged, see edit below
data_df[data_df["mean radius"] > 12.0]["mean radius"] = 0

Edit

As @jp_data_analysis pointed out, chained indexing is discouraged. The preferred way to perform the second operation is multi-axis indexing with .loc, reproduced here from the answer below:

# only change the values > 12
data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
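To illustrate why chained indexing is discouraged, here is a small sketch on a toy frame (the values are taken from the question; the variable names `chained` and `fixed` are assumptions for this example). Boolean indexing returns a copy, so the chained assignment writes into a temporary object and the original frame is left untouched, while the single `.loc` call modifies the frame itself.

```python
import warnings

import pandas as pd

df = pd.DataFrame(
    {"mean radius": [17.99, 11.42, 20.29], "mean texture": [10.38, 20.38, 14.34]}
)

# Chained indexing: the first [] returns a copy, the second [] assigns into it.
chained = df.copy()
with warnings.catch_warnings():
    # pandas normally warns about this pattern; silenced here for the demo.
    warnings.simplefilter("ignore")
    chained[chained["mean radius"] > 12.0]["mean radius"] = 0

# Single .loc call with row mask and column label: assigns into the frame.
fixed = df.copy()
fixed.loc[fixed["mean radius"] > 12.0, "mean radius"] = 0

print(chained["mean radius"].tolist())  # unchanged values
print(fixed["mean radius"].tolist())    # values above 12 replaced by 0
```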

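【Comments】:

Thanks for the answer. If I want to store the binarized result in a new column, do I just need to do data_df['new column'] = data_df[column_name] < cutoff?
That's right. However, if you want 1s and 0s, you may have to call .astype(int) (see my update). data_df[column_name] < cutoff returns booleans (True and False).
The Boolean -> int Series approach is fine. But I would discourage chained indexing (see stackoverflow.com/a/41253181/9209546).
@jp_data_analysis Thanks for the information! I will edit the post.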

【Solution 2】:

Specify the column as well, like this:

data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
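The "strange result" in the question most likely comes from passing only a row mask to `.loc`, which assigns 0 to every column of the matching rows. The sketch below contrasts the two forms on a toy two-column frame; the names `rows_only` and `rows_and_col` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"mean radius": [17.99, 11.42, 20.29], "mean texture": [10.38, 20.38, 14.34]}
)

# Row mask only: every column in the matching rows is overwritten.
rows_only = df.copy()
rows_only.loc[rows_only["mean radius"] > 12.0] = 0

# Row mask plus column label: only "mean radius" is overwritten.
rows_and_col = df.copy()
rows_and_col.loc[rows_and_col["mean radius"] > 12.0, "mean radius"] = 0

print(rows_only)
print(rows_and_col)
```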

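【Comments】:

If I want to set the opposite case (less than 12.0) to 1, do I just need to write a new line with a different condition?
@Dawn17 Yes. Alternatively, set everything to 1 to begin with and then apply the 0 condition. But only if the two conditions cover all cases (for example, there are no NaN values).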
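The NaN caveat can be seen in a small sketch. It assumes a toy Series with one missing value; `flag` and `flag_nan` are hypothetical names. Because any comparison with NaN is False, the "start at 1, then set 0" approach silently leaves the missing row at 1. One possible way to keep missing values visible is shown afterwards.

```python
import numpy as np
import pandas as pd

s = pd.Series([17.99, np.nan, 11.42], name="mean radius")

# Start with 1 everywhere, then set 0 where the value exceeds the cutoff.
flag = pd.Series(1, index=s.index)
flag.loc[s > 12.0] = 0  # NaN > 12.0 is False, so the NaN row stays 1

# One option that preserves missing values: binarize, then blank out NaN rows.
flag_nan = (s <= 12.0).astype(float).where(s.notna())

print(flag.tolist())
print(flag_nan.tolist())
```

Whether a missing radius should count as 0, 1, or stay missing depends on the analysis; the second form just makes that choice explicit.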

【Solution 3】:

Using mask:

data_df["mean radius"]=data_df["mean radius"].mask(data_df["mean radius"] > 12.0,0)
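For comparison, a short sketch of `mask` next to its counterpart `where`, on a toy Series (the names `masked`, `kept`, and `binary` are assumptions for this example). `mask` replaces values where the condition is True, and `where` replaces values where the condition is False. The last line uses `numpy.where` as one way to produce a fully binarized 0/1 result in a single step.

```python
import numpy as np
import pandas as pd

s = pd.Series([17.99, 11.42, 20.29], name="mean radius")

# mask: replace with 0 where the condition holds, keep the rest.
masked = s.mask(s > 12.0, 0)

# where: keep values where the condition holds, replace the rest with 0.
kept = s.where(s <= 12.0, 0)

# Full binarization: 0 above the cutoff, 1 otherwise.
binary = pd.Series(np.where(s > 12.0, 0, 1), index=s.index, name=s.name)

print(masked.tolist())
print(kept.tolist())
print(binary.tolist())
```

The first two results agree here only because the data has no missing values: with a NaN entry, `mask(s > 12.0, 0)` leaves it as NaN, whereas `where(s <= 12.0, 0)` turns it into 0.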

【Comments】: