【Posted】: 2020-08-29 14:46:50
【Question】:
Given a DataFrame like the following:
latitude longitude user_service
0 -27.496404 153.014353 02: Duhig Tower
1 -27.497107 153.014836 NaN
2 -27.497118 153.014890 NaN
3 -27.497154 153.014813 NaN
4 -27.496437 153.014477 12: Duhig North
5 -27.497156 153.014813 32: Gordon Greenwod
6 -27.497097 153.014746 23: Abel Smith
7 -27.496390 153.014415 32: Gordon Greenwod
8 -27.497112 153.014780 03: Steele
9 -27.497156 153.014813 32: Gordon Greenwod
10 -27.496487 153.014622 02: Duhig Tower
11 -27.497075 153.014532 NaN
12 -27.497103 153.014817 25: UQ Sports
13 -27.496754 153.014504 02: Duhig Tower
14 -27.496567 153.014294 02: Duhig Tower
15 -27.497156 153.014813 32: Gordon Greenwod
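Since the user_service column has missing values, I thought I might use a spatial clustering method to fill in the NaNs.
For example, take the second row, with the latitude and longitude pair -27.497107, 153.014836: if the location of 02: Duhig Tower is the closest to it, then I would like to replace the NaN in user_service with 02: Duhig Tower for that row. The same logic applies to the other rows with missing values.
How can I implement this logic in Python? Thanks.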
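One way to read the logic above is as a nearest-neighbour lookup rather than a clustering step such as k-means or DBSCAN: each row with a missing user_service takes the label of the closest row that already has one. The following is a minimal sketch under that assumption. The helper name fill_from_nearest_labeled is hypothetical, the sample rows are a subset of the data in the question, and the haversine central angle is used only to rank neighbours.

```python
import numpy as np
import pandas as pd


def fill_from_nearest_labeled(df, label_col="user_service"):
    """Fill missing labels with the label of the nearest labelled row.

    Hypothetical helper: "nearest" is measured by the haversine central
    angle between (latitude, longitude) pairs given in degrees.
    """
    out = df.copy()
    known = out[out[label_col].notna()]
    missing = out[out[label_col].isna()]
    if known.empty or missing.empty:
        return out  # nothing to fill, or nothing to fill from

    # Convert degrees to radians for the trigonometric functions.
    known_rad = np.radians(known[["latitude", "longitude"]].to_numpy(dtype=float))
    miss_rad = np.radians(missing[["latitude", "longitude"]].to_numpy(dtype=float))

    # Pairwise haversine term, shape (n_missing, n_known).
    dlat = miss_rad[:, None, 0] - known_rad[None, :, 0]
    dlon = miss_rad[:, None, 1] - known_rad[None, :, 1]
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(miss_rad[:, None, 0]) * np.cos(known_rad[None, :, 0]) * np.sin(dlon / 2) ** 2
    )
    angle = 2 * np.arcsin(np.sqrt(a))  # central angle; same ordering as distance

    # For each missing row, pick the position of the closest labelled row.
    nearest = angle.argmin(axis=1)
    out.loc[missing.index, label_col] = known[label_col].to_numpy()[nearest]
    return out


# A few rows taken from the question's data.
df = pd.DataFrame(
    {
        "latitude": [-27.496404, -27.497107, -27.496437, -27.497156, -27.497075],
        "longitude": [153.014353, 153.014836, 153.014477, 153.014813, 153.014532],
        "user_service": [
            "02: Duhig Tower",
            np.nan,
            "12: Duhig North",
            "32: Gordon Greenwod",
            np.nan,
        ],
    }
)

filled = fill_from_nearest_labeled(df)
print(filled)
```

Because every missing row is compared against every labelled row, this approach cannot leave a NaN behind as long as at least one row carries a label. The pairwise matrix grows as n_missing × n_known, so for large frames a spatial index would likely be a better fit.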
Output from Guillermo Mosse's solution, which still contains some NaNs:
latitude longitude user_service
0 -27.499012 153.015180 51: Zelman Cowen
1 -27.497600 153.014479 03: Steele
2 -27.500054 153.013435 50: Hawken Engineering
3 -27.495979 153.009834 NaN
4 -27.496748 153.017507 32: Gordon Greenwod
5 -27.495695 153.016178 38: UQ Multi Faith Chaplaincy
6 -27.497015 153.012492 01: Forgan Smith
7 -27.498797 153.017267 NaN
8 -27.500508 153.011360 75: AIBN
9 -27.496763 153.013795 01: Forgan Smith
10 -27.494909 153.017187 NaN
11 -27.496384 153.013810 12: Duhig North
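Inspecting the NaNs:
# df.loc[[2]] uses a list indexer, so it returns a one-row DataFrame;
# selecting user_service from it gives a Series of length 1.
var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))
Output: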
2 NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1
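When a column has dtype object, it can help to check whether the remaining gaps are real missing values or merely the text "NaN". The sketch below shows one way to tell the two apart. The small frame is made up for illustration (only the two labels come from the output above), and the literal "NaN" string in it is a hypothetical case, not something the output above confirms.

```python
import numpy as np
import pandas as pd

# Illustrative data: one real missing value and one literal "NaN" string.
df = pd.DataFrame(
    {"user_service": ["51: Zelman Cowen", "03: Steele", np.nan, "NaN"]}
)

col = df["user_service"]

# Real missing values are detected by isna().
n_missing = int(col.isna().sum())
missing_index = col.index[col.isna()].tolist()

# Strings that only look like NaN are not detected by isna().
is_text_nan = col.map(lambda v: isinstance(v, str) and v.strip().lower() == "nan")

print(n_missing, missing_index, int(is_text_nan.sum()))
```

If is_text_nan flags any rows, those values would need to be converted to real missing values before a fill step can see them.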
【Discussion】:
Tags: python-3.x pandas scikit-learn k-means dbscan