Fill missing values based on a spatial clustering method in Python (answers)

【Question title】: Fill missing values based on spatial clustering method in Python
【Posted】: 2020-08-29 14:46:50
【Question description】:

Given a dataframe as follows:

     latitude   longitude         user_service
0  -27.496404  153.014353      02: Duhig Tower
1  -27.497107  153.014836                  NaN
2  -27.497118  153.014890                  NaN
3  -27.497154  153.014813                  NaN
4  -27.496437  153.014477      12: Duhig North
5  -27.497156  153.014813  32: Gordon Greenwod
6  -27.497097  153.014746       23: Abel Smith
7  -27.496390  153.014415  32: Gordon Greenwod
8  -27.497112  153.014780           03: Steele
9  -27.497156  153.014813  32: Gordon Greenwod
10 -27.496487  153.014622      02: Duhig Tower
11 -27.497075  153.014532                  NaN
12 -27.497103  153.014817        25: UQ Sports
13 -27.496754  153.014504      02: Duhig Tower
14 -27.496567  153.014294      02: Duhig Tower
15 -27.497156  153.014813  32: Gordon Greenwod

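Since the user_service column has missing values, I thought I could use a spatial clustering method to fill the NaNs.

For example, take the latitude and longitude pair -27.497107, 153.014836 in the second row: if the location of 02: Duhig Tower is the closest one to it, I would like to fill the NaN in user_service for this row with 02: Duhig Tower. The same logic applies to the other rows with missing values.

How can I implement the logic above in Python? Thanks.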

Output from Guillermo Mosse's solution, which still contains some NaNs:

     latitude   longitude                   user_service
0  -27.499012  153.015180               51: Zelman Cowen
1  -27.497600  153.014479                     03: Steele
2  -27.500054  153.013435         50: Hawken Engineering
3  -27.495979  153.009834                            NaN
4  -27.496748  153.017507            32: Gordon Greenwod
5  -27.495695  153.016178  38: UQ Multi Faith Chaplaincy
6  -27.497015  153.012492               01: Forgan Smith
7  -27.498797  153.017267                            NaN
8  -27.500508  153.011360                       75: AIBN
9  -27.496763  153.013795               01: Forgan Smith
10 -27.494909  153.017187                            NaN
11 -27.496384  153.013810                12: Duhig North

Inspecting one of the NaNs:

var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))

Output:

2    NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1

【Question comments】:

Tags: python-3.x pandas scikit-learn k-means dbscan

【Solution 1】:

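Ideally, you would use Pandas' interpolate with a custom distance function to fill the NaN values, but interpolate does not seem to be extensible in that way.

One possible solution is, for every data point, to take the user_service of the closest data point that actually has a user_service. Here is a full working example of that approach: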

import pandas as pd
from scipy.spatial.distance import cdist
import numpy as np

df = pd.DataFrame([
  [-27.496404,  153.014353,      "02: Duhig Tower"],
  [-27.497107,  153.014836,                  None],
  [-27.497118,  153.014890,                  None],
  [-27.497154,  153.014813,                  None],
  [-27.496437,  153.014477,      "12: Duhig North"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"],
  [-27.497097,  153.014746,       "23: Abel Smith"],
  [-27.496390,  153.014415,  "32: Gordon Greenwod"],
  [-27.497112,  153.014780,           "03: Steele"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"],
  [-27.496487,  153.014622,      "02: Duhig Tower"],
  [-27.497075,  153.014532,                  None],
  [-27.497103,  153.014817,        "25: UQ Sports"],
  [-27.496754,  153.014504,      "02: Duhig Tower"],
  [-27.496567,  153.014294,      "02: Duhig Tower"],
  [-27.497156,  153.014813,  "32: Gordon Greenwod"]],
    columns = ["latitude", "longitude", "user_service"])


def closest_point_service_name(point, points, user_services):
    """ Find closest point with non null user_service """

    #First we filter the points and user_services by the ones that don't have null user_service
    points = points[user_services != None]
    user_services = user_services[user_services != None]

    #we use cdist to get all distances between pairs of points
    distances = cdist([point], points)[0]

    #we don't want to consider the current point
    distances[distances == 0] = np.inf

    #we get the index of the closest point
    closest_point_index = distances.argmin()

    #we return the user_service of the closest point that has a user_service
    closest_point_user_service = user_services[closest_point_index]
    return closest_point_user_service

#we convert the lat and long to a pair
df['point'] = [(x, y) for x,y in zip(df['latitude'], df['longitude'])]

#we create the additional column
df['closest'] = [closest_point_service_name(x, np.asarray(list(df['point'])), np.asarray(list(df['user_service']))) for x in df['point']]

#finally, we fill nulls
df.user_service = df.user_service.fillna(df['closest'])

del df['closest']

df

Here is the output:

     latitude   longitude         user_service                     point
0  -27.496404  153.014353      02: Duhig Tower  (-27.496404, 153.014353)
1  -27.497107  153.014836        25: UQ Sports  (-27.497107, 153.014836)
2  -27.497118  153.014890        25: UQ Sports   (-27.497118, 153.01489)
3  -27.497154  153.014813  32: Gordon Greenwod  (-27.497154, 153.014813)
4  -27.496437  153.014477      12: Duhig North  (-27.496437, 153.014477)
5  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
6  -27.497097  153.014746       23: Abel Smith  (-27.497097, 153.014746)
7  -27.496390  153.014415  32: Gordon Greenwod   (-27.49639, 153.014415)
8  -27.497112  153.014780           03: Steele   (-27.497112, 153.01478)
9  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
10 -27.496487  153.014622      02: Duhig Tower  (-27.496487, 153.014622)
11 -27.497075  153.014532       23: Abel Smith  (-27.497075, 153.014532)
12 -27.497103  153.014817        25: UQ Sports  (-27.497103, 153.014817)
13 -27.496754  153.014504      02: Duhig Tower  (-27.496754, 153.014504)
14 -27.496567  153.014294      02: Duhig Tower  (-27.496567, 153.014294)
15 -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
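
The same nearest-labelled-point idea can also be written without a Python-level loop over rows. The following is only a sketch under a few assumptions: it swaps cdist for scipy.spatial.cKDTree, it uses plain Euclidean distance on the raw degree values (as the example above does), and fill_from_nearest_labelled is a hypothetical helper name, not something pandas or SciPy provides. It detects missing entries with isna(), so both None and NaN count as missing.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_from_nearest_labelled(df, value_col="user_service",
                               coord_cols=("latitude", "longitude")):
    """Fill missing value_col entries from the nearest row that has one.

    Hypothetical helper for illustration; not part of pandas or SciPy.
    """
    out = df.copy()
    missing = out[value_col].isna().to_numpy()   # True for both None and NaN
    if missing.all() or not missing.any():
        return out                               # nothing to copy from, or nothing to fill

    coords = out[list(coord_cols)].to_numpy(dtype=float)
    labelled_values = out.loc[~missing, value_col].to_numpy()

    # Index only the labelled rows, then look up the single nearest
    # labelled neighbour of every unlabelled row (Euclidean, in degrees).
    tree = cKDTree(coords[~missing])
    _, nearest = tree.query(coords[missing], k=1)

    out.loc[missing, value_col] = labelled_values[nearest]
    return out


# A small subset of the rows from the question, mixing None and NaN on purpose.
df = pd.DataFrame(
    [[-27.496404, 153.014353, "02: Duhig Tower"],
     [-27.497107, 153.014836, np.nan],
     [-27.497154, 153.014813, None],
     [-27.497156, 153.014813, "32: Gordon Greenwod"],
     [-27.497103, 153.014817, "25: UQ Sports"]],
    columns=["latitude", "longitude", "user_service"])

filled = fill_from_nearest_labelled(df)
print(filled)
```

Because the tree is built from labelled rows only, a row with a missing value can never pick itself, so no zero-distance special case is needed here.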

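【Discussion】:

Thanks a lot for your help. I tried it with the full data, and some missing values are still filled only with NaNs. Any ideas?
See this line? user_services = user_services[user_services != None] I use it because I only want to consider the non-null user services. In my example, the nulls are None. You are probably using a different data type for nulls. You could try the math.isnan() function.
So, in other words, my recommendation is to first find out which data type you are using for NaN values, and then update the corresponding lines accordingly. If that doesn't work, just let me know!
Do you mean using user_services = user_services[math.isnan(user_services)] instead of user_services = user_services[user_services != None]?
I would first grab a single row and check that math.isnan actually returns True on one of your null values, but yes! Though the other way around: you have to negate isnan, because you only want to keep the non-NaN values.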
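
To make the point of this thread concrete, here is a small sketch of why the != None filter can leave NaNs behind, and one possible NaN-safe replacement. It assumes the column may hold both Python None and float NaN; the array below is made-up illustration data, and is_missing is a hypothetical helper. Note that math.isnan accepts a single number only, so it cannot be applied to a whole array or to the string labels directly, which is why pd.notna is used for the mask.

```python
import math

import numpy as np
import pandas as pd

# Made-up values: the same column can carry both kinds of "missing".
user_services = np.array(
    ["02: Duhig Tower", None, np.nan, "03: Steele"], dtype=object)

# Comparing against None removes only the None entry; the NaN entry survives,
# so it can later be picked as the "closest" label.
kept_by_none_check = user_services[user_services != None]  # noqa: E711
print(len(kept_by_none_check))

# pd.notna treats None and NaN alike and works element-wise on arrays.
labelled_mask = pd.notna(user_services)
kept = user_services[labelled_mask]
print(kept)


def is_missing(value):
    """Hypothetical per-element check that guards math.isnan against strings."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Equivalent mask built by hand, negated to keep only the labelled entries.
manual_mask = np.array([not is_missing(v) for v in user_services])
```

In the answer's function, this would amount to building the mask once with pd.notna(user_services) and applying it to both points and user_services before the second array is overwritten.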