【Question title】: Fill missing values based on spatial clustering method in Python
【Posted】: 2020-08-29 14:46:50
【Question description】:

Given a data frame as follows:

     latitude   longitude         user_service
0  -27.496404  153.014353      02: Duhig Tower
1  -27.497107  153.014836                  NaN
2  -27.497118  153.014890                  NaN
3  -27.497154  153.014813                  NaN
4  -27.496437  153.014477      12: Duhig North
5  -27.497156  153.014813  32: Gordon Greenwod
6  -27.497097  153.014746       23: Abel Smith
7  -27.496390  153.014415  32: Gordon Greenwod
8  -27.497112  153.014780           03: Steele
9  -27.497156  153.014813  32: Gordon Greenwod
10 -27.496487  153.014622      02: Duhig Tower
11 -27.497075  153.014532                  NaN
12 -27.497103  153.014817        25: UQ Sports
13 -27.496754  153.014504      02: Duhig Tower
14 -27.496567  153.014294      02: Duhig Tower
15 -27.497156  153.014813  32: Gordon Greenwod

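Since the user_service column has missing values, I thought I might be able to fill the NaNs using a spatial clustering method.

For example, take the second row, whose latitude and longitude are -27.497107, 153.014836. If the location of 02: Duhig Tower is the closest one to it, I would like to fill the NaN in user_service for this row with 02: Duhig Tower. The same logic applies to the other rows with missing values.

How can I implement this logic in Python? Thanks.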

Output of Guillermo Mosse's solution, which still contains some NaNs:

     latitude   longitude                   user_service
0  -27.499012  153.015180               51: Zelman Cowen
1  -27.497600  153.014479                     03: Steele
2  -27.500054  153.013435         50: Hawken Engineering
3  -27.495979  153.009834                            NaN
4  -27.496748  153.017507            32: Gordon Greenwod
5  -27.495695  153.016178  38: UQ Multi Faith Chaplaincy
6  -27.497015  153.012492               01: Forgan Smith
7  -27.498797  153.017267                            NaN
8  -27.500508  153.011360                       75: AIBN
9  -27.496763  153.013795               01: Forgan Smith
10 -27.494909  153.017187                            NaN
11 -27.496384  153.013810                12: Duhig North

Inspecting the NaNs:

var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))

Output:

2    NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1
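
The check above shows that the value prints as NaN, but not which Python object sits behind it. A minimal sketch of how to tell the two common cases apart, using a hypothetical three-element array rather than the real data: a comparison with `None` only removes `None` entries, while `pd.isna` flags both `None` and float NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one real label, one None, one float NaN.
arr = np.array(["02: Duhig Tower", None, np.nan], dtype=object)

# Show the Python type behind each entry.
for value in arr:
    print(repr(value), type(value).__name__)

kept_by_none_check = arr != None   # elementwise; only the None entry is dropped
is_missing = pd.isna(arr)          # flags both None and NaN

print(kept_by_none_check)
print(is_missing)
```

If the missing entries turn out to be float NaN, a filter written as `!= None` keeps them, which is one plausible reason for NaNs surviving a fill.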

【Comments】:

    Tags: python-3.x pandas scikit-learn k-means dbscan


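    【Solution 1】:

    Ideally you would use Pandas' interpolate with a custom distance function to fill the NaN values, but that method does not appear to be extensible in any way.

    One possible approach is, for each data point, to take the user_service of the nearest data point that actually has a user_service. Here is a complete working example of that approach: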

    import pandas as pd
    from scipy.spatial.distance import cdist
    import numpy as np
    
    df = pd.DataFrame([
      [-27.496404,  153.014353,      "02: Duhig Tower"],
      [-27.497107,  153.014836,                  None],
      [-27.497118,  153.014890,                  None],
      [-27.497154,  153.014813,                  None],
      [-27.496437,  153.014477,      "12: Duhig North"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"],
      [-27.497097,  153.014746,       "23: Abel Smith"],
      [-27.496390,  153.014415,  "32: Gordon Greenwod"],
      [-27.497112,  153.014780,           "03: Steele"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"],
      [-27.496487,  153.014622,      "02: Duhig Tower"],
      [-27.497075,  153.014532,                  None],
      [-27.497103,  153.014817,        "25: UQ Sports"],
      [-27.496754,  153.014504,      "02: Duhig Tower"],
      [-27.496567,  153.014294,      "02: Duhig Tower"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"]],
        columns = ["latitude", "longitude", "user_service"])
    
    
    def closest_point_service_name(point, points, user_services):
        """ Find closest point with non null user_service """
    
        # First we filter points and user_services down to the rows with a non-null user_service.
        # Note: `!= None` only removes None; missing values stored as NaN are not filtered out.
        points = points[user_services != None]
        user_services = user_services[user_services != None]
    
        #we use cdist to get all distances between pairs of points
        distances = cdist([point], points)[0]
    
        #we don't want to consider the current point
        distances[distances == 0] = np.inf
    
        #we get the index of the closest point
        closest_point_index = distances.argmin()
    
        #we return the user_service of the closest point that has a user_service
        closest_point_user_service = user_services[closest_point_index]
        return closest_point_user_service
    
    #we convert the lat and long to a pair
    df['point'] = [(x, y) for x,y in zip(df['latitude'], df['longitude'])]
    
    #we create the additional column
    df['closest'] = [closest_point_service_name(x, np.asarray(list(df['point'])), np.asarray(list(df['user_service']))) for x in df['point']]
    
    #finally, we fill nulls
    df.user_service = df.user_service.fillna(df['closest'])
    
    del df['closest']
    
    df
    

    Here is the output:

         latitude   longitude         user_service                     point
    0  -27.496404  153.014353      02: Duhig Tower  (-27.496404, 153.014353)
    1  -27.497107  153.014836        25: UQ Sports  (-27.497107, 153.014836)
    2  -27.497118  153.014890        25: UQ Sports   (-27.497118, 153.01489)
    3  -27.497154  153.014813  32: Gordon Greenwod  (-27.497154, 153.014813)
    4  -27.496437  153.014477      12: Duhig North  (-27.496437, 153.014477)
    5  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
    6  -27.497097  153.014746       23: Abel Smith  (-27.497097, 153.014746)
    7  -27.496390  153.014415  32: Gordon Greenwod   (-27.49639, 153.014415)
    8  -27.497112  153.014780           03: Steele   (-27.497112, 153.01478)
    9  -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
    10 -27.496487  153.014622      02: Duhig Tower  (-27.496487, 153.014622)
    11 -27.497075  153.014532       23: Abel Smith  (-27.497075, 153.014532)
    12 -27.497103  153.014817        25: UQ Sports  (-27.497103, 153.014817)
    13 -27.496754  153.014504      02: Duhig Tower  (-27.496754, 153.014504)
    14 -27.496567  153.014294      02: Duhig Tower  (-27.496567, 153.014294)
    15 -27.497156  153.014813  32: Gordon Greenwod  (-27.497156, 153.014813)
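
For larger data sets, the same nearest-labeled-point idea can be sketched with a KD-tree instead of one `cdist` call per row. This is an alternative to the code above, not the answerer's method: it swaps in `scipy.spatial.cKDTree`, builds the tree only from rows that have a user_service, and queries it only for rows that lack one. Like the original, it treats latitude and longitude as plain Euclidean coordinates, which is an assumption that is reasonable only over a small area.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Same 16 rows as in the example above.
rows = [
    (-27.496404, 153.014353, "02: Duhig Tower"), (-27.497107, 153.014836, None),
    (-27.497118, 153.014890, None), (-27.497154, 153.014813, None),
    (-27.496437, 153.014477, "12: Duhig North"),
    (-27.497156, 153.014813, "32: Gordon Greenwod"),
    (-27.497097, 153.014746, "23: Abel Smith"),
    (-27.496390, 153.014415, "32: Gordon Greenwod"),
    (-27.497112, 153.014780, "03: Steele"),
    (-27.497156, 153.014813, "32: Gordon Greenwod"),
    (-27.496487, 153.014622, "02: Duhig Tower"), (-27.497075, 153.014532, None),
    (-27.497103, 153.014817, "25: UQ Sports"),
    (-27.496754, 153.014504, "02: Duhig Tower"),
    (-27.496567, 153.014294, "02: Duhig Tower"),
    (-27.497156, 153.014813, "32: Gordon Greenwod"),
]
df = pd.DataFrame(rows, columns=["latitude", "longitude", "user_service"])

# Boolean mask of rows that already have a label (handles None and NaN alike).
labeled = df["user_service"].notna().to_numpy()
coords = df[["latitude", "longitude"]].to_numpy(dtype=float)

# Index only the labeled points, then look up the nearest one for each unlabeled row.
tree = cKDTree(coords[labeled])
_, nearest = tree.query(coords[~labeled], k=1)

labels = df.loc[labeled, "user_service"].to_numpy()
df.loc[~labeled, "user_service"] = labels[nearest]
print(df)
```

On this sample the filled rows (1, 2, 3 and 11) receive the same labels as in the output listed above.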
    

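    【Comments】:

    • Thank you very much for your help. I tried it with the full data, and some missing values were simply filled with NaNs. Any ideas?
    • See this line? user_services = user_services[user_services != None] I use it because I only want to consider non-null user services. In my example, the null values are None. You may be using a different data type for your null values. You could try the math.isnan() function.
    • So, in other words, my suggestion is to first work out which data type you are using for the NaN values and then update the corresponding lines accordingly. Let me know if it doesn't work out!
    • Do you mean using user_services = user_services[math.isnan(user_services)] instead of user_services = user_services[user_services != None]?
    • I would first grab a single row and check that math.isnan actually returns True on one of your null values, but yes! The other way around, though: you have to negate isnan, because you only want to keep the non-NaN values.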
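
The filter discussed in these comments can be sketched in a form that does not depend on whether the nulls are None or NaN. Note that `math.isnan` accepts only a single number and raises a TypeError on a string label, so this sketch swaps in `pd.notna` / `Series.notna` instead. The function name `fill_from_nearest_labeled` and the small sample frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def fill_from_nearest_labeled(df, label_col="user_service",
                              coord_cols=("latitude", "longitude")):
    """Return a copy of df whose missing labels come from the nearest labeled row."""
    out = df.copy()
    has_label = out[label_col].notna().to_numpy()  # False for both None and NaN
    if has_label.all() or not has_label.any():
        return out  # nothing to fill, or nothing to fill from

    coords = out[list(coord_cols)].to_numpy(dtype=float)
    # Distances from every unlabeled row to every labeled row.
    distances = cdist(coords[~has_label], coords[has_label])
    nearest = distances.argmin(axis=1)

    labels = out.loc[has_label, label_col].to_numpy()
    out.loc[~has_label, label_col] = labels[nearest]
    return out


# Hypothetical sample that mixes NaN and None as missing markers.
sample = pd.DataFrame(
    [
        [-27.496404, 153.014353, "02: Duhig Tower"],
        [-27.497107, 153.014836, np.nan],
        [-27.497103, 153.014817, "25: UQ Sports"],
        [-27.497075, 153.014532, None],
        [-27.497097, 153.014746, "23: Abel Smith"],
    ],
    columns=["latitude", "longitude", "user_service"],
)
filled = fill_from_nearest_labeled(sample)
print(filled)
```

Because only unlabeled rows are compared against labeled rows, no row can match itself, so the step that sets zero distances to infinity is not needed here.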