【Question title】: Python location, show distance from closest other location
【Posted】: 2020-11-27 17:39:49
【Question】:

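I have locations in a dataframe, under the column names lat and lon. I want to show how far each one is from the lat/lon of the nearest train station, which is held in a separate dataframe.

For example, I have a lat/lon of (-37.814563, 144.970267), and I have a list of other geospatial points as shown below. I want to find the closest point and then the distance between the two points, as an extra column in the suburbs dataframe.

This is a sample of the train station dataset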

<bound method NDFrame.to_clipboard of   STOP_ID                                          STOP_NAME   LATITUDE  \
0   19970             Royal Park Railway Station (Parkville) -37.781193   
1   19971  Flemington Bridge Railway Station (North Melbo... -37.788140   
2   19972         Macaulay Railway Station (North Melbourne) -37.794267   
3   19973   North Melbourne Railway Station (West Melbourne) -37.807419   
4   19974        Clifton Hill Railway Station (Clifton Hill) -37.788657   

    LONGITUDE TICKETZONE                                          ROUTEUSSP  \
0  144.952301          1                                            Upfield   
1  144.939323          1                                            Upfield   
2  144.936166          1                                            Upfield   
3  144.942570          1  Flemington,Sunbury,Upfield,Werribee,Williamsto...   
4  144.995417          1                                 Mernda,Hurstbridge   

                      geometry  
0  POINT (144.95230 -37.78119)  
1  POINT (144.93932 -37.78814)  
2  POINT (144.93617 -37.79427)  
3  POINT (144.94257 -37.80742)  
4  POINT (144.99542 -37.78866)  >

This is a sample of the suburbs

<bound method NDFrame.to_clipboard of       postcode              suburb state        lat         lon
4901      3000           MELBOURNE   VIC -37.814563  144.970267
4902      3002      EAST MELBOURNE   VIC -37.816640  144.987811
4903      3003      WEST MELBOURNE   VIC -37.806255  144.941123
4904      3005  WORLD TRADE CENTRE   VIC -37.822262  144.954856
4905      3006           SOUTHBANK   VIC -37.823258  144.965926>

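I want to show the distance from each lat/lon to the closest train station in a new column of the suburbs list.

Using the solutions below I get strange output, and I am wondering whether it is correct.

Both solutions are shown here: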

from sklearn.neighbors import NearestNeighbors
from haversine import haversine

NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(trains_shape[['LATITUDE', 'LONGITUDE']])

indices = NN.kneighbors(df_complete[['lat', 'lon']])[1]
indices = [index[0] for index in indices]
distances = NN.kneighbors(df_complete[['lat', 'lon']])[0]
df_complete['closest_station'] = trains_shape.iloc[indices]['STOP_NAME'].reset_index(drop=True)
df_complete['closest_station_distances'] = distances
print(df_complete)

The output:

<bound method NDFrame.to_clipboard of    postcode        suburb state        lat         lon  Venues Cluster  \
1      3040    aberfeldie   VIC -37.756690  144.896259             4.0   
2      3042  airport west   VIC -37.711698  144.887037             1.0   
4      3206   albert park   VIC -37.840705  144.955710             0.0   
5      3020        albion   VIC -37.775954  144.819395             2.0   
6      3078    alphington   VIC -37.780767  145.031160             4.0   

                     #1                    #2             #3  \
1                  Café     Electronics Store  Grocery Store   
2  Fast Food Restaurant                  Café    Supermarket   
4                  Café                   Pub    Coffee Shop   
5                  Café  Fast Food Restaurant  Grocery Store   
6                  Café                  Park            Bar   

                      #4  ...                             #6  \
1            Coffee Shop  ...                         Bakery   
2          Grocery Store  ...             Italian Restaurant   
4         Breakfast Spot  ...                   Burger Joint   
5  Vietnamese Restaurant  ...                            Pub   
6            Pizza Place  ...  Vegetarian / Vegan Restaurant   

                      #7                   #8                         #9  \
1          Shopping Mall  Japanese Restaurant          Indian Restaurant   
2  Portuguese Restaurant    Electronics Store  Middle Eastern Restaurant   
4                    Bar               Bakery                  Gastropub   
5     Chinese Restaurant                  Gym                     Bakery   
6     Italian Restaurant            Gastropub                     Bakery   

                 #10 Ancestry Cluster  ClosestStopId  \
1   Greek Restaurant              8.0          20037   
2  Convenience Store              5.0          20032   
4              Beach              6.0          22180   
5  Convenience Store              5.0          20004   
6        Coffee Shop              5.0          19931   

                                   ClosestStopName  \
1              Essendon Railway Station (Essendon)   
2                Glenroy Railway Station (Glenroy)   
4  Southern Cross Railway Station (Melbourne City)   
5          Albion Railway Station (Sunshine North)   
6          Alphington Railway Station (Alphington)   

                                   closest_station closest_station_distances  
1                Glenroy Railway Station (Glenroy)                  0.019918  
2  Southern Cross Railway Station (Melbourne City)                  0.031020  
4          Alphington Railway Station (Alphington)                  0.023165  
5                  Altona Railway Station (Altona)                  0.005559  
6                Newport Railway Station (Newport)                  0.002375  

And the second function:

def ClosestStop(r):
    # Cartesian distance: square root of (x2-x1)^2 + (y2-y1)^2
    distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
    
    # Stop with minimum Distance from the Suburb
    closestStationId = distances[distances == distances.min()].index.to_list()[0]
    return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]

df_complete[['ClosestStopId', 'ClosestStopName']] = df_complete.apply(ClosestStop, axis=1)

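Strangely, this gives different answers, which makes me think something is wrong with this code. The km values also seem to be wrong.

I am not at all sure how to solve this. Any guidance would be appreciated, thanks!

【Comments】:

  • You need 1. a function distance(lat1, lon1, lat2, lon2), 2. to apply it to every combination of suburb and station, 3. to take the station with the shortest distance for each suburb and add it to the dataframe. (Or use sklearn's NearestNeighbors.)
  • See the answers here: stackoverflow.com/q/365826/6692898
  • In the first solution you use 'haversine' as the distance function in NN; that is the haversine distance built into sklearn, which is expressed in radians. You can find a link to the documentation in my answer. To get the haversine distance in km, use the imported haversine package as the distance in NN. This is also shown in my answer.
  • Can you share how many cities and stations you want to compute distances for? I don't have a scalable BallTree example here yet, which is what you will need as the numbers grow.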
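
The three steps in the first comment can be sketched in plain Python. This is a minimal illustration rather than code from the thread: the function names distance_km and nearest_station and the mean Earth radius of 6371.0088 km are assumptions, and the coordinates are copied from the sample data in the question.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, km) of the closest station; stations is a list of (name, lat, lon)."""
    return min(
        ((name, distance_km(lat, lon, s_lat, s_lon)) for name, s_lat, s_lon in stations),
        key=lambda pair: pair[1],
    )


# Station sample from the question
stations = [
    ("Royal Park Railway Station (Parkville)", -37.781193, 144.952301),
    ("Macaulay Railway Station (North Melbourne)", -37.794267, 144.936166),
    ("North Melbourne Railway Station (West Melbourne)", -37.807419, 144.942570),
    ("Clifton Hill Railway Station (Clifton Hill)", -37.788657, 144.995417),
]

# Suburb MELBOURNE from the question
name, km = nearest_station(-37.814563, 144.970267, stations)
print(name, round(km, 3))
```

Applying nearest_station to every suburb row checks every suburb/station pair, which is fine for small tables but grows quickly with the number of rows.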

Tags: python pandas


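【Solution 1】:

A few key concepts

  1. Do a Cartesian product between the two dataframes to get all combinations (joining on an identical value in both dataframes, foo=1, is one way to approach this)
  2. Once both sets of data are together, use the two sets of lat/lon to calculate the distance; geopy has been used for this
  3. Clean up the columns and use sort_values() to bring the smallest distance first
  4. Finally, groupby().agg() to get the first (shortest) distance values

There are two dataframes to work with

  1. dfdist contains all combinations and distances
  2. dfnearest contains the result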

import pandas as pd

dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
 'STOP_NAME': ['Royal Park Railway Station (Parkville)',
  'Flemington Bridge Railway Station (North Melbo...',
  'Macaulay Railway Station (North Melbourne)',
  'North Melbourne Railway Station (West Melbourne)',
  'Clifton Hill Railway Station (Clifton Hill)'],
 'LATITUDE': ['-37.781193',
  '-37.788140',
  '-37.794267',
  '-37.807419',
  '-37.788657'],
 'LONGITUDE': ['144.952301',
  '144.939323',
  '144.936166',
  '144.942570',
  '144.995417'],
 'TICKETZONE': ['1', '1', '1', '1', '1'],
 'ROUTEUSSP': ['Upfield',
  'Upfield',
  'Upfield',
  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',
  'Mernda,Hurstbridge'],
 'geometry': ['POINT (144.95230 -37.78119)',
  'POINT (144.93932 -37.78814)',
  'POINT (144.93617 -37.79427)',
  'POINT (144.94257 -37.80742)',
  'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
 'postcode': ['3000', '3002', '3003', '3005', '3006'],
 'suburb': ['MELBOURNE',
  'EAST MELBOURNE',
  'WEST MELBOURNE',
  'WORLD TRADE CENTRE',
  'SOUTHBANK'],
 'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
 'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
 'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})

import geopy.distance
# cartesian product so we get all combinations
dfdist = (dfsub.assign(foo=1).merge(dfstat.assign(foo=1), on="foo")
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestible
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)
# finally pick out stations nearest to suburb
# this can easily be joined back to source data frames as postcode and STOP_ID have been maintained
dfnearest = dfdist.groupby(["postcode","suburb"])\
    .agg({"STOP_ID":"first","STOP_NAME":"first","km":"first"}).reset_index()

print(dfnearest.to_string(index=False))
dfnearest

Output

postcode              suburb STOP_ID                                         STOP_NAME        km
    3000           MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  2.564586
    3002      EAST MELBOURNE   19974       Clifton Hill Railway Station (Clifton Hill)  3.177320
    3003      WEST MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  0.181463
    3005  WORLD TRADE CENTRE   19973  North Melbourne Railway Station (West Melbourne)  1.970909
    3006           SOUTHBANK   19973  North Melbourne Railway Station (West Melbourne)  2.705553

One way to reduce the number of combinations tested

# pick nearer places,  based on lon/lat then all combinations
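# lat/lon are stored as strings in the sample frames, so convert to float before rounding
dfdist = (dfsub.assign(foo=1, latr=dfsub["lat"].astype(float).round(1), lonr=dfsub["lon"].astype(float).round(1))
          .merge(dfstat.assign(foo=1, latr=dfstat["LATITUDE"].astype(float).round(1), lonr=dfstat["LONGITUDE"].astype(float).round(1)), 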
                 on=["foo","latr","lonr"])
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestible
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)
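
If the full cross join does not fit in memory, one option (a sketch, not part of the original answer) is to compute haversine distances with NumPy one chunk of suburbs at a time, so that only chunk_size × n_stations distances exist at once. Unlike the rounding approach, this still compares every suburb with every station, so it cannot miss a nearest station that falls in a neighbouring rounding bucket. The function name nearest_station_chunked, the chunk_size parameter, and the Earth radius constant are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def nearest_station_chunked(sub_lat, sub_lon, st_lat, st_lon, chunk_size=1000):
    """Return (index of nearest station, distance in km) for every suburb.

    Inputs are degrees; numeric strings are accepted and converted to float.
    """
    sub_lat = np.radians(np.asarray(sub_lat, dtype=float))
    sub_lon = np.radians(np.asarray(sub_lon, dtype=float))
    st_lat = np.radians(np.asarray(st_lat, dtype=float))
    st_lon = np.radians(np.asarray(st_lon, dtype=float))

    nearest_idx = np.empty(len(sub_lat), dtype=int)
    nearest_km = np.empty(len(sub_lat), dtype=float)
    for start in range(0, len(sub_lat), chunk_size):
        stop = start + chunk_size
        lat = sub_lat[start:stop, None]  # column vector: (chunk, 1)
        lon = sub_lon[start:stop, None]
        # Haversine formula, broadcast to a (chunk, n_stations) matrix
        a = (np.sin((st_lat - lat) / 2) ** 2
             + np.cos(lat) * np.cos(st_lat) * np.sin((st_lon - lon) / 2) ** 2)
        km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
        idx = km.argmin(axis=1)
        nearest_idx[start:stop] = idx
        nearest_km[start:stop] = km[np.arange(len(idx)), idx]
    return nearest_idx, nearest_km


# Sample data from the answer (kept as strings, as in dfsub / dfstat)
sub_lat = ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258']
sub_lon = ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']
st_lat = ['-37.781193', '-37.788140', '-37.794267', '-37.807419', '-37.788657']
st_lon = ['144.952301', '144.939323', '144.936166', '144.942570', '144.995417']

idx, km = nearest_station_chunked(sub_lat, sub_lon, st_lat, st_lon, chunk_size=2)
print(idx, km.round(3))
```

The returned positions can be used to look up STOP_ID and STOP_NAME, for example with dfstat.iloc[idx]. Note that haversine distances differ slightly from the geodesic values geopy reports.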

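【Comments】:

  • Hi mate, this is great, except that when I use it on properties it eats all my memory :P Is there a more efficient way, or a way to batch it?
  • If you have large datasets, the pure Cartesian product will be what causes the problem... Do you have addresses and stations across multiple cities? If so, I'd suggest adding city to the join key when generating dfdist, i.e. don't generate irrelevant combinations...
  • ^^ Good point. It is only for one city, and specifically for distance to train stations (and buses, etc.). I was thinking of batching or something?
  • Just added an idea to the answer. I'd expect nearer places to have the same rounded lon/lat
【Solution 2】:

I'd like to share an article that I found and tried myself; it worked well when I was at university. You can use the Google Distance Matrix API. Rather than show specific code, I suggest you refer to the article itself: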

https://medium.com/how-to-use-google-distance-matrix-api-in-python/how-to-use-google-distance-matrix-api-in-python-ef9cd895303c

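For a given dataset organised as rows of latitude and longitude coordinates, you can calculate the distance between consecutive rows. This gives you the actual distance between two different points.

【Comments】:

    【Solution 3】:

    Try this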

    import pandas as pd
    def ClosestStop(r):
        # Cartesian distance: square root of (x2-x1)^2 + (y2-y1)^2
        distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
        
        # Stop with minimum Distance from the Suburb
        closestStationId = distances[distances == distances.min()].index.to_list()[0]
        return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]
    
    StationDf = pd.read_excel("StationData.xlsx")
    SuburbDf = pd.read_excel("SuburbData.xlsx")
    
    SuburbDf[['ClosestStopId', 'ClosestStopName']] = SuburbDf.apply(ClosestStop, axis=1)
    print(SuburbDf)
    

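    【Comments】:

    • Cartesian distance is not suitable for distances between GPS coordinates. Have a look at the haversine distance: en.m.wikipedia.org/wiki/Haversine_formula
    • @SoufianeK Yes, Cartesian distance is unsuitable when you are dealing with lat/lon changes of several degrees (i.e. global distances). But the goal here is to find the nearest train station to a suburb, over an area that barely covers one degree of lat/lon. Also, the magnitude and units of the distance don't matter here; what matters is how the distances compare. So Cartesian distance is sufficient for the purpose. Thanks for sharing the link; I work in GIS mapping, so it will help.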
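
To make the trade-off in these comments concrete, here is a small sketch with hypothetical coordinates (not from the thread's data). At Melbourne's latitude a degree of longitude is only about cos(37.8°) ≈ 0.79 times as long as a degree of latitude, so plain Euclidean distance in degrees can rank two stations differently from a distance that scales longitude by cos(latitude). The function names and the Earth radius are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def euclid_deg(p, q):
    """Plain Euclidean distance in degrees between two (lat, lon) pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def equirect_km(p, q):
    """Equirectangular approximation in km: longitude scaled by cos(mean latitude)."""
    mean_lat = math.radians((p[0] + q[0]) / 2)
    dy = math.radians(q[0] - p[0])
    dx = math.radians(q[1] - p[1]) * math.cos(mean_lat)
    return EARTH_RADIUS_KM * math.hypot(dx, dy)


suburb = (-37.80, 145.00)  # hypothetical suburb
stations = {
    "A": (-37.80, 145.10),  # hypothetical station, 0.10 degrees due east
    "B": (-37.71, 145.00),  # hypothetical station, 0.09 degrees due north
}

nearest_by_degrees = min(stations, key=lambda k: euclid_deg(suburb, stations[k]))
nearest_by_km = min(stations, key=lambda k: equirect_km(suburb, stations[k]))
print(nearest_by_degrees, nearest_by_km)
```

With these points the degree-based ranking picks B, while the km-based ranking picks A (roughly 8.8 km against 10.0 km). Whether this matters in practice depends on how close the competing stations are; scaling longitude by cos(latitude) is a cheap way to keep the simple approach while removing most of the distortion.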
    【Solution 4】:

    You can use sklearn.neighbors.NearestNeighbors with the haversine distance

    import pandas as pd
    from sklearn.neighbors import NearestNeighbors
    dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
                           'STOP_NAME': ['Royal Park Railway Station (Parkville)',  'Flemington Bridge Railway Station (North Melbo...',  'Macaulay Railway Station (North Melbourne)',  'North Melbourne Railway Station (West Melbourne)',  'Clifton Hill Railway Station (Clifton Hill)'],
                           'LATITUDE': ['-37.781193', '-37.788140',  '-37.794267',  '-37.807419',  '-37.788657'],
                           'LONGITUDE': ['144.952301', '144.939323', '144.936166',  '144.942570',  '144.995417'],
                           'TICKETZONE': ['1', '1', '1', '1', '1'], 
                           'ROUTEUSSP': ['Upfield',  'Upfield',  'Upfield',  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',  'Mernda,Hurstbridge'],
                           'geometry': ['POINT (144.95230 -37.78119)',  'POINT (144.93932 -37.78814)',  'POINT (144.93617 -37.79427)',  'POINT (144.94257 -37.80742)',  'POINT (144.99542 -37.78866)']})
    dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
                          'postcode': ['3000', '3002', '3003', '3005', '3006'],
                          'suburb': ['MELBOURNE',  'EAST MELBOURNE',  'WEST MELBOURNE',  'WORLD TRADE CENTRE',  'SOUTHBANK'],
                          'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
                          'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
                          'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})
    

    Let's first find the point in the dataframe closest to some arbitrary point, say -37.814563, 144.970267

    NN = NearestNeighbors(n_neighbors=1, metric='haversine')
    NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
    NN.kneighbors([[-37.814563, 144.970267]])
    

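    The output is (array([[2.55952637]]), array([[3]])), i.e. the distance to and the index of the nearest point in the dataframe. The haversine distance in sklearn is in radians (the metric also expects its inputs as [latitude, longitude] in radians). If you want the distance in km, you can use haversine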

    from haversine import haversine
    NN = NearestNeighbors(n_neighbors=1, metric=haversine)
    NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
    NN.kneighbors([[-37.814563, 144.970267]])
    

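    The output (array([[2.55952637]]), array([[3]])) has the distance in kilometres.

    Now you can apply this to all points in the dataframe and use the indices to get the nearest stations.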

    # Query once; kneighbors returns (distances, indices), each of shape (n_rows, 1)
    distances, indices = NN.kneighbors(dfsub[['lat', 'lon']])
    # Assign by position (plain arrays) so a non-default dfsub index cannot misalign the rows
    dfsub['closest_station'] = dfstat['STOP_NAME'].to_numpy()[indices[:, 0]]
    dfsub['closest_station_distances'] = distances[:, 0]
    print(dfsub)
         id  postcode              suburb  state         lat         lon                                   closest_station  closest_station_distances
    0  4901      3000           MELBOURNE    VIC  -37.814563  144.970267  North Melbourne Railway Station (West Melbourne)                   2.559526
    1  4902      3002      EAST MELBOURNE    VIC  -37.816640  144.987811       Clifton Hill Railway Station (Clifton Hill)                   3.182521
    2  4903      3003      WEST MELBOURNE    VIC  -37.806255  144.941123  North Melbourne Railway Station (West Melbourne)                   0.181419
    3  4904      3005  WORLD TRADE CENTRE    VIC  -37.822262  144.954856  North Melbourne Railway Station (West Melbourne)                   1.972010
    4  4905      3006           SOUTHBANK    VIC  -37.823258  144.965926  North Melbourne Railway Station (West Melbourne)                   2.703926
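
An alternative sketch that keeps sklearn's built-in metric='haversine' (a swapped-in variation, not the answer's code): the built-in metric works in radians, so the coordinates are converted with np.radians before fitting and querying, and the returned distances are multiplied by an assumed mean Earth radius of 6371.0088 km. The names stations, suburbs and closest_station_km are illustrative, and the suburbs frame is given a non-default index on purpose to show positional assignment.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

stations = pd.DataFrame({
    "STOP_NAME": [
        "Royal Park Railway Station (Parkville)",
        "Flemington Bridge Railway Station (North Melbo...",
        "Macaulay Railway Station (North Melbourne)",
        "North Melbourne Railway Station (West Melbourne)",
        "Clifton Hill Railway Station (Clifton Hill)",
    ],
    "LATITUDE": [-37.781193, -37.788140, -37.794267, -37.807419, -37.788657],
    "LONGITUDE": [144.952301, 144.939323, 144.936166, 144.942570, 144.995417],
})
suburbs = pd.DataFrame(
    {
        "suburb": ["MELBOURNE", "EAST MELBOURNE", "WEST MELBOURNE",
                   "WORLD TRADE CENTRE", "SOUTHBANK"],
        "lat": [-37.814563, -37.816640, -37.806255, -37.822262, -37.823258],
        "lon": [144.970267, 144.987811, 144.941123, 144.954856, 144.965926],
    },
    index=[4901, 4902, 4903, 4904, 4905],  # non-default index, as in the question
)

# The haversine metric expects [latitude, longitude] in radians
nn = NearestNeighbors(n_neighbors=1, algorithm="ball_tree", metric="haversine")
nn.fit(np.radians(stations[["LATITUDE", "LONGITUDE"]].to_numpy(dtype=float)))
dist_rad, idx = nn.kneighbors(np.radians(suburbs[["lat", "lon"]].to_numpy(dtype=float)))

# Assign plain arrays so rows are matched by position, not by index label
suburbs["closest_station"] = stations["STOP_NAME"].to_numpy()[idx[:, 0]]
suburbs["closest_station_km"] = dist_rad[:, 0] * EARTH_RADIUS_KM
print(suburbs)
```

Because the tree is queried rather than a full distance matrix being built, this variant should scale better than the cross join for larger tables, though that has not been measured here.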
    

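    【Comments】:

    • I just added a distance column and some imports, and fixed a missing parenthesis.
    • I copied the code; is it working correctly on my example? The distances look to be off by an order of magnitude?
    • In general, the haversine distance is the great-circle distance. You need to convert it back to kilometres on the Earth. I explained this here: stackoverflow.com/questions/63121268/… The example there is query_radius. sklearn also supports nearest neighbours. Its implementation is very similar to the NN used below (same speed). Don't use methods that need to compute the full distance matrix.
    • Try defining NN as from haversine import haversine; NN = NearestNeighbors(n_neighbors=1, metric=haversine). Note that there are no quotes in metric=haversine. In fact, metric='haversine' uses the haversine distance from sklearn, which is expressed in radians.
    • @SoufianeK Unfortunately that gives me the wrong distances. I like the idea of using this approach because it is more accurate, but unfortunately it looks a bit hard to implement :/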
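
One possible reason for the mismatched station names in the question's output is pandas index alignment. This is an inference from the pasted output (the suburb rows are labelled 1, 2, 4, 5, 6 while the neighbour results are numbered 0, 1, 2, ...), not something confirmed in the thread. Assigning a Series aligns on index labels, so a result built with reset_index(drop=True) can land on the wrong rows. The toy frame below is hypothetical.

```python
import pandas as pd

# Suburbs with a non-default index, mimicking the labels in the question's output
suburbs = pd.DataFrame(
    {"suburb": ["aberfeldie", "airport west", "albert park"]},
    index=[1, 2, 4],
)

# Nearest-station results in positional order, carrying a fresh 0..2 index
nearest = pd.Series(["Essendon", "Glenroy", "Southern Cross"])

# Assigning a Series aligns on labels: label 1 receives the result for position 1
suburbs["aligned_by_label"] = nearest

# Assigning a plain array keeps positional order
suburbs["by_position"] = nearest.to_numpy()

print(suburbs)
```

In this toy data the suburb labelled 1 receives the station that belongs to the second row, and the suburb labelled 4 receives a missing value. That is the same pattern as in the question's output, where closest_station on each row equals ClosestStopName on a later row.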