【Question title】: Pandas: Running a calculation on rows of a data table based on multiple columns and storing the output in a new column
【Posted】: 2021-10-11 14:55:15
【Question】:

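I am trying to calculate the distance between two locations, and I have the latitude and longitude of both. My CSV has 4 columns (lat1, lon1, lat2, lon2). How can I apply the code below so that it creates a fifth column named "distance" holding the distance calculated for each row?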

import math
from math import sin, cos, sqrt, atan2, radians

# approximate radius of earth in km
R = 6373.0

#Test
lat1 = radians(25.2296756)
lon1 = radians(36.0122287)
lat2 = radians(51.406374)
lon2 = radians(20.9251681)

dlon = lon2 - lon1
dlat = lat2 - lat1

a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))

distance = R * c

print("Result:", distance)
print("Should be:", 3181.11, "km")

DataFrame:

df = pd.DataFrame({'Normalised': {(0, 'London,', 'United', 'Kingdom'): '-',
  (1, 'Johannesburg,', 'South', 'Africa'): '-',
  (2, 'London,', 'United', 'Kingdom'): '-',
  (3, 'Johannesburg,', 'South', 'Africa'): '-',
  (4, 'London,', 'United', 'Kingdom'): '-'},
 'City': {(0, 'London,', 'United', 'Kingdom'): 'New',
  (1, 'Johannesburg,', 'South', 'Africa'): 'London,',
  (2, 'London,', 'United', 'Kingdom'): 'New',
  (3, 'Johannesburg,', 'South', 'Africa'): 'London,',
  (4, 'London,', 'United', 'Kingdom'): 'Singapore,'},
 'Pair': {(0, 'London,', 'United', 'Kingdom'): 'York,',
  (1, 'Johannesburg,', 'South', 'Africa'): 'United',
  (2, 'London,', 'United', 'Kingdom'): 'York,',
  (3, 'Johannesburg,', 'South', 'Africa'): 'United',
  (4, 'London,', 'United', 'Kingdom'): 'Singapore'},
 'Departure': {(0, 'London,', 'United', 'Kingdom'): 'United',
  (1, 'Johannesburg,', 'South', 'Africa'): 'Ki...',
  (2, 'London,', 'United', 'Kingdom'): 'United',
  (3, 'Johannesburg,', 'South', 'Africa'): 'Ki...',
  (4, 'London,', 'United', 'Kingdom'): 'SIN'},
 'Code': {(0, 'London,', 'United', 'Kingdom'): 'Stat.',
  (1, 'Johannesburg,', 'South', 'Africa'): 'JNB',
  (2, 'London,', 'United', 'Kingdom'): 'Stat',
  (3, 'Johannesburg,', 'South', 'Africa'): 'JNB',
  (4, 'London,', 'United', 'Kingdom'): 'LHR'},
 'Arrival': {(0, 'London,', 'United', 'Kingdom'): 'LHR',
  (1, 'Johannesburg,', 'South', 'Africa'): 'LHR',
  (2, 'London,', 'United', 'Kingdom'): 'LHR',
  (3, 'Johannesburg,', 'South', 'Africa'): 'LHR',
  (4, 'London,', 'United', 'Kingdom'): '1.3'},
 'Code.1': {(0, 'London,', 'United', 'Kingdom'): 'JFK',
  (1, 'Johannesburg,', 'South', 'Africa'): '-26.1',
  (2, 'London,', 'United', 'Kingdom'): 'JFK',
  (3, 'Johannesburg,', 'South', 'Africa'): '-26.1',
  (4, 'London,', 'United', 'Kingdom'): '103.98'},
 'Departure_lat': {(0, 'London,', 'United', 'Kingdom'): 51.5,
  (1, 'Johannesburg,', 'South', 'Africa'): 28.23,
  (2, 'London,', 'United', 'Kingdom'): 51.5,
  (3, 'Johannesburg,', 'South', 'Africa'): 28.23,
  (4, 'London,', 'United', 'Kingdom'): 51.47},
 'Departure_lon': {(0, 'London,', 'United', 'Kingdom'): -0.45,
  (1, 'Johannesburg,', 'South', 'Africa'): 51.47,
  (2, 'London,', 'United', 'Kingdom'): -0.45,
  (3, 'Johannesburg,', 'South', 'Africa'): 51.47,
  (4, 'London,', 'United', 'Kingdom'): -0.45},
 'Arrival_lat': {(0, 'London,', 'United', 'Kingdom'): 40.64,
  (1, 'Johannesburg,', 'South', 'Africa'): -0.45,
  (2, 'London,', 'United', 'Kingdom'): 40.64,
  (3, 'Johannesburg,', 'South', 'Africa'): -0.45,
  (4, 'London,', 'United', 'Kingdom'): np.nan},
 'Arrival_lon': {(0, 'London,', 'United', 'Kingdom'): -73.79,
  (1, 'Johannesburg,', 'South', 'Africa'): np.nan,
  (2, 'London,', 'United', 'Kingdom'): -73.79,
  (3, 'Johannesburg,', 'South', 'Africa'): np.nan,
  (4, 'London,', 'United', 'Kingdom'): np.nan}})

【Comments】:

Tags: python pandas dataframe distance haversine


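【Solution 1】:

You can define a custom function for the distance calculation, then use .apply() to call it on each row and get that row's distance.

1. Define a custom function for the distance calculation, as follows: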

import math
from math import sin, cos, sqrt, atan2, radians

def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

2. Use .apply() with axis=1 to call the function on each row and store the result in a new column, as follows:

df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)

Demo

Input DataFrame

        City  Departure_lat  Departure_lon  Arrival_lat  Arrival_lon
0  CityName1      25.229676      36.012229    51.406374    20.925168

Output

        City  Departure_lat  Departure_lon  Arrival_lat  Arrival_lon    Distance
0  CityName1      25.229676      36.012229    51.406374    20.925168  3181.11039
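
The sample frame in the question has missing coordinates (np.nan) in some rows. As a minimal sketch, assuming the get_distance function from step 1 and a made-up two-row frame (the second row is illustrative, not taken from a real route), the math functions pass NaN through rather than raising, so .apply() yields NaN for incomplete rows:

```python
from math import sin, cos, sqrt, atan2, radians

import numpy as np
import pandas as pd


def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    R = 6373.0  # approximate radius of earth in km
    lat1, lon1 = radians(in_lat1), radians(in_lon1)
    lat2, lon2 = radians(in_lat2), radians(in_lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c


# Illustrative data: the second row has no arrival coordinates.
df = pd.DataFrame({
    'Departure_lat': [25.2296756, 51.47],
    'Departure_lon': [36.0122287, -0.45],
    'Arrival_lat': [51.406374, np.nan],
    'Arrival_lon': [20.9251681, np.nan],
})

df['Distance'] = df.apply(
    lambda x: get_distance(x['Departure_lat'], x['Departure_lon'],
                           x['Arrival_lat'], x['Arrival_lon']),
    axis=1,
)

# Optionally keep only the rows where a distance could be computed.
complete = df.dropna(subset=['Distance'])
```

Whether to drop such rows or keep them as NaN depends on what the distance column is used for afterwards.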

【Comments】:

【Solution 2】:

You can store dlon, dlat, a and c as temporary columns and calculate from there (or put it all on one hard-to-read line).

Something like:

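import numpy as np

R = 6373.0  # approximate radius of earth in km

# The math functions do not accept a whole column, so use the NumPy versions,
# and convert degrees to radians before applying the formula.
lat1 = np.radians(df['Departure_lat'])
lat2 = np.radians(df['Arrival_lat'])
df['dlon'] = np.radians(df['Arrival_lon'] - df['Departure_lon'])
df['dlat'] = lat2 - lat1

df['a'] = np.sin(df['dlat'] / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(df['dlon'] / 2)**2
df['c'] = 2 * np.arctan2(np.sqrt(df['a']), np.sqrt(1 - df['a']))

df['distance'] = R * df['c']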

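Then, if you want, you can .drop() all those extra columns, but this should create df['distance'] as a new column calculated for each row.

I wouldn't be surprised if there is a typo in that code, but hopefully you get the idea. Each df[xxx] = line creates a new column.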
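
The cleanup step can be sketched as follows. This is a minimal example under a few assumptions: it uses the NumPy functions (which accept whole columns), a made-up one-row frame built from the question's test coordinates, and two extra helper columns (lat1, lat2) that are additions for illustration.

```python
import numpy as np
import pandas as pd

R = 6373.0  # approximate radius of earth in km

# Illustrative single-row frame using the question's test coordinates.
df = pd.DataFrame({
    'Departure_lat': [25.2296756], 'Departure_lon': [36.0122287],
    'Arrival_lat': [51.406374], 'Arrival_lon': [20.9251681],
})

# Temporary columns, all in radians.
df['lat1'] = np.radians(df['Departure_lat'])
df['lat2'] = np.radians(df['Arrival_lat'])
df['dlon'] = np.radians(df['Arrival_lon'] - df['Departure_lon'])
df['dlat'] = df['lat2'] - df['lat1']
df['a'] = (np.sin(df['dlat'] / 2)**2
           + np.cos(df['lat1']) * np.cos(df['lat2']) * np.sin(df['dlon'] / 2)**2)
df['c'] = 2 * np.arctan2(np.sqrt(df['a']), np.sqrt(1 - df['a']))
df['distance'] = R * df['c']

# Remove the intermediate columns once the result is stored.
df = df.drop(columns=['lat1', 'lat2', 'dlon', 'dlat', 'a', 'c'])
```

After the drop, only the four original columns and distance remain.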

【Comments】:

    【Solution 3】:

    You didn't provide data, so I made some up based on your question; just use the numpy versions of these functions on your columns.

    import pandas as pd
    import numpy as np
    
    row = pd.Series({
        "lat1": 25.2296756,
        "lon1": 36.0122287,
        "lat2": 51.406374,
        "lon2": 20.9251681
    })
    df = pd.concat([row]*5, axis=1).T.apply(np.radians)
    
    df["dlon"] = df.lon2 - df.lon1
    df["dlat"] = df.lat2 - df.lat1
    
    R = 6373
    a = np.sin(df.dlat / 2)**2 + np.cos(df.lat1) * np.cos(df.lat2) * np.sin(df.dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    df["distance"] = R*c
    

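    The resulting DataFrame looks like this (the coordinate columns are in radians, since np.radians was applied to the whole frame):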

           lat1      lon1     lat2      lon2      dlon     dlat    distance
    0  0.440341  0.628532  0.89721  0.365213 -0.263319  0.45687  3181.11039
    1  0.440341  0.628532  0.89721  0.365213 -0.263319  0.45687  3181.11039
    2  0.440341  0.628532  0.89721  0.365213 -0.263319  0.45687  3181.11039
    3  0.440341  0.628532  0.89721  0.365213 -0.263319  0.45687  3181.11039
    4  0.440341  0.628532  0.89721  0.365213 -0.263319  0.45687  3181.11039
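
The same NumPy approach can be wrapped in a reusable function so that the original columns stay in degrees. This is a sketch only: the name haversine_km and its radius parameter are hypothetical, and the frame reuses the question's column names with two illustrative rows.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius=6373.0):
    """Great-circle distance in km; inputs are in degrees (scalars or Series)."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2)**2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2)
    return radius * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


df = pd.DataFrame({
    'Departure_lat': [25.2296756, 51.5],
    'Departure_lon': [36.0122287, -0.45],
    'Arrival_lat': [51.406374, 40.64],
    'Arrival_lon': [20.9251681, -73.79],
})

# One vectorized call computes the distance for every row at once.
df['distance'] = haversine_km(df['Departure_lat'], df['Departure_lon'],
                              df['Arrival_lat'], df['Arrival_lon'])
```

Because the conversion to radians happens inside the function, no temporary columns are written to the frame.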
    

    【Comments】:

      【Solution 4】:

      You can put your calculation code in a function:

      from math import sin, cos, sqrt, atan2, radians
      
      def calculate_distance(lat1,lon1,lat2,lon2):
        # approximate radius of earth in km
        R = 6373.0
      
        lat1 = radians(lat1)
        lon1 = radians(lon1)
        lat2 = radians(lat2)
        lon2 = radians(lon2)
      
        dlon = lon2 - lon1
        dlat = lat2 - lat1
      
        a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
      
        distance = R * c
      
        return distance
      

      Then apply it to each row with a list comprehension:

      df['distance'] = [calculate_distance(row.lat1, row.lon1, row.lat2, row.lon2) for row in df.itertuples() ]
      

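      【Comments】:

        【Solution 5】:

        Depending on the format of your data csv file, something like the following could be used.

        Essentially, you need to import the data into Python with the csv library, turn the calculation into a callable function, and then call that function on every row of the data file.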

        import math
        import csv # Added import for importing csv into python.
        from math import sin, cos, sqrt, atan2, radians
        
        # Import the data from the csv file.
        with open('data.csv', newline='') as csvfile:
            data = list(csv.reader(csvfile))
            
        # Approximate radius of earth in km.
        R = 6373.0
        
        # Create a distance calculation function.
        def calculate_distance(lat1_d, lon1_d, lat2_d, lon2_d):
            
            # Convert from degrees to radians.
            lat1 = radians(lat1_d)
            lon1 = radians(lon1_d)
            lat2 = radians(lat2_d)
            lon2 = radians(lon2_d)
        
            dlon = lon2 - lon1
            dlat = lat2 - lat1
        
            a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
            c = 2 * atan2(sqrt(a), sqrt(1 - a))
            
            distance = R * c
            return distance
            
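        # Use list comprehension to run function on every data row.
        # csv.reader yields strings, so convert each field to float first
        # (this assumes the file has no header row).
        distances = [calculate_distance(float(row[0]), float(row[1]), float(row[2]), float(row[3])) for row in data]
        
        # Append distance column to original array to create output.
        output = [row + [distances[index]] for index, row in enumerate(data)]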
        

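        Note that row[0], row[1], row[2], row[3] refer to the order of the columns in the data array/csv file. They may need to be reordered to match the input order of the function declaration, i.e.: lat1_d, lon1_d, lat2_d, lon2_d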

        # Import the data from the csv file.
        with open('data.csv', newline='') as csvfile:
            data = list(csv.reader(csvfile))
        

        You will also need to adjust these import parameters to match the format and name of your csv file.
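
If the file has a header row, one possible adjustment is csv.DictReader, which looks columns up by name instead of position. The sketch below is only an illustration: the header names (lat1, lon1, lat2, lon2) are assumed from the question, and an in-memory io.StringIO stands in for the real file.

```python
import csv
import io
from math import sin, cos, sqrt, atan2, radians

R = 6373.0  # approximate radius of earth in km


def calculate_distance(lat1_d, lon1_d, lat2_d, lon2_d):
    lat1, lon1 = radians(lat1_d), radians(lon1_d)
    lat2, lon2 = radians(lat2_d), radians(lon2_d)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))


# Stand-in for open('data.csv', newline=''); the header names are assumed.
csvfile = io.StringIO(
    "lat1,lon1,lat2,lon2\n"
    "25.2296756,36.0122287,51.406374,20.9251681\n"
)

rows = list(csv.DictReader(csvfile))  # each row is a dict keyed by header name

for row in rows:
    # csv yields strings, so convert each field to float before the maths.
    row['distance'] = calculate_distance(
        float(row['lat1']), float(row['lon1']),
        float(row['lat2']), float(row['lon2']),
    )
```

Looking fields up by name avoids the column reordering issue described above, at the cost of requiring a header row.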

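        【Comments】:

        • This also works like a charm.. God bless you!
        • @RedRum Glad to hear it, it took me a while to put together! If you think this is the most complete answer, please consider accepting it by clicking the ✓, thanks.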