【Question title】: What is the correct way to get the first row of a dataframe?
【Posted】: 2021-02-07 03:20:19
【Question description】:

The data in test.csv looks like this:

device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,

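I use this code to select the rows of the dataframe where speed is 0, then group the dataframe by latitude, longitude, and date (year, month, day).

After grouping, I take the first and the last upload_time_add_8hour of each group. If the two differ by more than 5 minutes, I take the first row of that group, and finally I save those rows to a CSV file.

I don't think my code is concise enough.

I use df_first_row = sub_df.iloc[0:1,:] to get the first row of the dataframe, and I use upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] to get the first and last elements of a specific column.

Is there a more suitable way?

My code: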

import pandas as pd

device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0]  # keep only rows where speed is 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
    sub_df = pd.DataFrame(gb.get_group(i))
    sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
    count_row = sub_df.shape[0]  # number of rows in this group
    if count_row > 1:  # each group must have more than 1 row
        upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]  # first upload_time_add_8hour
        upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]  # last upload_time_add_8hour
        minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
        if minutes_diff >= 5:  # span of at least 5 minutes: keep the first row of this group
            df_first_row  = sub_df.iloc[0:1,:]
            sub_dataframe_list.append(df_first_row)

if sub_dataframe_list:
    result = pd.concat(sub_dataframe_list,ignore_index=True)
    result = result.sort_values(by=['upload_time'])
    result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True,encoding='utf-8')

【Question comments】:

    Tags: python pandas


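    【Solution 1】:

    To get the first and last elements of a column, what you are doing is already the most efficient and correct approach. If you are interested in this topic, I recommend this other Stack Overflow answer: https://stackoverflow.com/a/25254087/8294752

    To get the first row, I personally prefer DataFrame.head(1), so for your code it would be: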

    df_first_row = sub_df.head(1)

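    I have not looked into how head() is implemented in pandas or what its performance implications are, but I think it improves readability and avoids some potential confusion with indexing.

    In other examples you may also see sub_df.iloc[0], but that returns a pandas.Series indexed by the DataFrame's column names. sub_df.head(1) returns a 1-row DataFrame, the same result as sub_df.iloc[0:1,:].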
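The difference between the three spellings can be checked directly. The snippet below is a minimal sketch on a small made-up frame (the values and the non-default index are assumptions, standing in for one `sub_df` group from the question):

```python
import pandas as pd

# Hypothetical stand-in for one group (sub_df) from the question.
sub_df = pd.DataFrame(
    {"device_id": [1101, 1101, 1101], "speed": [0.0, 0.0, 2.1]},
    index=[5, 6, 7],
)

first_as_frame = sub_df.head(1)       # 1-row DataFrame
first_as_slice = sub_df.iloc[0:1, :]  # 1-row DataFrame, same content
first_as_series = sub_df.iloc[0]      # Series indexed by the column names

print(type(first_as_frame).__name__)
print(type(first_as_series).__name__)
print(first_as_frame.equals(first_as_slice))
```

Because `head(1)` and `iloc[0:1, :]` both keep the DataFrame shape, either can be passed to `pd.concat` as in the question's loop; `iloc[0]` would need to be turned back into a frame first.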

    【Comments】:

      【Solution 2】:

      One way to do this is groupby().agg or df.agg.

      If you need it per device, you can use

      #sub_df.groupby('device_id')['upload_time_add_8hour'].agg(['first','last'])
      
      
      sub_df.groupby('device_id')['upload_time_add_8hour'].agg([('upload_time_add_8hour_first','first'),('upload_time_add_8hour_last','last')]).reset_index()
      
      
         device_id  upload_time_add_8hour_first  upload_time_add_8hour_last
      0       1101               10/1/2020 0:03             10/7/2020 13:04
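The same first/last idea can also stand in for the explicit loop in the question. The following is only a sketch under assumptions: the sample frame is made up (fewer columns than test.csv), the helper column name `day` is hypothetical, and the `>= 5` threshold mirrors the check in the question's code. It uses `transform` to broadcast each group's first and last timestamp back onto its rows, then `GroupBy.head(1)` to keep the first row of every qualifying group.

```python
import pandas as pd

# Hypothetical sample in the shape of the question's data (fewer columns).
df = pd.DataFrame(
    {
        "latitude": [46.7242, 46.7242, 46.7245, 46.7245, 46.7236],
        "longitude": [131.140233, 131.140233, 131.14015, 131.14015, 131.1396],
        "speed": [0.0, 0.0, 2.1, 0.0, 0.0],
        "upload_time_add_8hour": pd.to_datetime(
            [
                "2020-10-01 00:03:41",
                "2020-10-01 00:08:41",
                "2020-10-06 19:11:10",
                "2020-10-07 00:46:10",
                "2020-10-02 16:19:41",
            ]
        ),
    }
)

time_col = "upload_time_add_8hour"
keys = ["latitude", "longitude", "day"]  # "day" is a helper column added below

# Keep rows with speed 0, add the day key, and sort so "first"/"last" follow time order.
stopped = (
    df[df["speed"] == 0]
    .assign(day=lambda d: d[time_col].dt.strftime("%Y%m%d"))
    .sort_values(time_col)
)

# transform() returns one value per row: the group's last minus first timestamp.
grouped_times = stopped.groupby(keys)[time_col]
span = grouped_times.transform("last") - grouped_times.transform("first")

# Keep groups spanning at least 5 minutes, then take the first row of each group.
result = stopped[span >= pd.Timedelta(minutes=5)].groupby(keys).head(1)
print(result)
```

A group with a single row has a span of zero, so it is dropped by the threshold without a separate row-count check.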
      

      If you do not need it per device, you can try

      sub_df['upload_time_add_8hour'].agg({'upload_time_add_8hour_first': lambda x: x.head(1),'upload_time_add_8hour_last': lambda x: x.tail(1)})
      
      upload_time_add_8hour_first  0      10/1/2020 0:03
      upload_time_add_8hour_last   19    10/7/2020 13:04
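When no grouping is involved, plain positional indexing gives the same two values without an aggregation call. This is a small sketch on a made-up Series (the timestamps are illustrative, not the full column from test.csv):

```python
import pandas as pd

# Hypothetical column of timestamps, already sorted.
times = pd.Series(
    pd.to_datetime(
        ["2020-10-01 00:03:41", "2020-10-02 16:19:41", "2020-10-07 13:04:27"]
    ),
    name="upload_time_add_8hour",
)

first_time = times.iloc[0]    # scalar Timestamp
last_time = times.iloc[-1]    # scalar Timestamp
both = times.iloc[[0, -1]]    # 2-element Series that keeps the original index labels

minutes_diff = (last_time - first_time).total_seconds() / 60.0
print(both)
print(minutes_diff)
```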
      

      【Comments】:
