【Question title】: Unnecessary duplication is created while creating new dataframe that takes values from another by iterating over column values
【Posted】: 2026-02-11 07:30:01
【Question】:

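I am trying to build a new dataframe by iterating over the unique values (contract numbers) of one column and adding each contract's values as a new column. For a small number of iterations the script works perfectly. With more than 1000 unique values, however, it creates duplicated rows in the resulting dataframe, which slows the loop down and makes processing take unnecessarily long. How can I make this more efficient?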

https://imgur.com/3obXPne - original dataframe

https://imgur.com/mEA8g6Z - unnecessary duplicates in the new dataframe

https://imgur.com/3i5gMoJ - unnecessary duplicates in the new dataframe

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164','123','123'],
                   ["AB1111",'2018-08-15 00:03:00','564','453','126'],
                   ["AB1111",'2018-08-15 00:10:00','364','1231','1223'],
                   ["AB1111",'2018-08-15 00:01:00','564','575','1523'],
                   ["CD1111",'2018-08-16 00:12:00','514','341','1213'],
                   ["CD1111",'2018-08-15 00:02:00','564','1234','123'],
                   ["CD1111",'2018-08-16 00:05:00','564','341','124'],
                   ["CD1111",'2018-08-16 00:03:00','64','341','123'],
                   ["EF1111",'2018-08-15 00:00:00','534','341','121'],
                   ["EF1111",'2018-08-17 00:01:00','564','341','163'],
                   ["EF1111",'2018-08-15 00:09:00','524','341','129']],
                   columns = ['contract', 'datetime',
                              'real_cons','solar_gen','battery_charge'])


# converting datetime column datatype to "datetime"
df['datetime'] = pd.to_datetime(df['datetime']) 

#aggregation dataframe (new dataframe)
df_agg1 = pd.DataFrame()

for contract in df['contract'].unique()[:1500]:
    print(contract)
    df_contract = df.copy()[df['contract']==contract]    # selecting each full dataframe from the main DF
    df_contract.set_index('datetime', inplace=True)      # set "datetime" column as an index
    df_contract.sort_index(inplace=True)                 # sort index
    df_contract = df_contract.loc['2018-8-15']           # select timeframe       
    # creating GB61074_cons column, which will be added to df_agg, from df_contract 'real_cons' column
    df_contract[f'{contract}_con'] = df_contract['real_cons']   

    if df_agg1.empty:
        df_agg1 = df_contract[[f'{contract}_con']]        # first column 
    else:
        df_agg1 = df_agg1.join(df_contract[f'{contract}_con'])     # subsequent columns 

df_agg1

How can I create the new dataframe without producing these unnecessary duplicates? What causes them to be created?

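【Comments】:

  • Can you provide a suitable example that can be run directly?
  • I can't see any duplicates! Can you be specific about what you mean by duplicates?
  • @mgruber, hi, if you look at the second and third images, you will see the duplicates created in the new dataframe.
  • @AmarboldAltangerel Did my answer (see below) help you?

Tags: python pandas loops dataframe iteration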


【Solution 1】:

Here is a way to get the same result without a for loop. For readability, I split it over several lines and added explanations.

import pandas as pd

df = pd.DataFrame([["AB1111",'2018-08-15 00:00:00','164'],
                   ["AB1111",'2018-08-15 00:03:00','564'],
                   ["AB1111",'2018-08-15 00:10:00','364'],
                   ["AB1111",'2018-08-15 00:01:00','564'],
                   ["CD1111",'2018-08-16 00:12:00','514'],
                   ["CD1111",'2018-08-15 00:02:00','564'],
                   ["CD1111",'2018-08-16 00:05:00','564'],
                   ["CD1111",'2018-08-16 00:03:00','64'],
                   ["EF1111",'2018-08-15 00:00:00','534'],
                   ["EF1111",'2018-08-17 00:01:00','564'],
                   ["EF1111",'2018-08-15 00:09:00','524']],
                   columns = ['contract', 'datetime','real_cons'])


df = df.set_index(['datetime','contract']).unstack().add_suffix('_con') # one column per contract, suffixed with '_con'
df = df.droplevel(level=0,axis=1) # drop the top column level (the value-column name)
df = pd.DataFrame(df.to_records()) # workaround: rebuild a flat frame with 'datetime' as a regular column
df['datetime'] = pd.to_datetime(df['datetime']) # convert the datetime column to datetime dtype
df = df.set_index('datetime').loc['2018-08-15'] # keep only the rows from 2018-08-15

print(df.reset_index())

Result:

             datetime AB1111_con CD1111_con EF1111_con
0 2018-08-15 00:00:00        164        NaN        534
1 2018-08-15 00:01:00        564        NaN        NaN
2 2018-08-15 00:02:00        NaN        564        NaN
3 2018-08-15 00:03:00        564        NaN        NaN
4 2018-08-15 00:09:00        NaN        NaN        524
5 2018-08-15 00:10:00        364        NaN        NaN
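
As for what causes the duplicates: the images are not reproduced here, so this is only a plausible explanation, assuming the real data contains repeated timestamps within a contract (the sample data does not). `DataFrame.join` matches rows on the index, and when the same index value occurs several times on both sides, every left match is paired with every right match, so the row count grows with each join in the loop. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data: the timestamp 00:00:00 appears twice on each side.
ts = pd.to_datetime(["2018-08-15 00:00:00",
                     "2018-08-15 00:00:00",
                     "2018-08-15 00:01:00"])
left = pd.DataFrame({"AB1111_con": [1, 2, 3]}, index=ts)
right = pd.DataFrame({"CD1111_con": [10, 20, 30]}, index=ts)

# The two 00:00:00 rows on the left each match both 00:00:00 rows on the right.
joined = left.join(right)
print(len(left), len(joined))  # 3 5

# Dropping repeated index values before joining keeps the row count stable.
left_unique = left[~left.index.duplicated(keep="first")]
right_unique = right[~right.index.duplicated(keep="first")]
joined_unique = left_unique.join(right_unique)
print(len(joined_unique))  # 2
```

Whether keeping the first row is right depends on the data; aggregating the repeated timestamps (for example with a mean) may be more appropriate.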

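【Comments】:

  • Sorry, I edited the dataframe slightly. It has multiple columns. How would this work when I only want to select the 'real_cons' column and add it to the new dataframe? When I try your code, it pulls values from columns it should not take data from.
  • Which columns should appear as headers? Which column should supply the values?
  • Sorry for the trouble. I found a different approach using the pivot_table method. Thanks.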
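
The pivot_table code mentioned in the last comment was never posted. The following is only a sketch of what such an approach might look like, using a shortened version of the multi-column sample from the question. The choice of `aggfunc="first"` and the conversion to numeric values are assumptions, not something the asker confirmed.

```python
import pandas as pd

df = pd.DataFrame(
    [["AB1111", "2018-08-15 00:00:00", "164", "123", "123"],
     ["AB1111", "2018-08-15 00:03:00", "564", "453", "126"],
     ["CD1111", "2018-08-15 00:02:00", "564", "1234", "123"],
     ["CD1111", "2018-08-16 00:05:00", "564", "341", "124"],
     ["EF1111", "2018-08-15 00:00:00", "534", "341", "121"]],
    columns=["contract", "datetime", "real_cons", "solar_gen", "battery_charge"],
)
df["datetime"] = pd.to_datetime(df["datetime"])
df["real_cons"] = pd.to_numeric(df["real_cons"])  # the sample stores numbers as strings

# Keep only the rows from 2018-08-15.
day = df[df["datetime"].dt.normalize() == pd.Timestamp("2018-08-15")]

# One row per timestamp, one column per contract, values taken only from
# 'real_cons'. aggfunc="first" (an assumption) resolves repeated
# (timestamp, contract) pairs, so the result never has duplicated rows.
wide = day.pivot_table(index="datetime", columns="contract",
                       values="real_cons", aggfunc="first").add_suffix("_con")
print(wide)
```

Because `values="real_cons"` names the column explicitly, the other columns (`solar_gen`, `battery_charge`) are ignored, which addresses the issue raised in the first comment.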