Perform pandas concatenation of dataframe and read it from a file: answers

【Question title】: Perform pandas concatenation of dataframe and read it from a file
【Posted】: 2018-02-17 12:18:46
【Question description】:

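I have a use case where I need to create a Python dictionary keyed by year and month, and then concatenate all of its dataframes into a single dataframe. I have implemented it as follows: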

dict_year_month = {}
temp_dict_1={}
temp_dict_2={}  
for ym in [201104, 201105, ..., 201706]:  # intermediate yearmonths elided
    key_name = 'df_' + str(ym) + 'A'
    temp_dict_1[key_name] = df[(df['col1'] <= ym) & (df['col2'] > ym)
                               & (df['col3'] == 1)]

    temp_dict_2[key_name] = df[(df['col1'] <= ym) & (df['col2'] == 0)
                               & (df['col3'] == 1)]

    if not temp_dict_1[key_name].empty:
        dict_year_month[key_name] = temp_dict_1[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym
    elif not temp_dict_2[key_name].empty:
        dict_year_month[key_name] = temp_dict_2[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym

    dict_year_month[key_name] = dict_year_month[key_name].sort_values('col4')
    dict_year_month[key_name] = dict_year_month[key_name].drop_duplicates('col5')
    # .. do some other processing
    # create individual dataframes as df_201104A .. and so on ..
dict_year_month
# concatenate all the above individual dataframes into a single dataframe:
df1 = pd.concat([
    dict_year_month['df_201104A'], dict_year_month['df_201105A'],
    # ... and so on till dict_year_month['df_201706A']
])

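The challenge is that I have to rerun this code every quarter, so each time I have to update the script with the new yearmonth dict keys, and the pd.concat call also has to be updated with the new year and month details. I am looking for another solution in which I can read the keys, and build the list of dataframes to concatenate, from a properties file or config file.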

【Question comments】:

Tags: python python-3.x pandas dictionary dataframe

【Solution 1】:

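You only need a couple of things to get there. The first is enumerating the months between the start and end month, which I do below using rrule, reading the start and end dates from a file. That gives you the keys of the dictionary. Then just use the .values() method on the dictionary to get all the dataframes.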

from datetime import datetime
import pickle

import pandas as pd
from dateutil import rrule

# get these from wherever, config, etc.
params = {
    'start_year':2011,
    'start_month':4,
    'end_year':2017,
    'end_month':6,
}

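# save the parameters, then load them back (files are closed automatically)
with open("params.pkl", "wb") as f:
    pickle.dump(params, f)

with open("params.pkl", "rb") as f:
    params = pickle.load(f)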

start = datetime(year=params['start_year'], month=params['start_month'], day=1)
end = datetime(year=params['end_year'], month=params['end_month'], day=1)

keys = [int(dt.strftime("%Y%m")) for dt in rrule.rrule(rrule.MONTHLY, dtstart=start, until=end)]
print(keys)    
## Do some things and get a dict
dict_year_month = {'201104':pd.DataFrame([[1, 2, 3]]), '201105':pd.DataFrame([[4, 5, 6]])} #... etc

pd.concat(dict_year_month.values())

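The pickle file shows one way of saving and loading the parameters. It is a binary format, so editing the parameters by hand won't really work. You may want to look into something like yaml for anything more sophisticated; feel free to ask a new question if you need help with that.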
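As a hand-editable alternative to the pickle file, here is a minimal sketch that reads the same four parameters from JSON text and builds the yearmonth keys with the standard library only. The helper name `month_keys` and the inline `config_text` are assumptions made for illustration; in practice the JSON would probably live in a file (for example a hypothetical `params.json`) and be read with `json.load`.

```python
import json


def month_keys(start_year, start_month, end_year, end_month):
    """Return YYYYMM integers for every month from start to end, inclusive.

    Hypothetical helper, not part of the original answer.
    """
    keys = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        keys.append(year * 100 + month)
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return keys


# Assumed config content; it mirrors the params dict used in the answer.
config_text = """
{
    "start_year": 2011,
    "start_month": 4,
    "end_year": 2017,
    "end_month": 6
}
"""

params = json.loads(config_text)
keys = month_keys(**params)
print(keys[:3], keys[-1], len(keys))
```

With this layout, each quarterly rerun would only need `end_year` and `end_month` changed in the config, and the script itself stays untouched.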

【Comments】:

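Thanks ken, it is fairly clear up to generating the keys, but after that you are hardcoding '201104' etc. when creating the dictionary again.. so next time I would have to edit this script again to append the new yearmonths, don't you think so?
@user07 That was just me providing some sample data. You will have your own code that generates the dictionary of dataframes; I didn't know exactly what you were doing there, so I skipped it and created a dummy one to show how to do the concat at the end.
Ken, is there any way to create a pyspark dataframe using dict_year_month.values(), instead of pd.concat(dict_year_month.values())? Assume dict_year_month holds pyspark dataframes.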
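On the hardcoding concern raised in the first comment, here is a rough sketch of how the generated keys could drive the question's filtering loop, so that neither the dictionary keys nor the pd.concat call has to be edited by hand. The helper name `build_monthly_frames` and the toy data are invented for illustration. Two further assumptions differ from the question's code: months where both filters come back empty are skipped, and `assign` is used instead of writing through `.loc` on a filtered slice.

```python
import pandas as pd


def build_monthly_frames(df, keys):
    """Return {'df_<ym>A': filtered frame} for each yearmonth in keys.

    Hypothetical helper that mirrors the filters shown in the question.
    """
    frames = {}
    for ym in keys:
        base = (df['col1'] <= ym) & (df['col3'] == 1)
        first = df[base & (df['col2'] > ym)]
        second = df[base & (df['col2'] == 0)]
        # Same precedence as the question: use the first filter if it
        # matched anything, otherwise fall back to the second.
        chosen = first if not first.empty else second
        if chosen.empty:
            continue  # assumption: skip months with no matching rows
        frames[f'df_{ym}A'] = (
            chosen.assign(new_col=ym)
                  .sort_values('col4')
                  .drop_duplicates('col5')
        )
    return frames


# Toy data, invented purely for illustration.
df = pd.DataFrame({
    'col1': [201104, 201104, 201105],
    'col2': [201106, 0, 201107],
    'col3': [1, 1, 1],
    'col4': [2, 1, 3],
    'col5': ['a', 'b', 'c'],
})

keys = [201104, 201105]  # would normally come from the config-driven key list
frames = build_monthly_frames(df, keys)
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined)
```

Because the dictionary is built from `keys`, extending the date range in the config is enough to pick up the new months in both the dictionary and the final concatenation.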