Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names

【Question title】: Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names
【Posted】: 2017-04-08 12:09:38
【Question】:

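I have been trying for over a week to solve this, but I can't seem to find a solution. Some coders have been excellent in helping, but unfortunately no one has yet provided advice that has worked for me. I will try to ask the same question as simply as possible.

I have lots (over 100) of csv files. All csv files have "Datetime" as their first column. The "Datetime" format is "YYYY-MM-DD HH:MM:SS". Each file provides rows of data every 15 minutes over an entire month (lots of rows of data). The csv files are spread across three separate folders, with the following paths: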

“C:\Users\Documents\SummaryData\24Hour”

“C:\Users\Documents\SummaryData\Daytime”

“C:\Users\Documents\SummaryData\Nighttime”

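The csv files in the 24Hour folder span the full 24 hour time-frame. The csv files in the Daytime folder span 06:00 - 18:00 (HH:MM). The csv files in the Nighttime folder span 18:00 - 06:00 (HH:MM).

For example, a csv file exists for August 2015. For this month, the 24Hour folder holds a csv file that provides uninterrupted 15 minute interval data for the whole of August 2015.

For the same month and year, another csv file exists in the Daytime folder, which provides data only for the times 06:00 - 18:00.

[Image: snippet of the Daytime file, starting from 12 August (date chosen arbitrarily)]

[Image: snippet of the same file further into the month]

The same file exists for Nighttime, but spans the hours of the night.

Please note that more columns exist than those shown in the images above.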

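Whilst keeping these original 15 minute interval files, I need to resample all csv files so that each one has its own Hourly, Daily and Monthly file. The tricky part is that I want some of the columns to be summed over the resampling time-frame, whilst other columns need to be averaged over it.

So if I am resampling the data to daily values, I need some columns to average their data over the day, whilst others sum their data over the day. Either way, I need a single daily csv file created from each original 15 minute interval csv file. Across all files, columns with the same header name need the same resampling (so if column["windspeed"] needs to be averaged over the day, the same applies to column["windspeed"] in every other csv file).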
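The requirement above (one aggregation rule per column name, shared by every file) can be sketched with a dictionary passed to `resample(...).agg(...)`. This is only a minimal illustration: the column names `windspeed` and `rainfall`, the `HOW` mapping and the helper `resample_mixed` are assumptions made for the example, not names from the actual files.

```python
import pandas as pd

# Hypothetical column names: "windspeed" is averaged, "rainfall" is summed.
HOW = {"windspeed": "mean", "rainfall": "sum"}

def resample_mixed(df, rule, how=HOW):
    """Resample a frame with a DatetimeIndex, aggregating each column by its own rule."""
    # Only use the rules for columns that are actually present in this file.
    present = {col: func for col, func in how.items() if col in df.columns}
    return df.resample(rule).agg(present)

# Eight 15 minute rows, i.e. two full hours of dummy data.
index = pd.date_range("2015-08-01 00:00", periods=8, freq="15min")
sample = pd.DataFrame(
    {
        "windspeed": [2.0, 4.0, 6.0, 8.0, 1.0, 1.0, 1.0, 1.0],
        "rainfall": [0.5, 0.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    },
    index=index,
)

hourly = resample_mixed(sample, "h")
print(hourly)
```

Because the mapping is keyed by column name, the same dictionary can be reused for every file, which keeps the treatment of a given column consistent across files.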

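The other tricky part is that I also need these files to be exported as csv files (to any output location, say "C:\Users\cp_vm\Documents\Output") and automatically renamed to signify how they have been resampled.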

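So, taking the August 2015 csv file as an example, the file is currently named:

"2015August.csv"

If I resample this file to hourly, daily and monthly, I want the new resampled csv files to be saved as:

"2015AugustHourly.csv",

"2015AugustDaily.csv" and

"2015AugustMonthly.csv", respectively.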
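The naming scheme described above can be sketched with `pathlib`. The helper `resampled_name` and the `FREQ_TAGS` mapping are hypothetical names introduced for this example; the `"MS"` (month start) alias is one possible choice for the monthly rule, not something the question specifies.

```python
from pathlib import Path

# Hypothetical mapping from a pandas resampling rule to the suffix used in the file name.
FREQ_TAGS = {"h": "Hourly", "D": "Daily", "MS": "Monthly"}

def resampled_name(source, freq, out_dir):
    """Build the output path for a resampled copy of `source`."""
    source = Path(source)
    # Keep the original stem and extension, inserting the frequency tag between them.
    return Path(out_dir) / f"{source.stem}{FREQ_TAGS[freq]}{source.suffix}"

for freq in FREQ_TAGS:
    print(resampled_name("2015August.csv", freq, "Output").name)
```

Deriving the new name from the original stem means the original 15 minute file is never overwritten, as long as the tag is non-empty.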

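I know I need to use some form of "for loop", and I really have tried, but I can't figure this out. Any help would be greatly appreciated! And thank you to everyone who has already offered advice.

[Image: example output showing values averaged over each hour]

[Image: example output with some additional columns (SR_Gen and SR_All), which are the result of summing the 15 minute data over each hour]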

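【Comments】:

Can you provide an example .jpeg of what you are looking for (using the times of 29 August 2016 as an example)? This will ensure I understand the goal correctly.
@NickBraunagel I have included (above) examples of what the hourly resampled output should look like. What I have done so far is create a separate csv (2 rows x n columns) that lists the columns in row 1 and the resampling operation in row 2. So row 2 consists of ["mean", "mean", "mean", "sum"], etc. I then convert the two lists (row 1 and row 2) into a dictionary.
Do you have access to any relational database, either server-level (e.g. MySQL, Postgres) or file-level (e.g. SQLite, MS Access)? If so, import the csvs and run aggregations grouped by month/day/hour.
OK, thanks for the output clarification. Another question: for the MONTHLY resampling, are you trying to find the monthly average for day and for night over the month, or just the monthly average regardless of day/night? For example, do you want to know the average speed of Engine1 over a month, or the average speed of Engine1 at night and during the day over a month?
@NickBraunagel I want to know the latter, i.e. the average speed of Engine 1 over a month, at night and during the day. Thank you.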

Tags: python csv datetime for-loop export-to-csv

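【Answer 1】:

I think you can re-use the code from our previous work. Using the original code, after creating the NIGHT and DAY dataframes, you can resample them on an hourly, daily and monthly basis and save the new (resampled) dataframes as .csv files wherever you like.

I'm going to use a sample dataframe (first 3 rows shown here):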

dates               PRp         PRe         Norm_Eff    SR_Gen      SR_All
2016-01-01 00:00:00 0.269389    0.517720    0.858603    8123.746453 8770.560467
2016-01-01 00:15:00 0.283316    0.553203    0.862253    7868.675481 8130.974409
2016-01-01 00:30:00 0.286590    0.693997    0.948463    8106.217144 8314.584848

Full code

import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
import random
import calendar

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
r = range(0,10000)

dates = [start + relativedelta(minutes=15*i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]

DF = pd.DataFrame({
        'dates': dates,
        'PRp': PRp,
        'PRe': PRe,
        'Norm_Eff': Norm_Eff,
        'SR_Gen': SR_Gen,
        'SR_All': SR_All,
    })



# define when day starts and ends (MUST USE 24 CLOCK)
day = {
        'start': datetime.time(6,0),  # start at 6am (6:00)
        'end': datetime.time(18,0)  # ends at 6pm (18:00)
      }


# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year

if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):

        # filter to show NIGHT and DAY dataframe for given month within given year
        # (half-open interval [month_start, month_end) so the last day of the month is kept in full)
        month_start = datetime.datetime(year, month, 1)
        month_end = month_start + relativedelta(months=1)
        in_month = (DF.dates >= month_start) & (DF.dates < month_end)
        time_of_day = DF.dates.dt.time

        # rows at exactly 06:00 and 18:00 fall into NIGHT
        NIGHT = DF[in_month & ((time_of_day <= day['start']) | (time_of_day >= day['end']))].copy()

        DAY = DF[in_month & (time_of_day > day['start']) & (time_of_day < day['end'])].copy()

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        # (pandas 2.2+ deprecates the 'H' and 'M' aliases in favour of 'h' and 'ME')
        for resample_freq, freq_tag in zip(['H','D','M'], ['Hourly','Daily','Monthly']):

            NIGHT.index = NIGHT.dates                           # resampled column must be placed in index
            NIGHT_R = pd.DataFrame(data={
                    'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),            # averaging data
                    'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),        # summing data
                    'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()  
                })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY.index = DAY.dates
            DAY_R = pd.DataFrame(data={
                    'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                    'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),        
                    'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()  
                })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)

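The above results in a total of six .csv files per month:

Hourly basis for daytime
Daily basis for daytime
Monthly basis for daytime
Hourly basis for nighttime
Daily basis for nighttime
Monthly basis for nighttime

Each file is named as follows: (Year)(Month name)_(DAY/NIGHT)_(Frequency). For example: 2016August_NIGHT_Daily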
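As a side note, the day/night split that produces these six files can also be sketched without the per-row `apply` calls, using `DatetimeIndex.indexer_between_time`. This is a hedged alternative rather than the code from the answer: `split_day_night` is a hypothetical helper, and it places both boundary times (06:00 and 18:00) in the day part, whereas the code above places them in NIGHT.

```python
import numpy as np
import pandas as pd

def split_day_night(df, day_start="06:00", day_end="18:00"):
    """Split a frame with a DatetimeIndex into (day, night) parts.

    Assumption: rows at exactly day_start and day_end count as day,
    because indexer_between_time includes both ends by default.
    """
    day_positions = df.index.indexer_between_time(day_start, day_end)
    is_day = np.zeros(len(df), dtype=bool)
    is_day[day_positions] = True
    return df[is_day], df[~is_day]

# One day of dummy 15 minute data (96 rows).
index = pd.date_range("2016-01-01 00:00", periods=96, freq="15min")
frame = pd.DataFrame({"PRp": 0.25, "SR_Gen": 8000.0}, index=index)

day_part, night_part = split_day_night(frame)
print(len(day_part), len(night_part))
```

Each part can then be resampled separately, in the same way as NIGHT and DAY in the full code.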

Let me know if the above achieves the goal.

Also, here is a list of the available resample frequencies you can choose from: pandas resample documentation

【Comments】:

Thanks @NickBraunagel, and apologies for my late response. I was away on holiday too. Please see my reply in the other answer.

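【Answer 2】:

@NickBraunagel sincere thanks for the time you have put into this question. I apologise for my late reply. I was on holiday too and have only just got back. Your code looks very good and is probably more efficient than my own. I'll run it once work quietens down to see whether that is the case. However, whilst waiting for a response, I managed to solve the problem myself. I have posted my code below.

To avoid writing out every column name along with whether to "mean" or "sum" its data over the resampling period, I manually created another file (built in Excel and saved as a csv) that lists the column headers in row 1 and "mean" or "sum" beneath each header (n columns x 2 rows). I then convert this csv into a dictionary and refer to it in the resampling code. See below.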
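A minimal sketch of that conversion follows, with the two-row file replaced by an in-memory string so the example is self-contained. The column names shown are assumptions taken from the sample data in the first answer; the real file lists every column of the data files.

```python
import io
import pandas as pd

# Hypothetical contents of the two-row mapping file:
# a header row of column names, then one row with the rule for each column.
mapping_csv = io.StringIO("Datetime,PRp,SR_Gen\nmin,mean,sum\n")

mapping = pd.read_csv(mapping_csv)

# The first (and only) data row maps each column name to its aggregation rule.
what_to_do = mapping.iloc[0].to_dict()
print(what_to_do)
```

A dictionary of this shape can be passed directly to `DataFrame.resample(...).agg(...)`.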

Also, I import data that have already been split into 24Hour, Daytime and Nighttime files, and then resample them.

import pandas as pd
import glob

#project specific paths - comment (#) all paths not relevant

#read in manually created re-sampling csv file to reference later as a dictionary in the re-sampling code
#the file below consists of n*columns x 2 rows, where row 1 is the column headers and row 2 specifies whether that column is to be averaged ('mean') or summed ('sum') over the re-sampling time period
f =pd.read_csv('C:/Users/cp_vm/Documents/ResampleData/AllData.csv')

#convert manually created resampling csv to dictionary ({'columnname': resample, 'columnname2': resample2})
what_to_do = f.iloc[0].to_dict()

#this is not very efficient, but for the time being, comment (#) all paths not relevant
#meaning run the script multiple times, each time changing the in- and outpaths
#read in datafiles via their specific paths: order - AllData 24Hour, AllData DayTime, AllData NightTime
inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/24Hour/'
outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/24Hour/{0}_{1}_{2}_AllData_24Hour.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Daytime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Daytime/{0}_{1}_{2}_AllData_Daytime.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Nighttime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Nighttime/{0}_{1}_{2}_AllData_Nighttime.csv'

allFiles = glob.glob(inpath + "/*.csv")

#resample all incoming files to be hourly-h, daily-d, or monthly-m and export with automatic naming of files
for files_ in allFiles:
    #read in all files_
    df = pd.read_csv(files_,index_col = None, parse_dates = ['Datetime'])
    df.index = pd.to_datetime(df.Datetime)
    #change Datetime column to be numeric, so it can be resampled without being removed
    df['Datetime'] = pd.to_numeric(df['Datetime'])
    #specify year and month for automatic naming of files (taken from the first row)
    year = df.index.year[0]
    month = df.index.month[0]
    #comment (#) irrelevant resampling, so run it three times, changing h, d and m
    resample = "h"
    #resample = "d"
    #resample = "m"
    #resample df based on the dictionary defined by what_to_do and resample - please note that 'Datetime' has the resampling 'min' associated to it in the manually created re-sampling csv file
    df = df.resample(resample).agg(what_to_do)
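    #drop bins that contained no source rows: their 'Datetime' minimum is missing
    #(on newer pandas 'sum' returns 0 for an empty bin, so dropna(how='all') alone would keep such rows)
    df = df.dropna(subset=['Datetime'])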
    #change Datetime column back to datetime.datetime format
    df.Datetime = pd.to_datetime(df.Datetime)
    #make datetime column the index
    df.index = df.Datetime
    #move datetime column to the front of dataframe
    cols = list(df.columns.values)
    cols.pop(cols.index('Datetime'))
    df = df[['Datetime'] + cols]
    #export all files automating their names dependent on their datetime
    #if the dataframe has any rows, then export it
    if df.shape[0] > 0:
        df.to_csv(outpath.format(year,month,resample), index=False)
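The script above has to be re-run with different paths and frequencies commented in and out. One way to avoid that, sketched here under assumptions, is to loop over the files and the frequencies in a single function. `resample_folder` and `FREQS` are hypothetical names, `"MS"` (month start) is one possible monthly rule, and the demo writes a small dummy file to a temporary directory instead of using the real folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file name tag to pandas resampling rule.
FREQS = {"Hourly": "h", "Daily": "D", "Monthly": "MS"}

def resample_folder(in_dir, out_dir, what_to_do, freqs=FREQS):
    """Resample every CSV in in_dir at each frequency and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(in_dir).glob("*.csv")):
        df = pd.read_csv(csv_path, parse_dates=["Datetime"], index_col="Datetime")
        # Only apply rules for columns that exist in this file.
        rules = {col: func for col, func in what_to_do.items() if col in df.columns}
        if df.empty or not rules:
            continue
        for tag, rule in freqs.items():
            resampler = df.resample(rule)
            out = resampler.agg(rules)
            # Keep only bins that contained at least one source row.
            out = out[resampler.size() > 0]
            if not out.empty:
                target = out_dir / f"{csv_path.stem}{tag}.csv"
                out.to_csv(target)
                written.append(target)
    return written

# Demo on dummy data: two hours of 15 minute rows in one file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    idx = pd.date_range("2015-08-01 00:00", periods=8, freq="15min")
    dummy = pd.DataFrame({"Datetime": idx, "PRp": 0.25, "SR_Gen": 10.0})
    dummy.to_csv(src / "2015August.csv", index=False)

    written = resample_folder(src, Path(tmp) / "out", {"PRp": "mean", "SR_Gen": "sum"})
    names = [path.name for path in written]
    hourly = pd.read_csv(Path(tmp) / "out" / "2015AugustHourly.csv")

print(names)
```

Calling the function once per folder (24Hour, Daytime, Nighttime) would cover all three sets of files in one run; the folder names and output locations are left to the caller.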

【Comments】:

OK, cool - glad you were able to get the desired output! Unless you tell me otherwise, I'll assume you're all set.