【Title】: Reading in specific date lines from a file with pandas python
【Posted】: 2016-03-01 18:53:35
【Question】:

I am trying to read in many files. Each file is a daily data file with data every 10 minutes. The data in each file is "chunked" like this:

2015-11-08 00:10:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.32  111.9   0.15   0.12  1.50E+05       0
40   3.85  108.2   0.07   0.14  7.75E+04       0
50   4.20  107.9   0.06   0.15  4.73E+04       0
60   4.16  108.5   0.03   0.19  2.73E+04       0
70   4.06   93.6   0.03   0.23  9.07E+04       0
80   4.06   93.8   0.07   0.28  1.36E+05       0

2015-11-08 00:20:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.79  120.9   0.15   0.11  7.79E+05       0
40   4.36  115.6   0.04   0.13  2.42E+05       0
50   4.71  113.6   0.07   0.14  6.84E+04       0
60   5.00  113.3   0.13   0.17  1.16E+04       0
70   4.29   94.2   0.22   0.20  1.38E+05       0
80   4.54   94.1   0.11   0.25  1.76E+05       0

2015-11-08 00:30:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.86  113.6   0.13   0.10  2.68E+05       0
40   4.34  116.1   0.09   0.11  1.41E+05       0
50   5.02  112.8   0.04   0.12  7.28E+04       0
60   5.36  110.5   0.01   0.14  5.81E+04       0
70   4.67   95.4   0.14   0.16  7.69E+04       0
80   4.56   95.0   0.15   0.21  9.84E+04       0

...

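The file continues like this every 10 minutes for the whole day. The filename for this file is 151108.mnd. I want my code to read in all the November files, so 1511??.mnd, and I want it to read in all the date-time lines in each day file for the entire month. So for the partial data file example I just showed, I would want my code to store 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, etc. as variables, then go to the next day's file (151109.mnd), read in all its date-time lines, store them as date variables, and append them to the previously stored dates. And so on for the entire month. This is the code I have so far: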

import pandas as pd
import glob
import datetime

filename = glob.glob('1511??.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
dates = []
counter = 1
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows = 32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
            print line
            date_object = datetime.datetime.strptime(line[:-6], '%Y-%m-%d %H:%M:%S %f')
            dates.append(date_object)
            counter = 0
        else:
            counter += 1 
    frames.append(f_nov15_hereford) 
data_nov15_hereford = pd.concat(frames,ignore_index=True)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)


print dates

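There are a few problems with this code: when I print dates, it prints two copies of every date, and it only prints the first date of each file, so 2015-11-08 00:10:00, 2015-11-09 00:10:00, etc. It isn't going line by line through each file and then, once all the dates in that file are stored, moving on to the next file like I want. Instead it just grabs the first date in each file. Any help with this code? Is there an easier way to do what I want? Thanks!

【Discussion】:

    Tags: python file datetime pandas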


    【Solution 1】:

    A few observations:

    First, why you only get the first date in each file:

    f_nov15_hereford = pd.read_csv(i, skiprows = 32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
    

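    The first line reads the file into a pandas DataFrame. The second line iterates over the DataFrame's columns, not its rows. As a result, the last line checks whether the column label starts with "20", and that happens only once per file.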
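
    As a minimal illustration of that behavior (the two-column frame here is made up for the example and is not the asker's data), iterating over a DataFrame yields its column labels, while rows have to be requested explicitly:

```python
import pandas as pd

# Hypothetical frame, for illustration only.
df = pd.DataFrame({"2015-11-08 00:10:00": [30, 40], "other": [3.32, 3.85]})

# Iterating a DataFrame walks over its column labels, not its rows.
labels = [label for label in df]
print(labels)

# Rows must be requested explicitly, for example with itertuples().
rows = [tuple(row) for row in df.itertuples(index=False)]
print(rows)
```

    So a `startswith("20")` test inside `for line in df` can only ever match a column label, never a data line.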

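    Second: counter is initialized and its value is changed, but it is never used. I presume it was meant to be used to skip over lines in the file.

    Third: it might be simpler to collect all the dates into a Python list and then convert that to a pandas DataFrame if needed.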

    import pandas as pd
    import glob
    import datetime as dt
    
    # number of lines to skip before the first date
    offset = 32
    
    # number of lines from one date to the next
    recordlength = 9
    
    pattern = '1511??.mnd'
    
    dates = []
    
    for filename in glob.iglob(pattern):
    
        with open(filename) as datafile:
    
            count = -offset
            for line in datafile:
                if count == 0:
                    # parse the leading timestamp; the trailing "00:10:00" field is ignored
                    fmt = '%Y-%m-%d %H:%M:%S'
                    date_object = dt.datetime.strptime(line[:19], fmt)
                    dates.append(date_object)
    
                count += 1 
    
                if count == recordlength:
                    count = 0
    
    data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])
    
    print(dates)
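
    If the offset or the record length is not the same in every file, a variant that recognizes date lines by their content may be more forgiving. The following is only a sketch: `collect_dates` and `TIMESTAMP` are names invented here, and it assumes that no line in the skipped file header happens to start with a timestamp.

```python
import datetime as dt
import re

# Matches a leading timestamp such as "2015-11-08 00:10:00".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def collect_dates(lines):
    """Return a datetime for every line that starts with a timestamp."""
    found = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            found.append(dt.datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S"))
    return found

# Sample lines copied from the question.
sample = [
    "2015-11-08 00:10:00 00:10:00\n",
    "#    z  speed    dir      W   sigW       bck   error \n",
    "30   3.32  111.9   0.15   0.12  1.50E+05       0\n",
    "\n",
    "2015-11-08 00:20:00 00:10:00\n",
]
dates = collect_dates(sample)
print(dates)
```

    For the real files, the same function can be fed an open file object for each name returned by `glob.glob('1511??.mnd')`, extending one shared list with the result.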
    

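    【Discussion】:

    • This seems to work great! My only complaint is that when I print dates, it still gives me two sets. Or if I print np.shape(dates) I get two shapes: (2046L,) (2046L,)
    • Never mind, I think that's an issue with my notebook rather than with the code! Thank you very much!
    【Solution 2】: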

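    Consider modifying the csv data line by line before reading it in as a DataFrame. The script below opens each original file in the glob list and writes to a temp file, moving the date into a last column and removing the repeated headers and blank lines.

    CSV data (assuming the text view of the csv file looks like the following; if it differs from the actual data, adjust the py code)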

    2015-11-0800:10:0000:10:00,,,,,,
    z,speed,dir,W,sigW,bck,error
    30,3.32,111.9,0.15,0.12,1.50E+05,0
    40,3.85,108.2,0.07,0.14,7.75E+04,0
    50,4.2,107.9,0.06,0.15,4.73E+04,0
    60,4.16,108.5,0.03,0.19,2.73E+04,0
    70,4.06,93.6,0.03,0.23,9.07E+04,0
    80,4.06,93.8,0.07,0.28,1.36E+05,0
    ,,,,,,
    2015-11-0800:10:0000:20:00,,,,,,
    z,speed,dir,W,sigW,bck,error
    30,3.79,120.9,0.15,0.11,7.79E+05,0
    40,4.36,115.6,0.04,0.13,2.42E+05,0
    50,4.71,113.6,0.07,0.14,6.84E+04,0
    60,5,113.3,0.13,0.17,1.16E+04,0
    70,4.29,94.2,0.22,0.2,1.38E+05,0
    80,4.54,94.1,0.11,0.25,1.76E+05,0
    ,,,,,,
    2015-11-0800:10:0000:30:00,,,,,,
    z,speed,dir,W,sigW,bck,error
    30,3.86,113.6,0.13,0.1,2.68E+05,0
    40,4.34,116.1,0.09,0.11,1.41E+05,0
    50,5.02,112.8,0.04,0.12,7.28E+04,0
    60,5.36,110.5,0.01,0.14,5.81E+04,0
    70,4.67,95.4,0.14,0.16,7.69E+04,0
    80,4.56,95,0.15,0.21,9.84E+04,0
    

    Python script

    import glob, os
    import pandas as pd
    
    filenames = glob.glob('1511??.mnd')
    temp = 'temp.csv'
    
    # INITIATE EMPTY DATAFRAME
    data_nov15_hereford = pd.DataFrame(columns=['z', 'speed', 'dir', 'W', 
                                                'sigW', 'bck', 'error', 'date'])
    
    # ITERATE THROUGH EACH FILE IN GLOB LIST
    for file in filenames:        
        # DELETE PRIOR TEMP VERSION                    
        if os.path.exists(temp): os.remove(temp)
    
        header = 0
        # READ IN ORIGINAL CSV
        with open(file, 'r') as txt1:
            for rline in txt1:
                # SAVE DATE VALUE THEN SKIP ROW
                if "2015-11" in rline: date = rline.replace(',',''); continue
    
                # SKIP BLANK LINES (CHANGE IF NO COMMAS)               
                if rline == ',,,,,,\n': continue
    
                # ADD NEW 'DATE' COLUMN AND SKIP OTHER HEADER LINES
                if 'z,speed,dir,W,sigW,bck,error' in rline:
                    if header == 1: continue
                    rline = rline.replace('\n', ',date\n')
                    with open(temp, 'a') as txt2:
                        txt2.write(rline)
                    continue
                header = 1
    
                # APPEND LINE TO TEMP CSV WITH DATE VALUE
                with open(temp, 'a') as txt2:
                    txt2.write(rline.replace('\n', ','+date))
    
        # APPEND TEMP FILE TO DATA FRAME
        data_nov15_hereford = pd.concat([data_nov15_hereford, pd.read_csv(temp)])
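
    The same idea can also be sketched without a temp file. This variant assumes the files are whitespace-separated, as in the question, rather than comma-separated as assumed above. `parse_blocks` and `COLUMNS` are names made up for this example.

```python
import pandas as pd

COLUMNS = ["z", "speed", "dir", "W", "sigW", "bck", "error"]

def parse_blocks(lines):
    """Build one DataFrame from timestamped blocks, tagging each row with its block time."""
    rows = []
    current = None
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or repeated column header
        if len(fields[0]) == 10 and fields[0].count("-") == 2:
            current = fields[0] + " " + fields[1]  # timestamp of the block that follows
            continue
        if current is not None and len(fields) == len(COLUMNS):
            rows.append([float(value) for value in fields] + [current])
    frame = pd.DataFrame(rows, columns=COLUMNS + ["date"])
    frame["date"] = pd.to_datetime(frame["date"], format="%Y-%m-%d %H:%M:%S")
    return frame

# Sample lines copied from the question.
sample = [
    "2015-11-08 00:10:00 00:10:00\n",
    "#    z  speed    dir      W   sigW       bck   error \n",
    "30   3.32  111.9   0.15   0.12  1.50E+05       0\n",
    "40   3.85  108.2   0.07   0.14  7.75E+04       0\n",
    "\n",
    "2015-11-08 00:20:00 00:10:00\n",
    "#    z  speed    dir      W   sigW       bck   error \n",
    "30   3.79  120.9   0.15   0.11  7.79E+05       0\n",
    "40   4.36  115.6   0.04   0.13  2.42E+05       0\n",
]
frame = parse_blocks(sample)
print(frame)
```

    Lines before the first timestamp are ignored, so the file header does not need to be skipped by a fixed line count. The per-file frames can then be combined with `pd.concat`.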
    

    Output

         z  speed    dir     W  sigW     bck  error                        date
    0   30   3.32  111.9  0.15  0.12  150000      0  2015-11-0800:10:0000:10:00
    1   40   3.85  108.2  0.07  0.14   77500      0  2015-11-0800:10:0000:10:00
    2   50   4.20  107.9  0.06  0.15   47300      0  2015-11-0800:10:0000:10:00
    3   60   4.16  108.5  0.03  0.19   27300      0  2015-11-0800:10:0000:10:00
    4   70   4.06   93.6  0.03  0.23   90700      0  2015-11-0800:10:0000:10:00
    5   80   4.06   93.8  0.07  0.28  136000      0  2015-11-0800:10:0000:10:00
    6   30   3.79  120.9  0.15  0.11  779000      0  2015-11-0800:10:0000:20:00
    7   40   4.36  115.6  0.04  0.13  242000      0  2015-11-0800:10:0000:20:00
    8   50   4.71  113.6  0.07  0.14   68400      0  2015-11-0800:10:0000:20:00
    9   60   5.00  113.3  0.13  0.17   11600      0  2015-11-0800:10:0000:20:00
    10  70   4.29   94.2  0.22  0.20  138000      0  2015-11-0800:10:0000:20:00
    11  80   4.54   94.1  0.11  0.25  176000      0  2015-11-0800:10:0000:20:00
    12  30   3.86  113.6  0.13  0.10  268000      0  2015-11-0800:10:0000:30:00
    13  40   4.34  116.1  0.09  0.11  141000      0  2015-11-0800:10:0000:30:00
    14  50   5.02  112.8  0.04  0.12   72800      0  2015-11-0800:10:0000:30:00
    15  60   5.36  110.5  0.01  0.14   58100      0  2015-11-0800:10:0000:30:00
    16  70   4.67   95.4  0.14  0.16   76900      0  2015-11-0800:10:0000:30:00
    17  80   4.56   95.0  0.15  0.21   98400      0  2015-11-0800:10:0000:30:00
    

    【Discussion】:

    • This is useful! Thanks!