【Question title】: Filling the missing points in time series data with pandas.date_range and pandas.reindex
【Posted】: 2016-03-02 05:30:42
【Question】:

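I am trying to fill the missing points in time series data from an ASCII file using pandas. Everything else seems fine, but the first row is filled with NaN even though the original file has data there. A sample of my data: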

"2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

The code I am using is:

t1 = np.genfromtxt(INPUT,dtype=None,delimiter=',',usecols=[0])
start = t1[0].strip('\'"')
end = t1[-1].strip('\'"')
data=pd.read_csv(INPUT,sep=',',index_col=[0],parse_dates=[0])
index = pd.date_range(start,end,freq="30S")
df = data
sk_f = df.reindex(index)

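With this code, I want to read the first and last strings of the first column and use them to build the index, so that any missing points are filled in as NaN. The problem is that the first row is also filled with NaN, as the result below shows: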

2011-08-26 00:00:00,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan

2011-08-26 00:00:30,1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


2011-08-26 23:59:30,1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

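This means the first row is unexpectedly filled with NaN even though the original file contains data for it. From the second row onward everything is fine, and filling the missing data also seems to work. I have tried to work out why this happens but, honestly, have not found the cause yet. Any ideas or help would be greatly appreciated. Thanks, Isaac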

【Question comments】:

    Tags: python datetime pandas missing-data reindex


    【Solution 1】:

    I think you can skip reading the file with genfromtxt and use only read_csv, then take the min and max dates of the index for the reindex method.

    Or use resample:

    import pandas as pd
    import numpy as np
    import io
    
    temp=u""""2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
    "2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
    "2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1"""
    
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0], parse_dates=[0], header=None)
    print(df)
                              1       2      3      4      5      6         7   \
    0                                                                            
    2011-08-26 00:00:00  1155179   3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180   3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 23:59:30  1155297  12.620  28.06  3.162  1.356  24.30  111.4614   
    
                            8      9      10  11     12     13     14     15  
    0                                                                         
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 23:59:30  28.65  29.84  19.53   0  111.4  13.06  29.50  350.1  
    
    start = df.index.min()
    end = df.index.max()
    print(start)
    2011-08-26 00:00:00
    print(end)
    2011-08-26 23:59:30
    
    # full 30-second grid between the first and last timestamps
    index = pd.date_range(start, end, freq="30S")
    sk_f = df.reindex(index)
    print(sk_f.head())
                              1      2      3      4      5      6         7   \
    2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    2011-08-26 00:01:30      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    2011-08-26 00:02:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    
                            8      9      10  11     12     13     14     15  
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    2011-08-26 00:01:30    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    2011-08-26 00:02:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    
    # fill_method= was dropped from resample() in later pandas; chain .ffill() instead
    print(df.resample('30S').ffill().head())
                              1      2      3      4      5      6         7   \
    0                                                                           
    2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:02:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    
                            8      9      10  11     12     13     14     15  
    0                                                                         
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:02:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
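
    The min/max, date_range and reindex steps above can probably be condensed. The following is only a sketch, not part of the original answer: it assumes a reasonably recent pandas, uses `asfreq` as a swapped-in shorthand for the explicit reindex, and trims the sample rows to their first two value columns for brevity. The variable names (`raw`, `grid`, `by_reindex`, `by_asfreq`, `filled`) are illustrative, and `"30s"` is the lowercase spelling of the `"30S"` alias that recent pandas versions prefer.

```python
import io

import pandas as pd

# Three rows from the sample data, trimmed to two value columns (assumption:
# the remaining columns behave the same way).
raw = (
    '"2011-08-26 00:00:00",1155179,3.232\n'
    '"2011-08-26 00:00:30",1155180,3.289\n'
    '"2011-08-26 23:59:30",1155297,12.62\n'
)

# header=None keeps the first line as data instead of column names.
df = pd.read_csv(io.StringIO(raw), index_col=[0], parse_dates=[0], header=None)

# Explicit version, as in the answer: build the 30-second grid, then reindex.
grid = pd.date_range(df.index.min(), df.index.max(), freq="30s")
by_reindex = df.reindex(grid)

# Shorthand: asfreq builds the grid from the first and last index values.
by_asfreq = df.asfreq("30s")

# Forward-fill variant: chain the fill method after resample().
filled = df.resample("30s").ffill()

print(by_reindex.head())
print(by_asfreq.head())
print(filled.head())
```

    Which form to use is mostly a matter of taste: reindex makes the target grid explicit, which helps when the start and end should come from somewhere other than the data itself, while asfreq is shorter when the data's own first and last timestamps are the right bounds.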
    

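    【Comments】:

    • It is not clear to me where the first row's data goes missing: in read_csv, or in reindex? Please check my solution; if it does not work, I will try to find the cause. Thanks.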
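
    One way to check which step loses the row is to compare the two read_csv calls side by side. The sketch below is an illustration rather than a confirmed diagnosis of the asker's file: it assumes the file has no header line (as the sample suggests) and again trims the sample rows to two value columns. The question's call does not pass `header=None`, so read_csv's default header inference would take the first line as column names, and that timestamp would then be absent from the index before reindex ever runs. The names `inferred`, `explicit`, `bad` and `good` are hypothetical.

```python
import io

import pandas as pd

# Three rows from the sample data, trimmed to two value columns.
raw = (
    '"2011-08-26 00:00:00",1155179,3.232\n'
    '"2011-08-26 00:00:30",1155180,3.289\n'
    '"2011-08-26 23:59:30",1155297,12.62\n'
)

# As in the question: no header argument, so the first line becomes the header.
inferred = pd.read_csv(io.StringIO(raw), index_col=[0], parse_dates=[0])

# As in the answer: header=None, so every line is read as data.
explicit = pd.read_csv(io.StringIO(raw), index_col=[0], parse_dates=[0], header=None)

first = pd.Timestamp("2011-08-26 00:00:00")
last = pd.Timestamp("2011-08-26 23:59:30")

# Row counts differ by one, and the first timestamp is missing from `inferred`.
print(len(inferred), len(explicit))
print(first in inferred.index, first in explicit.index)
print(list(inferred.columns))  # the first data row ended up as column labels

# Reindexing onto the full grid then yields NaN only where the index lacks data.
index = pd.date_range(first, last, freq="30s")
bad = inferred.reindex(index)
good = explicit.reindex(index)
print(bad.iloc[0])   # first timestamp: all NaN, because the row was never data
print(good.iloc[0])  # first timestamp: original values preserved
```

    If the real file behaves like this sample, the row is lost in read_csv, not in reindex, and passing `header=None` as the answer does should be enough to keep it.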