【Question Title】: creating dictionary of dataframes within for-loop throws AssertionError [Pandas for Python]
【Posted】: 2013-12-17 17:10:32
【Question】:

I am trying to build a set of DataFrames from a folder full of CSV files. I first create a dictionary of DataFrames with the following for loop:

mydir = os.getcwd()
pdatahistorypath = os.path.join(mydir, pdatahistoryfolder)
currentcsvfilenames = os.listdir(pdatahistorypath)
dframes = {}
for filey in currentcsvfilenames:
    thispath = os.path.join(pdatahistorypath, filey)
    sitedata = pd.read_csv(thispath, header=4)
    sitedata = sitedata.drop('Unnamed: 16', axis=1) # drops waste column
    sitedata['Date'] = pd.to_datetime(sitedata['Date'])
    sitedata.index = sitedata['Date']               # reassign the index to the date column
    dframes[filey[:-4]] = sitedata
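
For comparison, here is a minimal sketch of the same loading step in more recent pandas, under some assumptions: every file has a `Date` column, only `*.csv` files should be read, and the `header=4` offset and the `'Unnamed: 16'` drop from the loop above are left out for brevity. The helper name `load_frames` and the demo files are hypothetical, not part of the original script.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_frames(folder):
    """Return {file stem: DataFrame} for every .csv file in `folder`."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # parse_dates + index_col replace the separate to_datetime and index assignments
        frame = pd.read_csv(path, parse_dates=["Date"], index_col="Date")
        frames[path.stem] = frame.sort_index()
    return frames


# Tiny stand-in files (hypothetical data, not the real site logs)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "site_a.csv").write_text(
    "Date,RPM\n2013-12-10 17:00:00,2820\n2013-12-10 18:00:00,2820\n"
)
(demo_dir / "site_b.csv").write_text("Date,RPM\n2013-12-10 20:00:00,2880\n")
(demo_dir / "notes.txt").write_text("not a csv\n")  # skipped by the glob

demo_frames = load_frames(demo_dir)
print(sorted(demo_frames))
```

Filtering on the extension keeps stray files in the folder out of the dictionary, and `path.stem` yields the key without the `.csv` suffix.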

I then pull them into a Panel using:

mypanel = pd.Panel(dframes) # create panel

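From that panel, I extract the earliest and latest dates, round the earliest date down to the nearest 20 minutes, and create a DatetimeIndex at 20-minute intervals covering that time span: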

first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1]             # the very last date in series
multiplier = (1e9)*60*20                       # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']                         # keys were stored without the '.csv' suffix (filey[:-4])
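
The flooring step can be checked in isolation. The sketch below uses hypothetical timestamps and compares the integer arithmetic on nanoseconds with `Timestamp.floor`, which is available in later pandas releases than the one in the traceback. Using an integer step rather than `1e9` keeps the arithmetic out of floating point.

```python
import pandas as pd

first_date = pd.Timestamp("2013-12-10 16:47:13")   # hypothetical first timestamp
last_date = pd.Timestamp("2013-12-10 18:05:00")    # hypothetical last timestamp

# Integer arithmetic on nanoseconds since the epoch, as in the question
step_ns = 20 * 60 * 10**9
floored_by_hand = pd.Timestamp(first_date.value - first_date.value % step_ns)

# Built-in equivalent in later pandas versions
floored = first_date.floor("20min")

idx = pd.date_range(floored, last_date, freq="20min")
print(floored_by_hand, floored, len(idx))
```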

Next, I try to reindex my irregularly timestamped data onto the 20-minute interval series created above, idx:

df2 = df.reindex(idx)
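
As a side note on what `reindex` does here: it keeps only rows whose timestamps match the new index exactly, so readings that fall between grid points turn into missing values unless a fill method is given. A small sketch with hypothetical readings:

```python
import pandas as pd

# Hypothetical irregular readings; 18:03 does not fall on the 20-minute grid
raw = pd.DataFrame(
    {"RPM": [2820, 2820, 2880]},
    index=pd.to_datetime(
        ["2013-12-10 17:00:00", "2013-12-10 18:03:00", "2013-12-10 20:00:00"]
    ),
)
idx = pd.date_range("2013-12-10 17:00", "2013-12-10 20:00", freq="20min")

exact = raw.reindex(idx)                  # only exact timestamp matches survive
padded = raw.reindex(idx, method="pad")   # carry the last earlier reading forward

print(exact["RPM"].notna().sum(), padded["RPM"].notna().sum())
```

`method="pad"` requires the original index to be sorted, which is one reason to sort each frame after loading.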

The problem is that I get the following error:

Traceback (most recent call last):
  File "C:/Users/ble1usb/Dropbox/Git/ers-dataanalyzzer/pandasdfmaker.py", line 50, in <module>
    df2 = df.reindex(idx)#, method=None)#, method='pad', limit=None) # reindex to the datetimeindex built from first/last dates
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2625, in reindex
    fill_value, limit, takeable)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2707, in _reindex_index
    copy, fill_value)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2723, in _reindex_with_indexers
    fill_value=fill_value)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1985, in reindex_indexer
    return BlockManager(new_blocks, new_axes)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1001, in __init__
    self._verify_integrity()
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1236, in _verify_integrity
    raise AssertionError("Block ref_items must be BlockManager "
AssertionError: Block ref_items must be BlockManager items
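
The traceback does not say which frame or label trips the integrity check, and nothing in this thread confirms the cause. One thing that may be worth ruling out, purely as an assumption, is duplicated timestamps or duplicated column names in a frame, since `reindex` expects unique labels. The helper `integrity_report` below is hypothetical and only a sketch of such a check.

```python
import pandas as pd


def integrity_report(frame):
    """Summarise label problems that commonly interfere with reindexing."""
    return {
        "duplicate_index": int(frame.index.duplicated().sum()),
        "duplicate_columns": int(frame.columns.duplicated().sum()),
        "index_sorted": bool(frame.index.is_monotonic_increasing),
    }


# Hypothetical frame with one repeated timestamp
demo = pd.DataFrame(
    {"RPM": [2820, 2820, 2880]},
    index=pd.to_datetime(
        ["2013-12-10 17:00:00", "2013-12-10 17:00:00", "2013-12-10 18:00:00"]
    ),
)
report = integrity_report(demo)
print(report)

# Keep the last reading for each repeated timestamp
deduped = demo[~demo.index.duplicated(keep="last")]
```

Running the report on every frame in the dictionary would show whether the frames built in the loop differ from the single frame that works.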

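While debugging this, I found that the following works fine. I tried to reproduce every difference I could think of, except for creating the DataFrames inside a loop: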

dframes = {}
dfpath = r'C:\Users\ble1usb\Dropbox\Git\ers-dataanalyzzer\datahistoryPandas\Naka-1.csv'
sitedata = pd.read_csv(dfpath, header=4)
sitedata = sitedata.drop('Unnamed: 16', axis=1) # drops waste column
sitedata['Date'] = pd.to_datetime(sitedata['Date'])
sitedata.index = sitedata['Date']  # reassign the index to the date column
dframes['Naka-1'] = sitedata
dframes['myOtherSite'] = sitedata[sitedata['Out ppm'] > 3]
mypanel = pd.Panel(dframes)
first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1]             # the very last date in series
multiplier = (1e9)*60*20                       # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']                         # key as stored above (no '.csv' suffix)
df2 = df.reindex(idx)

This is the output of the previous code block (I am losing some data to rounding, which I will sort out later):

>> print df2.tail(15)

                                   Date   Status Alarms Present    RPM  Hours  Oil Pres. Out ppm Ratio   In  Out  Inlet psi  Bag psi  Disch. psi  Hi Pres  Coolant Temp    Comm
2013-12-10 16:40:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 17:00:00 2013-12-10 17:00:00  Running           none  2,820  9,384         53       0     0  469  473       5.56     0.72       268.1        0             1  Normal
2013-12-10 17:20:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 17:40:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 18:00:00 2013-12-10 18:00:00  Running           none  2,820  9,385         54       0     0  462  470      12.28     0.82       259.1        0             1  Normal
2013-12-10 18:20:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 18:40:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 19:00:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 19:20:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 19:40:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 20:00:00 2013-12-10 20:00:00  Running           none  2,880  9,387         55       0     0  450  456      10.91     0.73       249.9        0             1  Normal
2013-12-10 20:20:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 20:40:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN
2013-12-10 21:00:00 2013-12-10 21:00:00  Running           none  2,820  9,388         54       0     0  440  449       8.16     0.62       243.1        0             1  Normal
2013-12-10 21:20:00                 NaT      NaN            NaN    NaN    NaN        NaN     NaN   NaN  NaN  NaN        NaN      NaN         NaN      NaN           NaN     NaN

So I know it should work. I cannot think of anything else that would cause this assertion error.

Is there anything I can try?

【Question Discussion】:

    Tags: python csv pandas


    【Solution 1】:

    You should use resample rather than reindexing with a date_range. Instead of:

    idx = pd.date_range(t3, last_date, freq="20min")
    df2 = df.reindex(idx)
    

    you could write:

    df.resample('20min', 'last')
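
The call above uses the resample signature of the pandas release current at the time. In later pandas versions the aggregation is chained as a method instead; the sketch below assumes such a version and uses hypothetical readings.

```python
import pandas as pd

# Hypothetical irregular readings
raw = pd.DataFrame(
    {"RPM": [1, 2, 3]},
    index=pd.to_datetime(
        ["2013-12-10 17:00:00", "2013-12-10 17:05:00", "2013-12-10 18:03:00"]
    ),
)

# One row per 20-minute bin, holding the last reading seen in that bin
last_per_bin = raw.resample("20min").last()
print(last_per_bin)
```

Unlike an exact-match reindex, the 18:03 reading is kept here because it is assigned to the 18:00 bin.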
    

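    【Discussion】:

    • The problem is that many of the DataFrames will have "dead zones" at the beginning and end. One DF may only have data from June to August, yet it will be compared against one with data from April to December. I believe I need every DataFrame in the panel to have the same index.
    • @BrianLeach I guess it depends on what you want to do later; pandas is usually pretty smart about alignment... have you tried the panel resample method?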
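
The two comments can be combined: resample each frame to 20-minute bins, then reindex every result onto one shared index that spans the earliest and latest timestamps across all frames. The sketch below assumes a recent pandas release, where `pd.Panel` no longer exists, so it keeps a plain dictionary and uses `pd.concat` for the combined view. The site names and readings are hypothetical.

```python
import pandas as pd

# Hypothetical frames covering different time spans
frames = {
    "site_a": pd.DataFrame(
        {"RPM": [2820, 2830]},
        index=pd.to_datetime(["2013-12-10 17:00:00", "2013-12-10 17:25:00"]),
    ),
    "site_b": pd.DataFrame(
        {"RPM": [2880]},
        index=pd.to_datetime(["2013-12-10 18:10:00"]),
    ),
}

# One grid from the earliest (floored) to the latest timestamp of any frame
start = min(f.index.min() for f in frames.values()).floor("20min")
end = max(f.index.max() for f in frames.values())
shared_idx = pd.date_range(start, end, freq="20min")

# Bin each frame first, then pad its "dead zones" with NaN rows
aligned = {
    name: f.resample("20min").last().reindex(shared_idx)
    for name, f in frames.items()
}

# Columns become (site, column) pairs on the shared index
combined = pd.concat(aligned, axis=1)
print(combined)
```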