【发布时间】:2015-08-27 05:05:33
【问题描述】:
我正在尝试查找并计算 HDF5 文件中的数据。该文件有一个名为 IAS 的数据列。奇怪的是,我以两种不同的方式得到了完全不同的答案,我认为这应该是等效的,并且应该给出相同的结果。这是第一版,我只在循环内测试 > 120:
import pandas as pd
import numpy as np
store = pd.HDFStore('AllData.h5','r')
a=store.select('mydata',iterator=True,chunksize=50000)
bins=np.arange(0.00,1.5,.02)
#Fill in a blank counting Series:
counts_total= pd.value_counts(pd.cut([],bins,include_lowest=True))
for chunk in a:
counts=pd.value_counts(pd.cut(abs(chunk[(chunk['IAS']>120.0)]['LatAc']),bins,include_lowest=True))
counts_total=counts+counts_total
虽然在这种情况下,我使用 select 来查找我想要的那些,然后在循环中计算它们。第二个版本运行得更快,但似乎只返回一个块。我不确定哪一个是正确的,或者他们为什么给出不同的答案。
import pandas as pd
import numpy as np
store = pd.HDFStore('AllData.h5','r')
a=store.select('mydata',iterator=True,chunksize=50000,where='IAS > 120.0')
bins=np.arange(0.00,1.5,.02)
#Fill in a blank counting Series:
counts_total2= pd.value_counts(pd.cut([],bins,include_lowest=True))
for chunk in a:
counts=pd.value_counts(pd.cut(abs(chunk['LatAc']),bins,include_lowest=True))
counts_total2=counts+counts_total2
如果我只看返回的行数,第二个版本只返回单个块中的所有行(数据文件为 280 万行)。
第二个版本有什么问题吗?我认为我正确使用了“where”,但是它没有给出预期的结果。
而且,只是为了澄清我的主要目标。在 IAS 超过 120 的情况下,我正在尝试合并我的 LatAC 列数据,并忽略其余数据。
添加更多信息。我正在使用 PyTables 3.2.1(我在 Pandas IO 工具页面上看到了有关先前版本中的索引错误的警告)。
我正在读取的数据由我解析并附加到 HDF5 文件的多个 csv 文件组成。似乎第二种方法,使用“where”只返回最后几个附加文件的数据,而第一种方法返回更多的数据。
看到这个错误后:https://github.com/pydata/pandas/issues/5913 我想知道我是否遇到过类似的事情。尽管就我而言,我没有提供预期的行。所以,我还是决定尝试运行 ptrepack,但现在我收到了一个错误:
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack --chunkshape=auto --propindexes --keep-source-filters --complib blosc --complevel 7 /Volumes/Untitled/AllData.h5 /Volumes/Untitled/AllData_repack.h5
Problems doing the copy from '/Volumes/Untitled/AllData.h5:/ (RootGroup) ''' to '/Volumes/Untitled/AllData_repack.h5:/ (RootGroup) '''
The error was --> <class 'tables.exceptions.HDF5ExtError'>: HDF5 error back trace
File "H5F.c", line 522, in H5Fcreate
unable to create file
File "H5Fint.c", line 1022, in H5F_open
unable to truncate a file which is already open
End of HDF5 error back trace
Unable to open/create file '/Volumes/Untitled/pytables-powkl3.tmp'
The destination file looks like:
/Volumes/Untitled/AllFlightLogs_repack.h5 (File) ''
Last modif.: 'Wed Aug 19 09:37:28 2015'
Object Tree:
/ (RootGroup) ''
/mydata (Group) ''
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/ptrepack", line 9, in <module>
load_entry_point('tables==3.2.1', 'console_scripts', 'ptrepack')()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 525, in main
use_hardlinks=True)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 251, in copy_children
raise RuntimeError("Please check that the node names are not "
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that root_uep is not fooling you.
所以,hdf5 文件似乎有问题,但我不确定这个错误是什么意思,或者如何纠正它。
回答 cmets 的问题: store.mydata.info() 返回:
Int64Index: 2772065 entries, 0 to 1640
Data columns (total 63 columns):
DateTimeGMT datetime64[ns]
AtvWpt object
Latitude float64
Longitude float64
AltB float64
BaroA float64
AltMSL float64
OAT float64
IAS float64
GndSpd float64
VSpd float64
Pitch float64
Roll float64
LatAc float64
NormAc float64
HDG float64
TRK float64
volt1 float64
volt2 float64
amp1 float64
amp2 float64
FQtyL float64
FQtyR float64
E1_FFlow float64
E1_FPres float64
E1_OilT float64
E1_OilP float64
E1_Torq float64
E1_NP float64
E1_NG float64
E1_ITT float64
E2_Torq float64
E2_NP float64
E2_NG float64
E2_ITT float64
AltGPS float64
TAS float64
HSIS object
CRS float64
NAV1 float64
NAV2 float64
COM1 float64
COM2 float64
HCDI float64
VCDI float64
WndSpd float64
WndDr float64
WptDst float64
WptBrg float64
MagVar float64
AfcsOn float64
RollM object
PitchM object
RollC float64
PichC float64
VSpdG float64
GPSfix object
HAL float64
VAL float64
HPLwas float64
HPLfd float64
VPLwas float64
SrcFile object
dtypes: datetime64[ns](1), float64(56), object(6)
memory usage: 1.3+ GB
我生成的文件是这样的:
for file in files:
print (file + " Num: "+str(file_num)+" of: "+str(len(files)))
file_num=file_num+1
in_pd=read_file(file)
head, tail = path.split(file)
in_pd["SrcFile"]=tail
in_pd.to_hdf('AllData.h5','mydata',mode='a',append=True,complib='blosc', complevel=7,data_columns=Search_cols,min_itemsize={'SrcFile':30})
函数 read_file 只是读取一个 csv,进行一些解析并返回一个 DataFrame。
最后,ptdump -av 返回:
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/mydata (Group) ''
/mydata._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, ['DateTimeGMT', 'AtvWpt', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'OAT', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'TRK', 'volt1', 'volt2', 'amp1', 'amp2', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'E2_Torq', 'E2_NP', 'E2_NG', 'E2_ITT', 'AltGPS', 'TAS', 'HSIS', 'CRS', 'NAV1', 'NAV2', 'COM1', 'COM2', 'HCDI', 'VCDI', 'WndSpd', 'WndDr', 'WptDst', 'WptBrg', 'MagVar', 'AfcsOn', 'RollM', 'PitchM', 'RollC', 'PichC', 'VSpdG', 'GPSfix', 'HAL', 'VAL', 'HPLwas', 'HPLfd', 'VPLwas', 'SrcFile'])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'DateTimeGMT', 'Latitude', 'Longitude', 'AltB', 'BaroA', 'AltMSL', 'IAS', 'GndSpd', 'VSpd', 'Pitch', 'Roll', 'LatAc', 'NormAc', 'HDG', 'FQtyL', 'FQtyR', 'E1_FFlow', 'E1_FPres', 'E1_OilT', 'E1_OilP', 'E1_Torq', 'E1_NP', 'E1_NG', 'E1_ITT', 'TAS', 'SrcFile']]
/mydata/table (Table(2772065,), shuffle, blosc(7)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(32,), dflt=0.0, pos=1),
"values_block_1": StringCol(itemsize=6, shape=(5,), dflt='', pos=2),
"DateTimeGMT": Int64Col(shape=(), dflt=0, pos=3),
"Latitude": Float64Col(shape=(), dflt=0.0, pos=4),
"Longitude": Float64Col(shape=(), dflt=0.0, pos=5),
"AltB": Float64Col(shape=(), dflt=0.0, pos=6),
"BaroA": Float64Col(shape=(), dflt=0.0, pos=7),
"AltMSL": Float64Col(shape=(), dflt=0.0, pos=8),
"IAS": Float64Col(shape=(), dflt=0.0, pos=9),
"GndSpd": Float64Col(shape=(), dflt=0.0, pos=10),
"VSpd": Float64Col(shape=(), dflt=0.0, pos=11),
"Pitch": Float64Col(shape=(), dflt=0.0, pos=12),
"Roll": Float64Col(shape=(), dflt=0.0, pos=13),
"LatAc": Float64Col(shape=(), dflt=0.0, pos=14),
"NormAc": Float64Col(shape=(), dflt=0.0, pos=15),
"HDG": Float64Col(shape=(), dflt=0.0, pos=16),
"FQtyL": Float64Col(shape=(), dflt=0.0, pos=17),
"FQtyR": Float64Col(shape=(), dflt=0.0, pos=18),
"E1_FFlow": Float64Col(shape=(), dflt=0.0, pos=19),
"E1_FPres": Float64Col(shape=(), dflt=0.0, pos=20),
"E1_OilT": Float64Col(shape=(), dflt=0.0, pos=21),
"E1_OilP": Float64Col(shape=(), dflt=0.0, pos=22),
"E1_Torq": Float64Col(shape=(), dflt=0.0, pos=23),
"E1_NP": Float64Col(shape=(), dflt=0.0, pos=24),
"E1_NG": Float64Col(shape=(), dflt=0.0, pos=25),
"E1_ITT": Float64Col(shape=(), dflt=0.0, pos=26),
"TAS": Float64Col(shape=(), dflt=0.0, pos=27),
"SrcFile": StringCol(itemsize=30, shape=(), dflt='', pos=28)}
byteorder := 'little'
chunkshape := (125,)
autoindex := True
colindexes := {
"E1_OilT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_NG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"NormAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Pitch": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_Torq": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_FPres": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"HDG": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Longitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Latitude": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Roll": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"IAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_FFlow": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_ITT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"FQtyL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"LatAc": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"TAS": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"AltMSL": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"BaroA": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"SrcFile": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"AltB": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"DateTimeGMT": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_OilP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"E1_NP": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"VSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"FQtyR": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"GndSpd": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/flights/table._v_attrs (AttributeSet), 147 attributes:
[AltB_dtype := 'float64',
AltB_kind := ['AltB'],
AltB_meta := None,
AltMSL_dtype := 'float64',
AltMSL_kind := ['AltMSL'],
AltMSL_meta := None,
BaroA_dtype := 'float64',
BaroA_kind := ['BaroA'],
BaroA_meta := None,
CLASS := 'TABLE',
DateTimeGMT_dtype := 'datetime64',
DateTimeGMT_kind := ['DateTimeGMT'],
DateTimeGMT_meta := None,
E1_FFlow_dtype := 'float64',
E1_FFlow_kind := ['E1_FFlow'],
E1_FFlow_meta := None,
E1_FPres_dtype := 'float64',
E1_FPres_kind := ['E1_FPres'],
E1_FPres_meta := None,
E1_ITT_dtype := 'float64',
E1_ITT_kind := ['E1_ITT'],
E1_ITT_meta := None,
E1_NG_dtype := 'float64',
E1_NG_kind := ['E1_NG'],
E1_NG_meta := None,
E1_NP_dtype := 'float64',
E1_NP_kind := ['E1_NP'],
E1_NP_meta := None,
E1_OilP_dtype := 'float64',
E1_OilP_kind := ['E1_OilP'],
E1_OilP_meta := None,
E1_OilT_dtype := 'float64',
E1_OilT_kind := ['E1_OilT'],
E1_OilT_meta := None,
E1_Torq_dtype := 'float64',
E1_Torq_kind := ['E1_Torq'],
E1_Torq_meta := None,
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_10_FILL := 0.0,
FIELD_10_NAME := 'GndSpd',
FIELD_11_FILL := 0.0,
FIELD_11_NAME := 'VSpd',
FIELD_12_FILL := 0.0,
FIELD_12_NAME := 'Pitch',
FIELD_13_FILL := 0.0,
FIELD_13_NAME := 'Roll',
FIELD_14_FILL := 0.0,
FIELD_14_NAME := 'LatAc',
FIELD_15_FILL := 0.0,
FIELD_15_NAME := 'NormAc',
FIELD_16_FILL := 0.0,
FIELD_16_NAME := 'HDG',
FIELD_17_FILL := 0.0,
FIELD_17_NAME := 'FQtyL',
FIELD_18_FILL := 0.0,
FIELD_18_NAME := 'FQtyR',
FIELD_19_FILL := 0.0,
FIELD_19_NAME := 'E1_FFlow',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_20_FILL := 0.0,
FIELD_20_NAME := 'E1_FPres',
FIELD_21_FILL := 0.0,
FIELD_21_NAME := 'E1_OilT',
FIELD_22_FILL := 0.0,
FIELD_22_NAME := 'E1_OilP',
FIELD_23_FILL := 0.0,
FIELD_23_NAME := 'E1_Torq',
FIELD_24_FILL := 0.0,
FIELD_24_NAME := 'E1_NP',
FIELD_25_FILL := 0.0,
FIELD_25_NAME := 'E1_NG',
FIELD_26_FILL := 0.0,
FIELD_26_NAME := 'E1_ITT',
FIELD_27_FILL := 0.0,
FIELD_27_NAME := 'TAS',
FIELD_28_FILL := '',
FIELD_28_NAME := 'SrcFile',
FIELD_2_FILL := '',
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'DateTimeGMT',
FIELD_4_FILL := 0.0,
FIELD_4_NAME := 'Latitude',
FIELD_5_FILL := 0.0,
FIELD_5_NAME := 'Longitude',
FIELD_6_FILL := 0.0,
FIELD_6_NAME := 'AltB',
FIELD_7_FILL := 0.0,
FIELD_7_NAME := 'BaroA',
FIELD_8_FILL := 0.0,
FIELD_8_NAME := 'AltMSL',
FIELD_9_FILL := 0.0,
FIELD_9_NAME := 'IAS',
FQtyL_dtype := 'float64',
FQtyL_kind := ['FQtyL'],
FQtyL_meta := None,
FQtyR_dtype := 'float64',
FQtyR_kind := ['FQtyR'],
FQtyR_meta := None,
GndSpd_dtype := 'float64',
GndSpd_kind := ['GndSpd'],
GndSpd_meta := None,
HDG_dtype := 'float64',
HDG_kind := ['HDG'],
HDG_meta := None,
IAS_dtype := 'float64',
IAS_kind := ['IAS'],
IAS_meta := None,
LatAc_dtype := 'float64',
LatAc_kind := ['LatAc'],
LatAc_meta := None,
Latitude_dtype := 'float64',
Latitude_kind := ['Latitude'],
Latitude_meta := None,
Longitude_dtype := 'float64',
Longitude_kind := ['Longitude'],
Longitude_meta := None,
NROWS := 2772065,
NormAc_dtype := 'float64',
NormAc_kind := ['NormAc'],
NormAc_meta := None,
Pitch_dtype := 'float64',
Pitch_kind := ['Pitch'],
Pitch_meta := None,
Roll_dtype := 'float64',
Roll_kind := ['Roll'],
Roll_meta := None,
SrcFile_dtype := 'string240',
SrcFile_kind := ['SrcFile'],
SrcFile_meta := None,
TAS_dtype := 'float64',
TAS_kind := ['TAS'],
TAS_meta := None,
TITLE := '',
VERSION := '2.7',
VSpd_dtype := 'float64',
VSpd_kind := ['VSpd'],
VSpd_meta := None,
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['AfcsOn', 'AltGPS', 'COM1', 'COM2', 'CRS', 'E2_ITT', 'E2_NG', 'E2_NP', 'E2_Torq', 'HAL', 'HCDI', 'HPLfd', 'HPLwas', 'MagVar', 'NAV1', 'NAV2', 'OAT', 'PichC', 'RollC', 'TRK', 'VAL', 'VCDI', 'VPLwas', 'VSpdG', 'WndDr', 'WndSpd', 'WptBrg', 'WptDst', 'amp1', 'amp2', 'volt1', 'volt2'],
values_block_0_meta := None,
values_block_1_dtype := 'string48',
values_block_1_kind := ['AtvWpt', 'GPSfix', 'HSIS', 'PitchM', 'RollM'],
values_block_1_meta := None]
【问题讨论】:
-
你可以尝试使用 PyTables 3.2.1;虽然最后一个错误看起来像是一个损坏的文件。
-
我使用的是 3.2.1(既可以从 CSV 生成 hdf5 文件,也可以像上面那样读取)
-
显示足够的数据样本(例如 df.info())。以及您是如何创建表格的。