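【Posted】: 2021-10-11 06:09:06
【Question】:
What I need is to get the unique values of date as quickly as possible.
I retrieve them with df = store.df.date.drop_duplicates(). This line takes 6 seconds. However, if I save the same data to MySQL and index the date column, the query select distinct date from table returns the unique date values in only 80 ms, which is roughly 60 times faster than HDF5.
Is there any way to make the function read_unique_date read from HDF5 faster than MySQL does with an index?
My code is as follows: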
import pandas as pd
import numpy as np
from itertools import product
from time import time


def generate_data():
    np.random.seed(202108)
    # date = pd.date_range(start="19900101", end="20210723", freq="D")
    # The above is my original code; you can use the following code to speed up the operation.
    date = pd.date_range(start="20210101", end="20210723", freq="D")
    date = pd.DataFrame(date, columns=["date"])
    # code = pd.DataFrame(range(5000), columns=["code"])
    # The above is my original code; you can use the following code to speed up the operation.
    code = pd.DataFrame(range(50), columns=["code"])
    # generate product of the two columns:
    df = pd.DataFrame(product(date["date"], code["code"]), columns=["date", "code"])
    df['data'] = np.random.random(len(df))
    return df


def save_data(filename, df):
    store = pd.HDFStore(filename)
    store['df'] = df
    store.close()
    return


def read_unique_date(file_name):
    store = pd.HDFStore(file_name)
    start = time()
    df = store.df.date.drop_duplicates()
    store.close()
    stop = time()
    print(stop - start)
    return df


def main():
    path = 'd:\\'
    file = 'large data.h5'
    file_name = path + file
    df = generate_data()
    save_data(file_name, df)
    df1 = read_unique_date(file_name)
    print(df1)
    return df1


if __name__ == '__main__':
    main()
The result is:
0.015624761581420898
0       2021-01-01
50      2021-01-02
100     2021-01-03
150     2021-01-04
200     2021-01-05
           ...
9950    2021-07-19
10000   2021-07-20
10050   2021-07-21
10100   2021-07-22
10150   2021-07-23
Name: date, Length: 204, dtype: datetime64[ns]
%timeit df1 = read_unique_date(file_name)
16.9 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Result with my original code (the full 1990 to 2021 date range and 5000 codes):
%timeit df1 = read_unique_date(file_name)
4.89 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
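For comparison, the indexed query described in the question can be sketched as follows. This is only an illustration: SQLite (from the Python standard library) stands in for MySQL, and the table name prices, the index name idx_prices_date, and both helper functions are hypothetical, not taken from my real setup.

```python
import sqlite3
from itertools import product


def build_table(conn, dates, codes):
    """Create a (date, code) table and index the date column.

    Assumption: SQLite stands in for MySQL; table and index names are made up.
    """
    conn.execute("CREATE TABLE prices (date TEXT, code INTEGER)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)", product(dates, codes))
    conn.execute("CREATE INDEX idx_prices_date ON prices (date)")
    conn.commit()


def read_unique_dates(conn):
    """Return the distinct dates in ascending order."""
    rows = conn.execute("SELECT DISTINCT date FROM prices ORDER BY date").fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
dates = [f"2021-01-{day:02d}" for day in range(1, 11)]  # 10 days of sample dates
build_table(conn, dates, range(50))                      # 10 dates x 50 codes = 500 rows
unique_dates = read_unique_dates(conn)
print(len(unique_dates))
```

With an index on date, the database can usually answer the DISTINCT query from the index alone, without reading the code or data values at all. The HDF5 version above instead loads every column of every row before dropping duplicates.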
【Discussion】:
Tags: python pandas dataframe hdf5
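One direction that may be worth trying: store.df loads the whole DataFrame into memory before the date column is selected, so most of the time likely goes into reading the code and data columns as well. A minimal sketch, assuming the file can be rewritten in table format with date declared as a data column, reads only that one column through HDFStore.select_column. The helper name save_as_table and the file name demo.h5 are hypothetical, and whether this beats the 80 ms MySQL query on the full data set has not been measured here.

```python
import os
import tempfile
from itertools import product

import numpy as np
import pandas as pd


def save_as_table(filename, df):
    # format="table" writes a queryable PyTables table instead of the default
    # fixed format; data_columns makes "date" individually readable from disk.
    with pd.HDFStore(filename, mode="w") as store:
        store.put("df", df, format="table", data_columns=["date"])


def read_unique_date(filename):
    # select_column reads a single column as a Series, so the other
    # columns are never loaded into memory.
    with pd.HDFStore(filename, mode="r") as store:
        return store.select_column("df", "date").unique()


# Small sample with the same shape as the data in the question.
dates = pd.date_range("2021-01-01", "2021-01-10", freq="D")
df = pd.DataFrame(list(product(dates, range(50))), columns=["date", "code"])
df["data"] = np.random.random(len(df))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    save_as_table(path, df)
    unique_dates = read_unique_date(path)

print(len(unique_dates))
```

Table format is generally slower to write and produces a larger file than the default fixed format, so this trades write cost for a cheaper single-column read.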