Pandas + SQLAlchemy 默认将所有object(字符串)列保存为Oracle DB中的CLOB,这使得插入非常缓慢。
这里有一些测试:
import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine
#######################################################
### DB connection strings config
#######################################################
tns = """
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = my_service_name)
)
)
"""
usr = "test"
pwd = "my_oracle_password"
engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))
# sample DF [shape: `(2000, 11)`]
# i took your 2 rows DF and replicated it: `df = pd.concat([df]* 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')
DF 信息:
In [61]: df.shape
Out[61]: (2000, 11)
In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id 2000 non-null int64
name 2000 non-null object
premium 2000 non-null float64
created_date 2000 non-null datetime64[ns]
init_p 2000 non-null float64
term_number 2000 non-null int64
uprate 1000 non-null float64
value 2000 non-null int64
score 2000 non-null float64
group 2000 non-null int64
action_reason 2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB
让我们看看将它存储到 Oracle DB 需要多长时间:
In [57]: df.shape
Out[57]: (2000, 11)
In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop
在 Oracle DB 中(注意 CLOB):
AAA> desc test.test_table
Name Null? Type
------------------------------- -------- ------------------
ID NUMBER(19)
NAME CLOB # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON CLOB # !!!
现在让我们指示 pandas 将所有 object 列保存为 VARCHAR 数据类型:
In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
...: for c in df.columns[df.dtypes == 'object'].tolist()}
...:
In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop
这次大约是。快 48 倍
签入 Oracle DB:
AAA> desc test.test_table
Name Null? Type
----------------------------- -------- ---------------------
ID NUMBER(19)
NAME VARCHAR2(13 CHAR) # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON VARCHAR2(8 CHAR) # !!!
让我们用 200.000 行 DF 测试它:
In [69]: df.shape
Out[69]: (200000, 11)
In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop
在我的测试(不是最快的)环境中,200K 行 DF 需要大约 5 秒。
结论:在将 DataFrames 保存到 Oracle DB 时,使用以下技巧为 object dtype 的所有 DF 列显式指定 dtype。否则会被保存为 CLOB 数据类型,需要特殊处理,速度很慢
dtyp = {c:types.VARCHAR(df[c].str.len().max())
for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql(..., dtype=dtyp)