pandas to_sql truncates my data: answers

【Question title】: pandas to_sql truncates my data
【Posted】: 2017-04-27 12:56:24
【Question】:

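I use df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') to export a data frame to MySQL. However, I found that columns with long string content (such as URLs) are truncated to 63 characters. During the export I got the following warning in the IPython notebook:

/usr/local/lib/python2.7/site-packages/pandas/io/sql.py:248: Warning: Data truncated for column 'url' at row 3
  cur.executemany(insert_query, data)

There are more warnings of the same form for other rows.

Is there anything I can tweak to export the full data correctly? I could create the table with the right schema in MySQL first and then export into it, but I am hoping for a tweak that makes it work directly from Python.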
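
To see which columns would be hit by such a limit before exporting, one can measure the longest string in each column. The sketch below is only an illustration: the helper name columns_exceeding and the sample data frame are made up, and the default limit of 63 is simply the truncation length observed above.

```python
import pandas as pd


def columns_exceeding(df, limit=63):
    """Return {column name: longest string length} for columns that
    contain at least one string longer than `limit` characters."""
    too_long = {}
    for name in df.columns:
        # Only look at real Python strings; numbers, None and NaN are skipped.
        lengths = [len(value) for value in df[name] if isinstance(value, str)]
        if lengths and max(lengths) > limit:
            too_long[name] = max(lengths)
    return too_long


# Hypothetical sample data: one long URL, one short URL.
sample = pd.DataFrame({
    "id": [1, 2],
    "url": ["http://example.com/" + "a" * 100, "http://example.com/short"],
})

report = columns_exceeding(sample)
print(report)
```

Any column listed in the report would lose data if it were written to a VARCHAR (63) column.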

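【Comments】:

What version of pandas are you using?
0.12.0 when I ran into the problem. I have just upgraded to 0.13.1, the latest version that pip install provides. But judging from your answer, 0.12.0 would have the same problem.

Tags: python mysql sql pandas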

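【Solution 1】:

If you are using pandas 0.13.1 or earlier, this 63-character limit is indeed hard-coded, because of this line in the source: https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L278

As a workaround, you could monkey-patch the get_sqltype function: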

import datetime

import numpy as np
from pandas.io import sql

def get_sqltype(pytype, flavor):
    # Default type, used for strings and anything not matched below.
    sqltype = {'mysql': 'VARCHAR (63)',    # <-- change this value to something sufficiently large
               'sqlite': 'TEXT'}

    if issubclass(pytype, np.floating):
        sqltype['mysql'] = 'FLOAT'
        sqltype['sqlite'] = 'REAL'
    if issubclass(pytype, np.integer):
        sqltype['mysql'] = 'BIGINT'
        sqltype['sqlite'] = 'INTEGER'
    if issubclass(pytype, np.datetime64) or pytype is datetime.datetime:
        sqltype['mysql'] = 'DATETIME'
        sqltype['sqlite'] = 'TIMESTAMP'
    if pytype is datetime.date:
        sqltype['mysql'] = 'DATE'
        sqltype['sqlite'] = 'TIMESTAMP'
    if issubclass(pytype, np.bool_):
        sqltype['sqlite'] = 'INTEGER'

    return sqltype[flavor]

# Replace the function inside pandas.io.sql with the patched version.
sql.get_sqltype = get_sqltype

Then simply run your code as before:

df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql')
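
A patch like this can only take effect if the code that builds the table schema looks up get_sqltype in the module namespace at call time. The following sketch illustrates that mechanism with a stand-in module; fake_sql, get_schema and the simplified get_sqltype signature are hypothetical and are not the pandas implementation.

```python
import types

# Stand-in module, NOT pandas: it mimics a module whose schema builder
# calls a type-mapping function defined in the same module.
fake_sql = types.ModuleType("fake_sql")
exec('''
def get_sqltype(kind):
    return "VARCHAR (63)" if kind == "str" else "BIGINT"

def get_schema(columns):
    # get_sqltype is resolved in the module globals each time this runs.
    return ", ".join("%s %s" % (name, get_sqltype(kind)) for name, kind in columns)
''', fake_sql.__dict__)

before = fake_sql.get_schema([("url", "str")])


def patched_get_sqltype(kind):
    # Same mapping, but strings now get an unbounded TEXT column.
    return "TEXT" if kind == "str" else "BIGINT"


# Monkey patch: rebind the name inside the module.
fake_sql.get_sqltype = patched_get_sqltype

after = fake_sql.get_schema([("url", "str")])
print(before)
print(after)
```

Because get_schema resolves the name on every call, rebinding it in the module changes the generated column type without touching get_schema itself.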

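Starting with pandas 0.14, the sql module uses sqlalchemy under the hood, and strings are converted to the sqlalchemy TEXT type, which becomes the MySQL TEXT type (rather than VARCHAR). This also lets you store strings longer than 63 characters:

import sqlalchemy

engine = sqlalchemy.create_engine('mysql://scott:tiger@localhost/foo')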
df.to_sql('testdata', engine, if_exists='replace')

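The problem only persists if you still pass a DBAPI connection instead of a sqlalchemy engine, but that option is deprecated, and passing a sqlalchemy engine to to_sql is recommended.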
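
With the engine-based path, later pandas releases also accept a dtype argument to to_sql, which lets you choose the column type explicitly. The sketch below is a minimal round trip under some assumptions: an in-memory SQLite engine stands in for MySQL so that it runs without a server, and the table and column names are only examples.

```python
import pandas as pd
import sqlalchemy

# Assumption: in-memory SQLite replaces MySQL so the sketch runs anywhere.
# For MySQL the URL would look like 'mysql://scott:tiger@localhost/foo'.
engine = sqlalchemy.create_engine("sqlite://")

long_url = "http://example.com/" + "a" * 200
df = pd.DataFrame({"id": [1], "url": [long_url]})

# Ask for an unbounded text column for 'url' instead of relying on defaults.
df.to_sql("testdata", engine, if_exists="replace", index=False,
          dtype={"url": sqlalchemy.types.Text()})

# Read the value back and compare it with what was written.
back = pd.read_sql("SELECT url FROM testdata", engine)
stored = back["url"].iloc[0]
print(stored == long_url)
```

On MySQL, a bounded type such as sqlalchemy.types.VARCHAR(2048) could be passed in the same way if a TEXT column is not wanted.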

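【Comments】:

I ran into the same problem with pandas 0.14. If I understand the code correctly (github.com/pydata/pandas/blob/v0.14.0/pandas/io/sql.py#L847), it is still hard-coded to varchar(63). I had to change it to make it work.
Should this be reported as an issue on GitHub?
This only affects the deprecated mysql flavor that uses a DBAPI connection. The new implementation that uses a sqlalchemy engine does not have the problem, so I don't think it is worth changing.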

【Solution 2】:

Inspired by @joris's answer, I decided to hard-code the change into the pandas source code and recompile.

cd /usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io
sudo pico sql.py

Change line 871 from

'mysql': 'VARCHAR (63)',

to

'mysql': 'VARCHAR (255)',

Then recompile that file:

sudo python -m py_compile sql.py

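After restarting my script, the to_sql() function wrote the table. (I expected the recompile to break pandas, but it does not seem to have.)

Here is my script that writes a data frame to MySQL, for reference.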
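
The path and line number used above depend on the installation and the pandas version. As a hedged sketch, the location of the installed sql.py can be looked up from Python rather than typed by hand; the variable names here are illustrative only.

```python
import importlib.util

# Find where the installed pandas.io.sql module lives on this machine.
spec = importlib.util.find_spec("pandas.io.sql")
sql_path = spec.origin
print(sql_path)
```

The printed path is the file to inspect. Any edit made there is lost when pandas is upgraded or reinstalled.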

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlalchemy 
from sqlalchemy import create_engine
df = pd.read_csv('10k.csv')
## ... dataframe munging
df = df.where(pd.notnull(df), None)  # workaround for NaN bug: replace NaN with None
engine = create_engine('mysql://user:password@localhost:3306/dbname')
con = engine.connect().connection  # underlying DBAPI connection, as used by the legacy 'mysql' flavor
df.to_sql("issues", con, 'mysql', if_exists='replace', index=True, index_label=None)

【Comments】: