Pandas to_sql make index unique (answers)

【Question title】: Pandas to_sql make index unique
【Posted】: 2017-09-02 18:11:20
【Question】:

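I have been reading about pandas to_sql solutions for not adding duplicate records to a database. I am working with csv files of logs; each time I upload a new log file, I read the data and make some changes with pandas, creating a new dataframe. Then I execute to_sql('Logs', con=db.engine, if_exists='append', index=True). With the if_exists argument I make sure that each new dataframe created from a new file is appended to the existing database. The problem is that it keeps adding duplicate values. I want to make sure that if a file that has already been uploaded is uploaded again by mistake, it is not appended to the database. I would like to enforce this directly when creating the database, without resorting to a workaround such as checking whether the filename has been used before.

I am using flask-sqlalchemy.

Thank you.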

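【Comments】:

There was a recent discussion about adding upserts to Pandas. TL;DR: it is currently considered out of scope for Pandas, because it is hard to keep database-agnostic. (Replacing entries with their duplicates is a kind of upsert.)
Is there a way to ignore the dataframe on duplicates instead of replacing entries? The log files are generated monthly. Really, all I care about is not re-appending a dataframe that has already been added to the database, in case someone uploads the same file twice by mistake. In another post I saw that a possible solution is to use sqlite3.IntegrityError, but that did not work for me.
For future readers: a solution I have used for several years, which is slow but works well, is to iterate over the DataFrame (yes, I know...) and try to insert each row with to_sql. In the except block, test whether '1062' appears in the error output, since that indicates a duplicate.
You can also let the database engine do its job of checking uniqueness, by creating the table structure before using pandas to_sql and specifying the unique constraints in the database structure.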
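The last two comments can be combined into a minimal sketch: create the table with a primary key before pandas touches it, then insert one row at a time and skip the rows the database rejects. The table name `logs_demo`, the column `value`, the helper `append_rows`, and the temporary SQLite file are all hypothetical names chosen for illustration. The sketch catches SQLAlchemy's `IntegrityError` rather than searching the message for `'1062'`, which is the MySQL duplicate-entry code and would not appear with SQLite.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import Column, Integer, MetaData, Table, create_engine
from sqlalchemy.exc import IntegrityError

# Hypothetical throwaway database file, for illustration only.
db_path = os.path.join(tempfile.mkdtemp(), "logs_demo.db")
engine = create_engine(f"sqlite:///{db_path}")

# Create the table *before* calling to_sql, so the primary key exists
# and the database itself enforces uniqueness of "id".
metadata = MetaData()
Table(
    "logs_demo",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("value", Integer, nullable=False),
)
metadata.create_all(engine)


def append_rows(frame):
    """Insert one row at a time; return the index labels that were skipped."""
    skipped = []
    for i in range(len(frame)):
        row = frame.iloc[i : i + 1]  # a one-row DataFrame
        try:
            row.to_sql("logs_demo", con=engine, index=True,
                       index_label="id", if_exists="append")
        except IntegrityError:
            # The primary key rejected this row: it is already stored.
            skipped.append(int(row.index[0]))
    return skipped


first = pd.DataFrame({"value": [1, 2, 3]})
second = pd.DataFrame({"value": [3, 4, 5]}, index=[2, 3, 4])

skipped_first = append_rows(first)
skipped_second = append_rows(second)
stored = pd.read_sql("SELECT id, value FROM logs_demo ORDER BY id", con=engine)
```

As the comment warns, this issues one INSERT per row, so it is only reasonable for small uploads.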

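Tags: pandas unique primary-key flask-sqlalchemy pandas-to-sql

【Answer 1】:

Your best bet is to catch duplicates by setting the index as a primary key, and then use try/except to catch uniqueness violations. You mentioned another post that suggested watching for IntegrityError exceptions, and I agree that is the best approach. You can combine it with a de-duplication function to make sure your table updates run smoothly.

Demonstrating the problem

Here is a toy example: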

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# make a database, 'test', and a table, 'foo'.
conn = sqlite3.connect("test.db")
c = conn.cursor()
# id is a primary key.  this will be the index column imported from to_sql().
c.execute('CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL);')
# use the sqlalchemy engine.
engine = create_engine('sqlite:///test.db')

pd.read_sql("pragma table_info(foo)", con=engine)

   cid name     type  notnull dflt_value  pk
0    0   id  integer        0       None   1
1    1  foo  integer        1       None   0

Now, two example dataframes, df and df2:

data = {'foo':[1,2,3]}
df = pd.DataFrame(data)
df
   foo
0    1
1    2
2    3

data2 = {'foo':[3,4,5]}
df2 = pd.DataFrame(data2, index=[2,3,4])
df2
   foo
2    3       # this row is a duplicate of df.iloc[2,:]
3    4
4    5

Move df into the table foo:

df.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')

pd.read_sql('foo', con=engine)
   id  foo
0   0    1
1   1    2
2   2    3

Now, when we try to append df2, we catch the IntegrityError:

try:
    df2.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
# use the generic Exception, both IntegrityError and sqlite3.IntegrityError caused trouble.
except Exception as e: 
    print("FAILURE TO APPEND: {}".format(e))

Output:

FAILURE TO APPEND: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]
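The `(sqlite3.IntegrityError)` prefix in that message is how SQLAlchemy prints a driver error it has wrapped, so the exception that reaches the `except` clause is `sqlalchemy.exc.IntegrityError`. That may be why catching `sqlite3.IntegrityError` directly did not match. A minimal sketch of the narrower `except`, assuming the same toy table `foo` in a hypothetical temporary database file:

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

# Hypothetical temporary database holding the same toy table as above.
db_path = os.path.join(tempfile.mkdtemp(), "narrow_except.db")
engine = create_engine(f"sqlite:///{db_path}")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL)"))

df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])
df.to_sql("foo", con=engine, index=True, index_label="id", if_exists="append")

caught = None
try:
    df2.to_sql("foo", con=engine, index=True, index_label="id", if_exists="append")
except IntegrityError as e:
    # e is SQLAlchemy's wrapper; e.orig is the underlying sqlite3 exception.
    caught = e
    print("FAILURE TO APPEND: {}".format(e.orig))

# The failed append is rolled back as a whole, so only df's rows remain.
row_count = int(pd.read_sql("SELECT COUNT(*) AS n FROM foo", con=engine)["n"].iloc[0])
```

Catching the specific class lets unrelated failures, such as a missing table or a bad connection string, surface instead of being reported as duplicates.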

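Proposed solution

On an IntegrityError, you can pull the existing table data, drop the entries of the new data that duplicate it, and then retry the append statement. Use apply() for this (apply() calls the function once per column, which here means once for the single column foo):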

def append_db(data):
    try:
        data.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
        return 'Success'
    except Exception as e:
        print("Initial failure to append: {}\n".format(e))
        print("Attempting to rectify...")
        existing = pd.read_sql('foo', con=engine)
        to_insert = data.reset_index().rename(columns={'index':'id'})
        mask = ~to_insert.id.isin(existing.id)
        try:
            to_insert.loc[mask].to_sql('foo', con=engine, index=False, if_exists='append')
            print("Successful deduplication.")
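        except Exception as e2:
            # report the failure (the original built this string but never printed it)
            print("Could not rectify duplicate entries. \n{}".format(e2))
            return 'Failure after dedupe'
        return 'Success after dedupe'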

df2.apply(append_db)

Output:

Initial failure to append: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]

Attempting to rectify...
Successful deduplication.

foo    Success after dedupe
dtype: object
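A variation on append_db, offered only as a sketch: query the existing ids first and insert only the rows that are new, so no failed insert is needed and the function can be called on the DataFrame directly. The helper name `append_new_rows` and the temporary database file are hypothetical, and the table and key names are interpolated into the SQL, so they are assumed to be trusted constants.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical temporary database holding the same toy table as above.
db_path = os.path.join(tempfile.mkdtemp(), "filter_first.db")
engine = create_engine(f"sqlite:///{db_path}")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL)"))


def append_new_rows(frame, table="foo", key="id"):
    """Append only rows whose index is not yet stored; return how many were added."""
    # Fetch just the key column rather than the whole table.
    existing = pd.read_sql(f"SELECT {key} FROM {table}", con=engine)[key]
    # Turn the index into a regular column named after the key.
    candidates = frame.rename_axis(key).reset_index()
    new_rows = candidates.loc[~candidates[key].isin(existing)]
    if not new_rows.empty:
        new_rows.to_sql(table, con=engine, index=False, if_exists="append")
    return len(new_rows)


df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])

added_first = append_new_rows(df)      # empty table, every row is new
added_second = append_new_rows(df2)    # id 2 already exists and is dropped
added_again = append_new_rows(df2)     # same upload repeated, nothing to add
stored = pd.read_sql("SELECT id, foo FROM foo ORDER BY id", con=engine)
```

The check and the insert are separate statements, so another writer could still slip in between them; the primary key remains the real guarantee.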

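【Discussion】:

Thanks for your reply, but in the post that mentioned IntegrityError as a solution, it did not need any extra steps. I would really like to avoid creating a temporary database after all. At first I was using Flask-SQLAlchemy, and I thought that by defining the model and setting my index as the primary key it would work, but it did not (I guess the tables in the model and the tables created by pandas are different after all). Is there a way to set my primary key directly with pandas, or a solution with SQLAlchemy?
You cannot set schema details with Pandas, although you can specify an existing schema with the schema parameter. You did not say why catching IntegrityError did not work, which is why I demonstrated a working solution. That solution does use SQLAlchemy... I am afraid I am not quite clear on what exactly your problem is. Please consider updating your original post with a proper MCVE.
Please review the MCVE guidelines: the example code you posted is not minimal, complete, or verifiable. If you specified the primary key correctly in the table, you would not get duplicates, but you would get an error when using to_sql. My solution does not create a temporary database, but it does check the existing entries for duplicates. I am not sure you can get around that step.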
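One possible way to avoid reading the table back, which is not part of the answer above, is to let the database skip conflicting rows itself through the `method` argument of to_sql. The sketch below uses SQLAlchemy's SQLite-specific `insert(...).on_conflict_do_nothing(...)`, so it is not database-agnostic and would need the matching construct on another backend. The callable name `insert_or_ignore` and the temporary database file are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.dialects.sqlite import insert

# Hypothetical temporary database holding the same toy table as the answer.
db_path = os.path.join(tempfile.mkdtemp(), "on_conflict.db")
engine = create_engine(f"sqlite:///{db_path}")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL)"))


def insert_or_ignore(pd_table, conn, keys, data_iter):
    """to_sql insertion callable: rows whose id already exists are skipped."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = (
        insert(pd_table.table)
        .values(rows)
        .on_conflict_do_nothing(index_elements=["id"])  # assumes "id" is the key
    )
    result = conn.execute(stmt)
    return result.rowcount


df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])

for frame in (df, df2, df2):  # the last upload is an accidental repeat
    frame.to_sql("foo", con=engine, index=True, index_label="id",
                 if_exists="append", method=insert_or_ignore)

stored = pd.read_sql("SELECT id, foo FROM foo ORDER BY id", con=engine)
```

The table still has to be created with its primary key beforehand, since the conflict clause relies on that constraint.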