Importing CSV to PostgreSQL while skipping duplicates

【Question title】: Importing CSV to postgreSQL while skipping duplicates
【Posted】: 2020-04-27 16:10:25
【Question】:

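I'm a complete Python (and coding) beginner, so this might be ugly.

I have a CSV file that I want to import into my PostgreSQL database. The CSV has a large number of duplicates that I don't want. I believe I can read the CSV fine and add to the database fine, but I'm having trouble skipping the duplicates. Every time I run the code below, I get one row inserted and then it fails.

I'm only looking at the key for now, but once that works there are a whole bunch of other columns, and adding [...] probably won't be a problem.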

# Setup

import csv
import psycopg2


# Read a value from the CSV to see if it's in dbItems

with open('meh_0.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)

    connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
    cursor = connection.cursor()
    cursor.execute('SELECT handid FROM handlist')
    dbItems = cursor.fetchall()

    print(dbItems)

    for i in range(0, 200):
        rowKey = next(reader)
        print('rowKey[0] is: ' + rowKey[0])

        found = False
        for row in dbItems:
            for element in row:
                if element == int(rowKey[0]):
                    found = True
                    break

            if found:
                break


# Then either add to the DB or skip

        if not found:
            print(rowKey[0] + ' NOT found in dbItems\n')
            sqlCommand = 'INSERT INTO handlist VALUES (' + rowKey[0] + ')'
            cursor.execute(sqlCommand)
            connection.commit()

        else:
            print(rowKey[0] + ' is found in dbItems\n')

I may have moved some unneeded things into my "while" loop while trying to see what changed. Oh, and the range maximum of 200 is arbitrary; the CSV file is huge.

Error:

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756717
34756717 is found in dbItems

rowKey[0] is: 34756718
34756718 NOT found in dbItems

rowKey[0] is: 34756718
34756718 NOT found in dbItems

Traceback (most recent call last):
  File "C:/Python/MyPythonScripts/RIO r5.py", line 40, in <module>
    cursor.execute(sqlCommand)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "handlist_pkey"
DETAIL:  Key (handid)=(34756718) already exists.

>>>

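So it skips all the keys that I added on previous runs and adds the new ones, but it doesn't skip a new key when the same key comes up again later in the loop.

Mainly, I'd like to know why it doesn't work. But I imagine there are also much simpler ways to do this, and I'm happy to copy them if so.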

【Comments】:

You have pre-existing rows in the database, your csv contains a key for each csv row, and some rows in the csv are duplicates of the pre-existing data and have the same key. Is that right?
That's right. Maybe this helps; this is what's in the database right now, at least as far as the code knows: >>> dbItems [(123,), (234,), (34756712,), (34756713,), (34756714,), (34756715,), (34756716,), (34756717,), (34756718,), (34756719,), (34756720,), (34756721,), (34756722,), (34756723,)] and rowKey is: >>> rowKey ['34756724', '83', '63', '32801031', '3', '6', '1', '\\N', '34620923', '29/05/2019 12:08', '0', '30545092', '29/05/2019 12:08', '\\N', '0', '34756708', '75', '10/09/2019 14:47', '\\N', 'O', '50', '25', '1', '50', 'pot']

Tags: python python-3.x csv

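【Solution 1】:

You can add the existing keys to a set, then check whether the key in each csv row is a member of the set. The cost of checking set membership is, on average, independent of the size of the set, so a set is a good data structure to use here. If the csv key is in the set, we move on to the next row; otherwise we add the row to the database and add the key to the set.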

# Setup

import csv
import psycopg2


# Read a value from the CSV to see if it's in dbItems

with open('meh_0.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)

    connection = psycopg2.connect("host=localhost dbname=postgres user=postgres port=5433 password=removed")
    cursor = connection.cursor()
    cursor.execute('SELECT handid FROM handlist')
    dbItems = cursor.fetchall()

    # Build a set of existing keys
    existing = {k for k, in dbItems}


    for i in range(0, 200):
        rowKey = next(reader)
        print('rowKey[0] is: ' + rowKey[0])

        # Database keys are ints, csv values are strings...
        candidate = int(rowKey[0])

        if candidate in existing:
            # back to the top of the for loop
            continue

        # Add to the DB

        # Use the recommended way of building queries
        # https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries
        sqlCommand = 'INSERT INTO handlist VALUES (%s)'
        cursor.execute(sqlCommand, (candidate,))
        connection.commit()

        # Add our key to the set
        existing.add(candidate)
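
The dedup logic above can be tried without a database. The following is a minimal sketch and not part of the original answer: `keys_to_insert` is a hypothetical helper, and the sample data is a shortened, made-up version of the `dbItems` and CSV values from the question. It shows how `{k for k, in db_items}` unpacks the one-element tuples that `fetchall()` returns, and why adding each new key to the set lets a repeated CSV row be skipped.

```python
def keys_to_insert(db_items, csv_keys):
    """Return the CSV keys that are not yet in the database, without repeats.

    db_items: list of one-element tuples, as returned by cursor.fetchall()
    csv_keys: iterable of key strings, as read by csv.reader
    """
    # "k," unpacks each one-element tuple, so (123,) becomes 123
    existing = {k for k, in db_items}
    fresh = []
    for raw in csv_keys:
        candidate = int(raw)  # CSV values are strings, database keys are ints
        if candidate in existing:
            continue
        fresh.append(candidate)
        # Remember the key so a later copy of the same row is skipped
        existing.add(candidate)
    return fresh


# Made-up sample data: one key already in the database, two new keys repeated
db_items = [(34756716,), (34756717,)]
csv_keys = ['34756717', '34756717', '34756718', '34756718', '34756719']
result = keys_to_insert(db_items, csv_keys)
print(result)
```

Without the `existing.add(candidate)` line, the second `'34756718'` would be treated as new again, which corresponds to the `UniqueViolation` in the question.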

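【Comments】:

This works. My account is too new for my upvote to show, but thank you. The differing data types definitely tripped me up, and I can't say I understand the set, but that's something new to learn. What does %s do when creating the SQL command? I don't think that part worked for me. I changed it to include rowKey[0], and dbItems now contains all the keys, but maybe I don't need that?
%s and the corresponding second argument to cursor.execute are used to ensure that values in the SQL statement are quoted correctly for their data type. This ensures, for example, that 2020-04-27 is interpreted as a date rather than as an arithmetic expression and, more importantly, that ;DELETE FROM mytable; is interpreted as a string rather than as SQL (a.k.a. "SQL injection"). The link in the comment above the sqlCommand = line explains how to use query parameterization in psycopg2 and the dangers of SQL injection.
You might reasonably think that only you will ever use your script, so there is no risk of SQL injection. Even so, it's better to use query parameterization from the start, so that if you ever work on a user-facing application you will do the right thing without thinking, and can spot problems in other people's code. Obligatory xkcd reference. Example from twitter today.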
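
To see what parameterization does without needing a PostgreSQL server, here is a small sketch using Python's built-in `sqlite3` module as a stand-in. This is an assumption made for illustration: `sqlite3` uses `?` as its placeholder where psycopg2 uses `%s`, and the `notes` table and the `hostile` string are made up. The idea is the same in both libraries: the value travels separately from the SQL text, so it is stored as data rather than executed.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (body TEXT)')

# A value that would break the statement if it were concatenated into the SQL
hostile = "x'); DELETE FROM notes; --"

# The values are passed as a separate tuple, so they are never parsed as SQL
conn.execute('INSERT INTO notes VALUES (?)', ('first note',))
conn.execute('INSERT INTO notes VALUES (?)', (hostile,))
conn.commit()

# Both rows are still there, and the hostile text was stored as a plain string
rows = [body for body, in conn.execute('SELECT body FROM notes ORDER BY rowid')]
print(rows)
conn.close()
```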