【Question title】: Accessing large datasets with Python 3.6, psycopg2 and pandas
【Posted】: 2023-03-04 10:23:01
【Question description】:

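I am trying to pull a 1.7G file from a Greenplum postgres data source into a pandas DataFrame. The psycopg2 driver takes about 8 minutes to load it. Using the pandas "chunksize" parameter does not help, because the psycopg2 driver selects all of the data into memory and only then hands it to pandas, using far more than 2G of RAM.

To get around this, I am trying to use a named cursor, but all the examples I have found loop through row by row, which seems slow. The main problem, though, is that my SQL stops working in the named query for some unknown reason.

Goals

Load the data as quickly as possible without doing any "unnatural acts"
Use SQLAlchemy if possible - it is used elsewhere, for consistency
Have the results in a pandas DataFrame for fast in-memory processing (alternatives?)

Have a "pythonic" (elegant) solution. I would love to do this with a context manager, but have not gotten that far yet.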

# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras

# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000

# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
# result returns None (normal behavior) but result is not iterable

df = pd.DataFrame(result.fetchall())

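The pandas call fails with AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure appears to be caused by using the named cursor. I have already tried fetchone, fetchmany, etc. Note that the goal here is to let the server chunk the data and serve it up in large pieces, so that there is a balance between bandwidth and CPU usage. Looping through df = df.append(row) is just plain ugly.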
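The traceback is consistent with the DB-API convention that `execute()` returns nothing useful: in psycopg2 it returns None, so the fetch methods have to be called on the cursor object itself. A minimal sketch of chunked fetching with `fetchmany`, wrapped in context managers, follows. It uses the standard-library `sqlite3` module as a stand-in so it runs without a Greenplum server. The helper names `iter_chunks` and `demo_chunk_sizes` and the table `t` are hypothetical, and the assumption is that the same loop applies unchanged to a psycopg2 named cursor.

```python
import sqlite3
from contextlib import closing


def iter_chunks(cursor, chunk_size):
    """Yield lists of rows from an already-executed DB-API cursor."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:  # an empty list means the result set is exhausted
            break
        yield rows


def demo_chunk_sizes():
    """Fetch 10 rows in chunks of 4 and return the size of each chunk."""
    # sqlite3 stands in for psycopg2 here (an assumption for illustration only).
    with closing(sqlite3.connect(":memory:")) as conn, closing(conn.cursor()) as cur:
        cur.execute("CREATE TABLE t (id INTEGER)")
        cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
        # Ignore the return value of execute(); fetch from the cursor itself.
        cur.execute("SELECT id FROM t ORDER BY id")
        return [len(chunk) for chunk in iter_chunks(cur, 4)]
```

Each chunk could then be wrapped in a DataFrame and the pieces combined with a single `pd.concat` call, which avoids growing the frame row by row.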

See related questions (not the same issue):

Added standard client-side chunking code per request

nrows = 3652504
size = nrows // 1000  # integer division: chunksize must be an int in Python 3
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx,ignore_index=True)

【Question discussion】:

Tags: python postgresql pandas psycopg2

【Solution 1】:

UPDATE:

# Chunked access
import time

import pandas as pd
from sqlalchemy import create_engine

start = time.time()
engine = create_engine(conn_str)  # conn_str: the asker's connection string
size = 10**4
df = pd.concat(pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size),
               ignore_index=True)
elapsed = time.time() - start
print('time:', elapsed / 60, 'minutes or', elapsed, 'seconds')

OLD answer:

I would try using the built-in pandas method for reading from PostgreSQL, read_sql():

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user@localhost:5432/dbname')

df = pd.read_sql(sql_query, engine)

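【Discussion】:

This is exactly what I do in my other queries, but if you notice, I created a named cursor, which is neither a connection nor a query. I have not yet tried handing the named cursor to the DataFrame, nor have I found a good example.
@Harvey, I don't quite understand why you need a named cursor - how would it help?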
Yep, that blows up: DatabaseError: Execution failed on sql '': argument 1 must be a string or unicode object. Bits everywhere. That's too bad. :)
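A named cursor allows the data to be chunked on the server side. For some reason that escapes me, the 1.7G database (once in pandas) uses ~16GB of RAM. Other drivers (e.g. pg8000) use 13GB of RAM and take about 20 minutes. The idea is to interleave the server's chunking with pandas' processing time so that both stay busy at the same time. But first I have to get it to work.
@Harvey, I have updated my answer - could you check whether it helps?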