Reading a large amount of data from an Access database: answers

【Question title】: Reading large amount of data from Access database
【Posted】: 2016-02-11 15:11:37
【Question description】:

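I am looking for advice on how to solve my specific problem (a MemoryError caused by storing too much information in one variable), as well as general advice on other ways I could approach it.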

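I have an Access 1997 database that I am trying to extract data from. Because I have Access 2013 installed, I cannot open the database without downloading Access 2003. No problem: I can do the extraction in Python using pyodbc and Jet.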

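I established a pyodbc cursor connection to the database and wrote this function to query all the table names first, and then all the columns associated with those tables: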

def get_schema(cursor):
    """
    :param cursor: Cursor object to database
    :return: Dictionary with table name as key and list of columns as value
    """
    db_schema = dict()
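    # cursor.tables() returns metadata rows; keep only the table names.
    tbls = [row.table_name for row in cursor.tables().fetchall()]

    for tbl in tbls:
        if tbl not in db_schema:
            db_schema[tbl] = list()
        column_names = list()
        for col in cursor.columns(table=tbl):
            column_names.append(col[3])  # Index 3 is the column name
        db_schema[tbl].append(tuple(column_names))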

    return db_schema

The resulting variable looks something like this:

{'Table 1': [('Column 1-1', 'Column 1-2', 'Column 1-3')],
 'Table 2': [('Column 2-1', 'Column 2-2')]}

I then pass that schema variable to another function to dump the data from each table into a list of tuples:

def get_table_data(cursor, schema):

    for tbl, cols in schema.items():

        # Dump data; brackets allow table names that contain spaces
        sql = "SELECT * FROM [%s]" % tbl
        cursor.execute(sql)
        col_data = cursor.fetchall()

        for row in col_data:
            cols.append(row)

    return schema

However, when I try to read the returned variable, I get the following:

>>> schema2 = get_table_data(cursor, schema)
>>> schema2
Traceback (most recent call last):
  File "<input>", line 1, in <module>
MemoryError

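TL;DR: Is there a way to start storing data in another variable when the data becomes too large to read? Or a way to increase the memory allocation? Ultimately, I would like to dump this into CSV files or something similar. Is there a more direct way to go about this?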

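【Comments】:

This is not a "single-variable memory limit". I guess your tables are just big. Try not to read all of the data at once. Rewrite get_table_data as a generator and read the data row by row.
If you can successfully export the Access 97 data to CSV files, what do you want to do with those CSV files? If your intention is to import them into Access 2013, the task could be much simpler.
@AlexBelyaev: Would making it a generator mean passing in one table name at a time?
@HansUp: My intention is to import them into Oracle 11g using a vendor-supplied migration tool that takes CSV files... but if there is an easy way to get them into Access 2013, I could work with it from there.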

Tags: python ms-access pyodbc data-extraction

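【Solution 1】:

You probably want to stream the data out of the database rather than load it all at once. That way, you can write the data straight back out without ever holding too much of it in memory.

The best way to do this is with generators.

So, instead of modifying the schema variable as you do now, yield the rows as you read them from each database table: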

def get_single_table_data(cursor, tbl):
    '''
    Generator to get all data from one table.
    Does this one row at a time, so we don't load
    too much data in at once
    '''
    # Brackets allow table names that contain spaces
    sql = "SELECT * FROM [%s]" % tbl
    cursor.execute(sql)
    while True:
        row = cursor.fetchone()
        if row is None:
            break
        yield row

def print_all_table_data(cursor, schema):
    for tbl, cols in schema.items():
        print(cols)
        rows = get_single_table_data(cursor, tbl)
        for row in rows:
            print(row)

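This is obviously just an example, but it will (in theory) print every row of every table, without ever having more than one row of data in memory at a time.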
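Fetching one row per call keeps memory use minimal, but it may be slow on large tables. Below is a minimal sketch of a batched variant, assuming the cursor follows the Python DB-API and provides fetchmany, as pyodbc cursors do. The name iter_table_rows and the batch_size default of 1000 are illustrative and not part of the original answer.

```python
def iter_table_rows(cursor, tbl, batch_size=1000):
    """Yield rows from one table, fetching batch_size rows at a time.

    Hypothetical helper: the name and the default batch size are
    illustrative, not taken from the original answer.
    """
    # Square brackets let Access accept table names that contain spaces.
    cursor.execute("SELECT * FROM [%s]" % tbl)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # An empty list means the result set is exhausted.
            break
        for row in batch:
            yield row
```

At most batch_size rows are held in memory at once, so the batch size trades memory use against the number of fetch calls.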

【Comments】:

Awesome, this is much more elegant. I will replace the last line with handle.write and dump to CSV or TXT. Thanks!
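Following up on the comment above, here is a minimal sketch of dumping one table to a CSV file with the standard csv module instead of hand-written handle.write calls. It assumes Python 3 and a DB-API cursor such as the one pyodbc provides. The name dump_table_to_csv and the csv_path parameter are hypothetical.

```python
import csv


def dump_table_to_csv(cursor, tbl, csv_path):
    """Stream every row of one table into a CSV file.

    Hypothetical helper: the name and the csv_path parameter are
    illustrative, not taken from the original thread.
    """
    # Square brackets let Access accept table names that contain spaces.
    cursor.execute("SELECT * FROM [%s]" % tbl)
    # newline="" stops the csv module from writing blank lines on Windows.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # cursor.description has one entry per column; item 0 is the name.
        writer.writerow([col[0] for col in cursor.description])
        # iter(callable, sentinel) calls fetchone() until it returns None.
        for row in iter(cursor.fetchone, None):
            writer.writerow(row)
```

The csv writer handles quoting of commas and embedded newlines, which matters if the files are later fed to an import tool.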