Neo4j/Py2Neo timeout issue when importing large CSV files: answers

【Question title】: Neo4j/Py2Neo timeout issue when importing large CSV files
【Posted】: 2019-05-31 10:18:04
【Question description】:

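When importing data from a large CSV file (>200 MB) into Neo4j, the response hangs. The query completes and all records are imported, but there seems to be some kind of response timeout, so nothing indicates that the import query has finished. This is a problem because we cannot automate importing multiple files into Neo4j: the script keeps waiting for the query to finish even though it already has.

Importing one file takes about 10-15 minutes.

No errors are thrown anywhere in the pipeline; everything just hangs. The only way I can tell that the process has finished is when CPU activity on the VM stops.

The process works for smaller files: an acknowledgement is sent back when the previous file has finished importing, and the script then moves on to the next one.

I have tried running the script from a Jupyter notebook and as a Python script directly on the console. I have even tried running the query directly on Neo4j through the browser console. Every approach results in a hanging query, so I am not sure whether the problem comes from Neo4j or Py2Neo.

Example query: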

USING PERIODIC COMMIT 1000
LOAD CSV FROM {csvfile}  AS line
MERGE (:Author { authorid: line[0], name: line[1] } )

Modified Python script using Py2Neo:

from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(account_name="<name>",account_key="<key>")
generator = blob_service.list_blobs("parsed-csv-files")

for blob in generator:
    print(blob.name)
    csv_file_base = "http://<base_uri>/parsed-csv-files/"
    csvfile = csv_file_base + blob.name
    params = { "csvfile":csvfile }
    mygraph.run(query, parameters=params )

Neo4j's debug.log does not seem to record any errors.

Sample from debug.log:

2019-05-30 05:44:32.022+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16 Number of pages visited: 598507, Number of cleaned crashed pointers: 0, Time spent: 2m 25s 235ms
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job started: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19
2019-05-30 05:44:57.126+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19 Number of pages visited: 96042, Number of cleaned crashed pointers: 0, Time spent: 25s 55ms
2019-05-30 05:44:57.127+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19

Edit: I switched to a simpler query, but the same problem still occurs.

【Comments】:

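Neo4j hangs when it has used up all the memory allocated to it. You can increase the maximum heap memory in neo4j.conf and restart Neo4j.
Also create indexes on :Paper(paperid) and :Keyword(name) to speed up the query.
It is not recommended to create all nodes and relationships in a single query, as you are doing. You can split the query into 2 or 3 queries that load nodes and relationships separately.
Hi Raj, thanks for your reply. We have already tried increasing the maximum heap memory and will try again if needed. However, the query does finish importing all the records; it is only the response that seems to be the problem. If I manually stop the Python script after the query has completed and run it again with the next file, Neo4j starts running the new query.
@AndrewCachia How did you solve this?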
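
The index suggestion in the comments above could be sketched roughly as follows. This is a minimal sketch, assuming the Neo4j 3.x index syntax `CREATE INDEX ON :Label(property)` (newer Neo4j versions use a different form) and a hypothetical `run` callable that accepts a Cypher string, such as py2neo's `Graph.run`. The helper names `index_statement` and `create_indexes` are illustrative and not part of any library.

```python
def index_statement(label, prop):
    """Build a Neo4j 3.x style index statement for one label/property pair."""
    return "CREATE INDEX ON :{}({})".format(label, prop)


def create_indexes(run, pairs):
    """Call run(statement) once per (label, property) pair; return the statements."""
    statements = [index_statement(label, prop) for label, prop in pairs]
    for statement in statements:
        run(statement)
    return statements


# The label/property pairs named in the comment above.
INDEX_PAIRS = [("Paper", "paperid"), ("Keyword", "name")]

if __name__ == "__main__":
    # Dry run: print the statements instead of sending them to a database.
    create_indexes(print, INDEX_PAIRS)
```

With the question's py2neo setup, something like `create_indexes(mygraph.run, INDEX_PAIRS)` would presumably be run once, before the first import, so that later `MERGE` lookups on those properties can use the indexes.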

Tags: python neo4j py2neo

【Solution 1】:

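Because the query takes a long time to complete on the database side, py2neo may be having trouble waiting for it.

Periodic commit should not be causing any problems.

Have you tried the Python neo4j driver, reading the CSV from Python and executing the queries that way?

Here is sample code using the neo4j driver.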

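import pandas as pd
from neo4j import GraphDatabase

# serveruri, user, pwd and config are assumed to be defined elsewhere.
driver = GraphDatabase.driver(serveruri, auth=(user, pwd))
with driver.session() as session:
    file = config['spins_file']
    # Read the CSV in chunks so that each chunk is sent as its own query.
    # Assumes a headerless two-column file, as in the question's query.
    row_chunks = pd.read_csv(file, sep=',', header=None,
                             names=['authorid', 'name'],
                             on_bad_lines='skip',  # pandas < 1.3: error_bad_lines=False
                             index_col=False,
                             low_memory=False,
                             chunksize=config['chunk_size'])
    for i, rows in enumerate(row_chunks):
        print("Chunk {}".format(i))
        rows_list = rows.fillna(value="").to_dict('records')
        # Pass the chunk as the $rows parameter; one MERGE per row.
        session.run("""
                    UNWIND $rows AS row
                    MERGE (:Author { authorid: row.authorid, name: row.name })
                    """,
                    rows=rows_list)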
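
The batching idea in this answer can also be sketched with only the standard library, which makes the chunking logic easy to check without a database. The assumptions here are: a headerless two-column CSV (author id, name) as in the question; `run` is a hypothetical callable taking a Cypher string and a parameter dict, for example a thin wrapper around `session.run`; and the names `iter_batches` and `import_authors` are illustrative only.

```python
import csv
import io

AUTHOR_QUERY = """
UNWIND $rows AS row
MERGE (:Author { authorid: row.authorid, name: row.name })
"""


def iter_batches(lines, batch_size):
    """Yield lists of {'authorid', 'name'} dicts, at most batch_size per list."""
    batch = []
    for record in csv.reader(lines):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        batch.append({"authorid": record[0], "name": record[1]})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def import_authors(run, lines, batch_size=1000):
    """Send one query per batch through run(query, params); return rows sent."""
    total = 0
    for batch in iter_batches(lines, batch_size):
        run(AUTHOR_QUERY, {"rows": batch})
        total += len(batch)
    return total


if __name__ == "__main__":
    sample = io.StringIO("1,Ada\n2,Grace\n3,Edsger\n")
    # Dry run: report each batch size instead of contacting Neo4j.
    import_authors(lambda query, params: print(len(params["rows"])),
                   sample, batch_size=2)
```

With a real session, a call along the lines of `import_authors(lambda q, p: session.run(q, p), open(path, newline=""))` would send each batch as its own short query, so the script regains control between batches rather than waiting on one long-running statement.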

【Comments】: