First of all, 1 GB is not huge - almost any modern device can keep it in working memory. Second, pandas doesn't let you browse through a CSV file; you can only tell it how much data to "load" - if you want to do more advanced CSV handling, I'd suggest using the built-in csv module.

Unfortunately, the csv module's reader() will produce an exhaustible iterator over your file, so you cannot just build it into a simple loop and wait for the next lines to become available - you have to collect the new lines manually and then feed them to it to achieve the effect you want, e.g.:
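For example, the csv module together with itertools lets you skim an arbitrary window of rows from a large file without loading the rest into memory (a minimal sketch; read_window and the row range are illustrative names, not from the question):

```python
import csv
import itertools

def read_window(path, start, stop):
    """Yield CSV rows [start, stop), counted after the header,
    without reading the rest of the file into memory."""
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        yield from itertools.islice(reader, start, stop)

# usage sketch: rows 5-9 of a large file
# for row in read_window("path/to/your/file.csv", 5, 10):
#     print(row)
```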
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Watch out for edge cases that can break this process - for example, if you attempt to read new lines while they are being added, some data may get lost/split (depending on the flushing mechanism used for the appends), and if you delete earlier lines the reader can get corrupted, etc. If at all possible, I'd suggest controlling the CSV writing process so that it explicitly notifies your processing routine.

UPDATE: The above processes the CSV file line by line - it is never loaded wholly into working memory. The only part that actually loads more than one line into memory is when an update to the file occurs, where it picks up all the new lines at once, because processing them that way is faster; unless you expect millions of rows of updates between two checks, the memory impact is negligible. However, if you want that part processed line by line as well, here's how to do it:
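If you do control the writer, one simple safeguard is to append each row as a whole and flush it to disk immediately, so a concurrent reader never observes a half-written line (a hedged sketch; append_row is an illustrative helper, not part of the question's code):

```python
import csv
import os

def append_row(path, row):
    """Append one complete CSV row and force it to disk so a
    concurrent reader never sees a partially written line."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # push the OS buffer to disk
```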
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again