How to optimize reading in and working over a large file? Answers

【Question title】: How to optimize reading in and working over a large file?
【Posted】: 2016-03-28 01:18:49
【Question description】:

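I have a script that does some poor man's caching: data returned from an API is stored in a flat file as JSON objects, one result/JSON object per line.

The caching workflow is as follows:

Read in the entire cache file -> check each line, one by one, to see whether it is too old -> save the ones that are not too old to a new list -> print the new, fresh cache list to a file, and also use the new list as a filter so that incoming data from the API call is not processed again.

By far the longest part of this process is reading the file and checking each line for staleness. The code is as follows: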

import json
from dateutil import parser

# meta_dict, cache_timeout and fresh_cache are assumed to be defined earlier in the script.
print("Reading cache file into memory ---")
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print("Turning cache lines into json and checking if they are stale or not ---")
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print(e)
        continue  # skip lines that are not valid JSON instead of reusing the previous json_line

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

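This can take several minutes, depending on the size of the file. Is there a faster way to do this? I know that reading in the whole file up front is not ideal, but it was the easiest to implement.

【Question discussion】:

Tags: python performance file caching

【Solution 1】:

Depending on the size of your file, reading it all at once may cause memory problems. I don't know whether that is the kind of problem you are running into. The previous code could be rewritten like this, so that only one line is held in memory at a time: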

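import json
from dateutil import parser

# Reference time that every cached entry is compared against
reference_time = meta_dict['timestamp_start']
with open('cache', 'r') as f:
    # Iterate over the file lazily instead of loading every line into memory
    for line in f:
        entry = json.loads(line)
        if reference_time - parser.parse(entry['timestamp_start']) <= cache_timeout:
            fresh_cache.append(entry)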

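Also:

Note that if you use dateutil to parse the dates, each call can be expensive. If your format is known, you may want to use the standard conversion tools provided by datetime or dateutil instead.
If your file is really large and fresh_cache would have to be really large as well, you can write the fresh entries to an intermediate file using another with statement.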
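
The intermediate-file idea can be sketched roughly as follows. This is only an illustration under assumptions: the function name prune_cache and its parameters are hypothetical, the timestamp format is the one quoted in the comments of this thread, and replacing the original cache file at the end is a design choice the answer does not spell out.

```python
import json
import os
import tempfile
from datetime import datetime, timedelta

# Assumed timestamp format (the one quoted in the comments of this thread)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def prune_cache(path, reference_time, cache_timeout):
    """Hypothetical helper: keep only fresh lines of `path`, return how many were kept."""
    kept = 0
    dir_name = os.path.dirname(os.path.abspath(path))
    # Stream the old cache and write fresh lines to a temporary file in the
    # same directory, so neither file has to fit in memory.
    with open(path, "r") as src, \
            tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            started = datetime.strptime(entry["timestamp_start"], TS_FORMAT)
            if reference_time - started <= cache_timeout:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    # Swap the pruned file into place once both files are closed.
    os.replace(dst.name, path)
    return kept
```

A call might look like prune_cache('cache', meta_dict['timestamp_start'], cache_timeout), assuming meta_dict['timestamp_start'] is a datetime and cache_timeout is a timedelta, as the comparison in the question suggests.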

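【Discussion】:

Thanks for the input. I was hoping for some black magic, but it looks like I'm out of luck. I'll try avoiding the parser on every call and see if that helps.
You could also try the simplejson library, which is faster than the standard json library...
Feedback: 1. simplejson had almost no effect. 2. Extracting the datetime manually worked well. It cut the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")
@Thisisstackoverflow, I wonder whether you could identify the timestamp portion with a regular expression. That would avoid the JSON parsing and might improve speed. I'm not sure how much you would gain.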
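
The regular-expression suggestion in the last comment might look something like the sketch below. It is untested against the asker's data and rests on assumptions: the key "timestamp_start" appears once per line, its value is a plain string in the format quoted above, and the names TS_PATTERN and is_fresh are made up for illustration. Whether it is actually faster would have to be measured.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp format (the one quoted in the feedback comment)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Assumes the key occurs once per line and its value is a simple quoted string
TS_PATTERN = re.compile(r'"timestamp_start"\s*:\s*"([^"]+)"')


def is_fresh(line, reference_time, cache_timeout):
    """Hypothetical helper: decide freshness without parsing the whole JSON line."""
    match = TS_PATTERN.search(line)
    if match is None:
        return False  # no timestamp found, treat the line as stale
    started = datetime.strptime(match.group(1), TS_FORMAT)
    return reference_time - started <= cache_timeout
```

Since fresh_cache holds parsed objects, json.loads would still be needed for the lines that pass this check, so any saving comes only from the stale lines that are skipped.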

【Solution 2】:

Feedback: 1. simplejson had almost no effect. 2. Extracting the datetime manually worked well. It cut the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")

【Discussion】:
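
For readers who want to see the replacement in isolation, here is a minimal sketch of the fixed-format parse. The helper name parse_timestamp is hypothetical; the format string is the one given above. The likely reason for the speed-up is that strptime is told the exact layout, while dateutil's parser has to work the layout out on every call.

```python
from datetime import datetime, timedelta

# Format string taken from the feedback above
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(text):
    """Hypothetical helper: parse a timestamp whose layout is known in advance."""
    return datetime.strptime(text, TS_FORMAT)


def is_stale(timestamp_text, reference_time, cache_timeout):
    """Same comparison as in the question, using the fixed-format parse."""
    return reference_time - parse_timestamp(timestamp_text) > cache_timeout
```

Note that strptime raises ValueError for any string that does not match the format exactly, so this only suits a cache whose timestamps are all written the same way.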