如何在linux中只处理新的（未处理的）文件答案

【问题标题】：How to process only new (unprocessed) files in linux如何在linux中只处理新的（未处理的）文件
【发布时间】：2014-05-12 19:18:48
【问题描述】：

给定一个包含大量小文件 (>1 mio) 的目录，有什么方法可以快速记住哪些文件已被处理（用于数据库导入）。

我尝试的第一个解决方案是 bash 脚本：

#find all gz files
for f in $(find $rawdatapath -name '*.gz'); do
    filename=`basename $f`

    #check whether the filename is already contained in the process list
    onlist=`grep $filename $processed_files`
    if [[ -z $onlist ]]
        then
            echo "processing, new: $filename"
            #unzip file and import into mongodb

            #write filename into processed list
            echo $filename #>> $processed_files
    fi
done

对于较小的样本（160k 文件），这运行了大约 8 分钟（没有任何处理）

接下来我尝试了一个python脚本：

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
  for root, dirs, files in os.walk(path):
      for file in files:
          if file.endswith(".gz"):
              if file not in processed_files:
                  pff.write("%s\n" % file)

这运行不到 2 分钟。

我忽略了一种明显更快的方法吗？

其他解决方案：

由于我使用 s3sync 下载新文件，将处理后的文件移动到不同的位置并不方便
由于文件名称中包含时间戳，我可能会考虑依赖按顺序处理它们，并且仅将名称与“最后处理”日期进行比较
或者，我可以跟踪上次运行处理的时间，并且只处理此后修改过的文件。

【问题讨论】：

跟踪时间可能是最快的

标签： python linux bash

【解决方案1】：

只使用一个集合：

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)

【讨论】：

很想看看这如何加快速度
太棒了 - 这将它从 120 秒降低到 0.5 秒

【解决方案2】：

使用标准命令行实用程序的替代方法：

只需diff 一个包含所有个文件列表的文件和一个包含已处理文件列表的文件。

很容易尝试，而且应该很快。

如果您在列表中包含完整的时间戳，您也可以通过这种方式获取“已更改”的文件。

【讨论】：

【解决方案3】：

如果文件在处理后没有被修改，一种选择是记住最新处理的文件，然后使用find的-newer选项检索尚未处理的文件.

find $rawdatapath -name '*.gz' -newer $(<latest_file) -exec process.sh {} \;

process.sh 的样子

#!/bin/env bash 
echo "processing, new: $1"
#unzip file and import into mongodb 
echo $1 > latest_file

这是未经测试的。在考虑实施此策略之前，请注意不必要的副作用。

如果可以接受 hacky/quick'n'dirty 解决方案，一个时髦的替代方法是在文件权限中编码状态（已处理或未处理），例如在组读取权限位中.假设您的umask 是022，因此任何新创建的文件都具有644 的权限，处理文件后将权限更改为600 并使用find 的-perm 选项检索not-yet-处理过的文件。

find $rawdatapath -name '*.gz' -perm 644 -exec process.sh {} \;

process.sh 的样子

#!/bin/env bash 
echo "processing, new: $1"
#unzip file and import into mongodb 
chmod 600 $1

同样，这是未经测试的。在考虑实施此策略之前，请注意不必要的副作用。

【讨论】：

我喜欢它的创意，感谢您的想法！