Optimizing grep (or using AWK) in a shell script: answers

【Question title】: Optimizing grep (or using AWK) in a shell script
【Posted】: 2010-05-12 17:16:01
【Question】:

In my shell script, I search the same $targetfile over and over, once for each term found in $sourcefile.

My $sourcefile is formatted as follows:

pattern1
pattern2
etc...

The inefficient loop I currently use for the search is:

for line in $(< $sourcefile);do
    fgrep $line $targetfile | fgrep "RID" >> $outputfile
done

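I understand this could be improved, either by loading the whole $targetfile into memory or by using AWK?

Thanks

【Question discussion】:

Couldn't you just join the lines of the source file and egrep for (pattern1|pattern2...)?
Good idea... but that would mean an egrep with 4000 alternatives... and the pattern would change with the number of lines in the source file.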
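
For reference, the joined-pattern idea from the comment above could be sketched roughly like this. The function name `build_alternation` and the sample files are made up for illustration. Note that this treats every source line as an extended regular expression rather than a fixed string, so it is not a drop-in replacement for the fgrep loop.

```shell
#!/bin/sh
# Sketch only: build one alternation regex from the pattern file, then scan
# the target file once. File names and sample contents are hypothetical.

# build_alternation SOURCEFILE: join all lines of SOURCEFILE with "|".
build_alternation() {
    paste -s -d '|' "$1"
}

workdir=$(mktemp -d)
printf '%s\n' 'pattern1' 'pattern2' > "$workdir/source.txt"
printf '%s\n' 'has pattern1' 'nothing here' 'has pattern2' > "$workdir/target.txt"

regex=$(build_alternation "$workdir/source.txt")
printf 'regex: %s\n' "$regex"

# grep -E is the modern spelling of egrep.
grep -E "$regex" "$workdir/target.txt"

rm -rf "$workdir"
```

Because the regex is rebuilt from the file on every run, it does not matter that the number of source lines changes.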

Tags: shell awk grep

【Solution 1】:

Am I missing something, or why not just use fgrep -f "$sourcefile" "$targetfile"?
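
A minimal sketch of how this could replace the whole loop from the question, including the second filter on "RID". The function name `filter_rid` and the sample data are assumptions for illustration; `grep -F` is the modern spelling of `fgrep`.

```shell
#!/bin/sh
# Sketch only: one pass over the target file with all patterns at once,
# then keep the lines that also contain "RID".
# File names and sample contents are hypothetical.

# filter_rid SOURCEFILE TARGETFILE: print the lines of TARGETFILE that contain
# at least one fixed-string pattern from SOURCEFILE and also contain "RID".
filter_rid() {
    grep -F -f "$1" "$2" | grep -F "RID"
}

workdir=$(mktemp -d)
printf '%s\n' 'pattern1' 'pattern2' > "$workdir/source.txt"
printf '%s\n' 'RID=1 pattern1' 'pattern2 only' 'RID=2 other' 'RID=3 pattern2' \
    > "$workdir/target.txt"

filter_rid "$workdir/source.txt" "$workdir/target.txt"

rm -rf "$workdir"
```

Two differences from the original loop are worth knowing. Output comes out in target-file order instead of grouped by pattern, and a line that matches several patterns is printed once rather than once per pattern. Also, an empty line in the pattern file matches every line of the target, so it may be worth removing blank lines from $sourcefile first.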

【Discussion】:

Wow! This is faster than the other two. The results also appear to be correct. I mean, lightning fast. Awesome!

【Solution 2】:

A sed solution:

sed 's/^\(.*\)$/\/\1\/p/' "$sourcefile" | sed -nf - "$targetfile"

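This converts each line of $sourcefile into a sed pattern-matching command, turning:

matchstring

into

/matchstring/p

However, you will need to escape special characters to make this robust.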
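
One possible way to do that escaping is sketched below, assuming the patterns should be matched literally. The function name `make_sed_script` and the sample data are hypothetical. It backslash-escapes the characters that are special in a basic regular expression, plus the `/` delimiter, before wrapping each line in `/.../p`.

```shell
#!/bin/sh
# Sketch only: generate a sed script from the pattern file, escaping the
# characters that are special in a basic regular expression (and "/").
# File names and sample contents are hypothetical.

# make_sed_script SOURCEFILE: print one "/escaped-pattern/p" command per line.
make_sed_script() {
    # "|" is used as the s-command delimiter so "/" can sit in the bracket list.
    sed -e 's|[][\\/.*^$]|\\&|g' -e 's|.*|/&/p|' "$1"
}

workdir=$(mktemp -d)
printf '%s\n' 'a.b' 'x/y' > "$workdir/source.txt"
printf '%s\n' 'RID a.b' 'RID aXb' 'path x/y' > "$workdir/target.txt"

# Write the generated script to a file and run it against the target.
make_sed_script "$workdir/source.txt" > "$workdir/script.sed"
sed -n -f "$workdir/script.sed" "$workdir/target.txt"

rm -rf "$workdir"
```

With the escaping in place, `a.b` no longer matches `aXb`. Keep in mind that a target line matching several patterns is printed once per matching pattern with this approach.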

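【Discussion】:

Thanks! Trying this now. It already seems faster than using grep, although the source file has about 4000 lines and I am searching a 300 MB target file, so I expect it will still take a while. Let's see what happens.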

【Solution 3】:

Use awk to read the source file first and then search the target file (untested):

nawk '
    NR == FNR {patterns[$0]++; next}
    /RID/ {
        for (pattern in patterns) {
            # since fgrep considers patterns as strings not regular expressions, 
            # use string lookup and not pattern matching ("~" operator).
            if (index($0, pattern) > 0) {
                print
                break
            }
        }
    }
' "$sourcefile" "$targetfile" > "$outputfile"

This also works with gawk.

【Discussion】:

Thanks for the suggestion, I will try this one too.
Very fast, but the suggested fgrep -f fits my needs better.