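Linux shell script to count occurrences of a character sequence in a text file? Answers

【Question title】: Linux shell script to count occurrences of a char sequence in a text file?
【Posted】: 2009-10-30 21:57:02
【Question description】:

I have a large text file (over 70 MB) and need to count how many times a character sequence occurs in it. I can find plenty of scripts that do this, but none of them allow for the sequence starting on one line and ending on another. For efficiency (I actually have more than one file to process), I can't preprocess the files to remove the newlines.

Example: if I am searching for "thisIsTheSequence", the following file has 3 matches: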

asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

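Thanks for your help.

【Question discussion】:

You can preprocess the file, just do it in the pipeline ahead of the counting script: strip-newlines | count-matches.

Tags: linux shell

【Solution 1】:

One option: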

echo $((`tr -d "\n" < file | sed 's/thisIsTheSequence/\n/g' | wc -l` - 1))

There are probably more efficient ways to do this with utilities outside the shell core, especially if the file fits into memory.
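
A variant of the same pipeline, as a minimal sketch: grep -o prints each match on its own line, so wc -l gives the count directly and no correction term is needed. The function name count_matches and the temporary sample file are assumptions made for illustration; -F makes grep treat the sequence as a literal string. Whether the "- 1" in the one-liner above is right depends on whether the sed in use emits a final newline for input that has none, so either version is worth checking against a small file with a known count, such as the 3-match sample from the question.

```shell
#!/bin/bash
# count_matches is a hypothetical helper name, not part of the original answer.
# Usage: count_matches SEQUENCE FILE
count_matches() {
  # Remove every newline, print each literal match on its own line, count lines.
  tr -d '\n' < "$2" | grep -o -F -e "$1" | wc -l
}

# Try it on the four-line sample from the question.
sample=$(mktemp)
printf '%s\n' 'asdasdthisIsTheSequence' 'asdasdasthisIsT' \
  'heSequenceasdasdthisIsTheSequ' 'encesadasdasda' > "$sample"
count_matches thisIsTheSequence "$sample"
rm -f "$sample"
```

Like the original, this still reads the whole file as one long line, so it trades memory for simplicity.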

【Discussion】:

【Solution 2】:

A single awk script is enough, since you are processing a huge file. Chaining several pipes will slow things down.

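#!/bin/bash
# Count matches that may span line breaks by joining lines into a buffer.
awk 'BEGIN{
 search="thisIsTheSequence"   # gsub treats this as a regular expression
 total=0
}
# Every 10th line: count and strip the matches accumulated so far.
NR%10==0{
  c=gsub(search,"",s)
  total+=c
}
# Append the current line to the buffer (awk has already dropped the newline).
{ s=s $0 }
END{
 # Count whatever matches are still in the buffer.
 c=gsub(search,"",s)
 print "total count: " total+c
}' file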

Output

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9
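
The script above keeps appending to its buffer, so the buffer grows with the file. A possible variation, sketched here under stated assumptions rather than taken from the answer: carry over only the last length-minus-one characters of unmatched text from one line to the next, which is all a match split across lines can need. The name count_seq is hypothetical, the sequence is assumed to be non-empty and free of backslashes, and index() is used so the sequence is matched literally instead of as a regular expression.

```shell
#!/bin/bash
# count_seq is a hypothetical helper name. Usage: count_seq SEQUENCE FILE
# Assumes SEQUENCE is non-empty and contains no backslashes
# (awk -v interprets escape sequences in the assigned value).
count_seq() {
  [ -n "$1" ] || return 1
  awk -v search="$1" '
    BEGIN { n = length(search) }
    {
      # Join the carried-over tail with the current line (newline already stripped).
      buf = tail $0
      # index() searches for a literal string, so no regex escaping is needed.
      while ((i = index(buf, search)) > 0) {
        total++
        buf = substr(buf, i + n)   # continue after this match
      }
      # Carry over at most n-1 characters: enough to complete a split match.
      if (length(buf) > n - 1) buf = substr(buf, length(buf) - n + 2)
      tail = buf
    }
    END { print total + 0 }
  ' "$2"
}
```

Calling count_seq thisIsTheSequence file prints the bare count. Matches are counted left to right without overlap, the same way gsub counts them.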

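【Discussion】:

【Solution 3】:

Can your sequence span more than one newline?

If not, one solution is to split the sequence in half and search for both halves (for example, search for "thisIsTh" and "eSequence"), then go back to the hits and "take a closer look", i.e. strip the newlines in that region and check whether it is a match.

Basically this is a quick "filter" over the data to find the interesting parts.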
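
As a rough sketch of the first, filtering pass only: split the sequence in the middle and let grep list the numbers of the lines that contain either half. The function name candidate_lines is hypothetical, the sequence is assumed to be at least two characters long, and the "closer look" (joining each candidate line with its neighbours and checking for a full match) is left out.

```shell
#!/bin/bash
# candidate_lines is a hypothetical helper name. Usage: candidate_lines SEQUENCE FILE
# Prints the numbers of the lines that contain either half of SEQUENCE.
candidate_lines() {
  local seq=$1 file=$2
  local half=$(( ${#seq} / 2 ))
  local first=${seq:0:half}    # "thisIsTh" for thisIsTheSequence
  local second=${seq:half}     # "eSequence" for thisIsTheSequence
  # -F: literal strings, -n: prefix each hit with its line number.
  grep -n -F -e "$first" -e "$second" "$file" | cut -d: -f1
}
```

On the four-line sample from the question this reports lines 1 and 3. Line 3 holds a complete half of both split matches, so only the regions around those lines would need the more expensive check.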

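【Discussion】:

No, the sequence is 9 characters long. Lines shorter than 9 characters are irrelevant to the search.
In that case you can search for the two halves of the sequence. If it is split across two lines, you will find at least one of the halves. This is basically a filtering technique that works well (fast) if a half is fairly rare on its own. It is a bit laborious to implement, though.

【Solution 4】:

Use something like: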

head -n LL filename | tail -n YY | grep text | wc -l

where LL is the last line of the sequence and YY is the number of lines in the sequence (i.e. LL - first line)

【Discussion】: