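Regex for word repetition: answers

【Question Title】: regex for word repetition
【Posted】: 2015-02-06 00:50:45
【Question】:

I need a regular expression for sed (please, sed only) that helps me determine whether some word appears 3 times in a line, and if so prints that line...

Let's say this is the file: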

abc abc gh abc
abcabc abc
 ab ab cd ab xx ab
ababab cc ababab
abab abab cd abab

So the output should be:

abc abc gh abc
 ab ab cd ab xx ab
abab abab cd abab

This is what I'm trying:

sed -n '/\([^ ]\+\)[ ]+\1\1\1/p' $1

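It doesn't work... :/ What am I doing wrong?

It doesn't matter whether the word is at the start of the line, and the occurrences don't need to be consecutive.

【Question Discussion】:

It looks like you have a lot of homework... You already asked how to compare first word in a line with the last word using sed? Didn't you use Avinash's answer to come up with a better attempt?
I don't understand what you're asking @fedorqui
The repeated word doesn't have to be the first word in the line either, right?
It doesn't have to be first @anubhava
grep -E '(\b\w+\b)(.*?\b\1\b){2}' file will give you everything you need.

Tags: regex unix sed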

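【Solution 1】:

You need to add .* between the \1 back-references, so that other text is allowed between the occurrences of the word.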

$ sed -n '/\b\([^ ]\+\)\b.*\b\1\b.*\b\1\b/p' file
abc abc gh abc
 ab ab cd ab xx ab
abab abab cd abab

I'm assuming your input contains only spaces and word characters.
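To make the difference concrete, here is a small sketch that runs both the asker's original pattern and the pattern above against the sample file from the question. It assumes GNU sed, since `\+` and `\b` are GNU extensions; the variable names `input`, `original`, and `fixed` are only for illustration. In the original pattern, the unescaped `+` is a literal plus sign in basic regular expressions, and `\1\1\1` demands three adjacent copies of the word, so nothing in the sample should match.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' 'abc abc gh abc' 'abcabc abc' ' ab ab cd ab xx ab' \
    'ababab cc ababab' 'abab abab cd abab' > "$input"

# Original attempt: a word, a space, a literal "+", then three adjacent copies.
original=$(sed -n '/\([^ ]\+\)[ ]+\1\1\1/p' "$input")

# Pattern from this answer: .* lets other text sit between the occurrences,
# and \b keeps each occurrence a whole word.
fixed=$(sed -n '/\b\([^ ]\+\)\b.*\b\1\b.*\b\1\b/p' "$input")

printf 'original matched: [%s]\n' "$original"
printf 'fixed matched:\n%s\n' "$fixed"
rm -f "$input"
```

With GNU sed the first pattern should print nothing, while the second should print the three lines listed in the question.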

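【Discussion】:

I don't really understand the \b syntax... my teacher didn't explain it, and it looks like it would make things shorter. Could you explain it?
\b matches the boundary between a word character and a non-word character. Word characters are A-Z, a-z, 0-9, and _. Any other character is a non-word character.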
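As a quick illustration of that explanation, the sketch below replaces `abc` with `X` in one of the question's sample lines, once without and once with `\b`. It again assumes GNU sed, where `\b` is available; the variable names are made up for the example.

```shell
# "abc" stands alone as a word only at the end of this line.
line='abcabc abc'

# Without \b, the pattern also matches inside "abcabc".
loose=$(printf '%s\n' "$line" | sed 's/abc/X/g')

# With \b on both sides, only the free-standing word matches.
strict=$(printf '%s\n' "$line" | sed 's/\babc\b/X/g')

printf '%s\n%s\n' "$loose" "$strict"
```

The first substitution should turn the line into `XX X`, while the second should leave `abcabc` untouched and give `abcabc X`. That is why the answer's pattern does not count `abcabc` as two occurrences of `abc`.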

【Solution 2】:

I know the question asks for sed, but every system I've seen that has sed also has awk, so here is an awk solution:

awk -F"[^[:alnum:]]" '{delete a;for (i=1;i<=NF;i++) a[$i]++;for (i in a) if (a[i]>2) {print $0;next}}' file
abc abc gh abc
 ab ab cd ab xx ab
abab abab cd abab

This may be easier to understand than the regex solution. Here is the same command, broken down:

# Set the field separator to any non-alphanumeric character.
awk -F"[^[:alnum:]]" '{
    delete a                # Clear array "a" at the start of each line
    for (i=1;i<=NF;i++)     # Loop through the fields one by one
        a[$i]++             # Count the occurrences of each word in array "a"
    for (i in a)            # Loop through array "a"
        if (a[i]>2) {       # If a word occurs more than two times:
            print $0        # print the line
            next            # and skip to the next line, so it is not printed twice if another word also occurs three times
        }
}' file                     # Read the file
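One thing worth checking before relying on this: because the separator is a single non-alphanumeric character, runs of spaces produce empty fields, and those are counted like any other word. The sketch below compares the command as given with a variant that skips empty fields. The variant is my own adjustment, not part of the answer, and the input line is hypothetical. It assumes an awk that supports POSIX character classes and `delete` on a whole array (gawk, for example).

```shell
# Hypothetical input: no word repeats, but the runs of spaces
# create several empty fields.
line='x   y   z'

# As in the answer: empty fields are counted like any other word.
plain=$(printf '%s\n' "$line" | awk -F"[^[:alnum:]]" \
    '{delete a; for (i=1;i<=NF;i++) a[$i]++; for (i in a) if (a[i]>2) {print $0; next}}')

# Variant (an assumption, not from the answer): skip empty fields before counting.
guarded=$(printf '%s\n' "$line" | awk -F"[^[:alnum:]]" \
    '{delete a; for (i=1;i<=NF;i++) if ($i != "") a[$i]++; for (i in a) if (a[i]>2) {print $0; next}}')

printf 'plain:   [%s]\nguarded: [%s]\n' "$plain" "$guarded"
```

On this input the unguarded command should print the line even though no word repeats, and the guarded variant should print nothing. For the sample file in the question both behave the same.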

【Discussion】: