Find all files that contain all words/lines in another file

【Question title】: Find all files that contain all words/lines in another file
【Posted】: 2014-07-29 00:41:44
【Question】:

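I'll say up front that this is a homework problem, but I feel I have exhausted my online searches for anything related to solving it, or I'm just not wording my Google/Stack Overflow searches correctly.

The problem starts like this: the file words contains a list of words. Each word is on a separate line. The files story1, story2, ..., story100 are short stories.

It is a multi-part problem, but the last part stumps me: find the story files that contain all of the words in the file words.

There was a similar earlier part: find the story files that contain at least one word from the file words (print the file names).

I solved that one with grep: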

grep -l -f words story*

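I was under the impression that I would have to use grep for the last part as well, but I can't find a grep option, or anything else, that returns only the files matching everything in the pattern file. It looks like I may have to write a shell script for this, but I'm not sure where to start, or whether I even need grep for it. Any pointers on how to approach this?

Thanks!

Edit:

These are the correct answers from the solution given by the instructor.

The part before the main problem: grep -l -f words story*

Main problem: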

for story in `ls story*`
do
    (( match = 0 ))

    for word in `cat words`
    do
        if [ `grep -l $word $story` ]
        then
            (( match++ ))
        else
            break
        fi
    done

    if [ $match -eq `wc -w < words` ]
    then
        echo $story
    fi
done

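Thank you all for your thoughtful comments and answers, and sorry I'm a bit late getting back.

【Comments on the question】:

I can't think of a way to do this with grep alone that doesn't involve eval/evil. But you can loop over the files and read each one line by line, and if you find a line that isn't present print nothing, otherwise print the file name once the loop completes. Also, just a suggestion, but you might want to look at brace expansion {1..#}, because while your solution to the other question technically covers story1..100, it could also catch the wrong files.
Thanks for the tips, BroSlow, I'll give it a try! Haha, I hadn't caught that, thanks for pointing it out too! I'm glad to actually have a way to tackle this problem, it had me a bit overwhelmed.
grep -l -f words story* doesn't do what you think it does. It tells grep to look in story* for text matching the regular expressions contained in words. The most obvious problem is that if words contains the word the and one of the story files contains the word then, grep will report that file as a match, since the RE the matches the start of then. You need a tool that at least provides word boundaries, e.g. GNU awk.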
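
To make the point in the last comment concrete, here is a minimal sketch that is not part of the original thread. It assumes a grep that supports -w (GNU and BSD grep do, but POSIX does not require it) and that the entries in words are plain words rather than regular expressions. The function name list_stories_with_any_word is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the files that contain at least one entry of the
# word list as a whole word.
#   -F  treat each line of the list as a fixed string, not a regex
#   -w  match whole words only, so "the" does not match inside "then"
#   -l  print only the names of the matching files
list_stories_with_any_word() {
    local wordfile=$1
    shift
    grep -l -w -F -f "$wordfile" -- "$@"
}
```

It would be called as list_stories_with_any_word words story*. With the in words, a story that only contains then is no longer reported.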

Tags: linux bash awk sed grep

【Solution 1】:

# wcheck: finds story* files that contain all words in words file

# for each file named story... (in this directory)
for file in story*
do
    stGood=0  # story is initialized as containing all words (0 = true)

    ## for each word (one per line) in the words file
    while IFS= read -r word
    do
        ## test grep's exit status for the existence of the word;
        ## -F is a fixed-string (substring) match, add -w for whole words
        if ! grep -q -F -- "$word" "$file"
        then
            stGood=1  # word not found: story is set to false
            break
        fi
    done < words

    ## if story is still true then the filename is printed
    if [ "$stGood" -eq 0 ]
    then
        echo "$file"
    fi
done

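【Discussion】:

【Solution 2】:

The brute-force approach may not be the fastest, but as long as you don't have 100,000+ words and stories, it's fine. Basically, you use grep to test whether each file contains each word, one at a time. If grep fails to find a word in a story, move on to the next story. If all the words are found in story, add story to the goodstories array. At the end, just print all the goodstories: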

#!/bin/bash

declare -a words        # array containing all words
declare -a goodstories  # array containing stories with all words

words=( `< /path/to/words` )    # fill words array

## for each stories file (assumed they exist in dir of their own)
## NOTE: find's output is word-split here, so file names containing
## whitespace will break this loop
for s in `find /path/to/stories/base/dir -type f` ; do

    wfound=0                    # all words found flag initialized to 'true'

    ## for each word in words
    for w in "${words[@]}"; do

        ## test that word is in story (substring match), if not set wfound=1 break
        grep -q -- "$w" "$s" &>/dev/null || {

            wfound=1
            break

        }

    done

    ## if grep found all words, add story to goodstories array
    test "$wfound" -eq 0 && goodstories+=( "$s" )

done

## output your list of goodstories

if test "${#goodstories[@]}" -gt 0 ; then

    echo -e "\nStories that contained all words:\n"
    for s in "${goodstories[@]}"; do

        echo "  $s"

    done

else

    echo "No stories contained all words"

fi

exit 0

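Note: I did not create words or story files to test with, so if you find a typo etc., please treat the code as pseudocode. That said, it wasn't just slapped together either...

【Discussion】:

If the is in words and there is in a story file, that grep command will incorrectly report that the word the is present in that file. You need the -w arg for grep (maybe GNU-only?). General advice: don't use the deprecated backticks to execute commands, use $(..) instead. Don't use a for loop over the output of find, use find ... | while IFS= read -r s instead. Use (( var == 0 )) for arithmetic, not test $var == 0. There is no need for the final exit 0, that's the default.
Ed, I appreciate the tips, but as for backticks, what reference are you relying on to conclude that they are deprecated? I traditionally used $(), but then others complained about portability. I see the rest of the advice the same way. find | while: I have seen the argument for eliminating pipes time and again, hence the for. test blah is likewise just for portability. Yes, exit 0 is the default, but it's akin to dotting i's and crossing t's. There are countless conflicting "do it this way" rules; it would be nice if there were one agreed standard.
Maybe deprecated is too strong a word, but backticks are inferior in every way and are only required in the oldest non-POSIX-compliant shells (see mywiki.wooledge.org/BashFAQ/082), so using them for portability serves no purpose. find | while is the simple way to handle file names that contain spaces; there are other options if you don't want the pipe, but for file in $(find...) is not one of them. test "$var" -eq 0 is less readable than (( var == 0 )) and, I suspect, has more caveats depending on the value of var. You can add exit 0 if you like, but it doesn't do anything.
OK, I can use while IFS='as needed' read -r s; do ..stuff..; done <<< $(find stuff) and eliminate the pipe; but I also don't see any contraindication for for file in $(find stuff); do... I may be missing something I don't (currently) know, but what is wrong with for .. in $(find ..)? You can set the same IFS=$'stuff' for it, just as in the other form, so word splitting isn't a problem. What makes you say it's not an option?
Fair enough. Thanks for the discussion and the tips, it's good for all of us (especially me).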
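
On the question raised above about what goes wrong with for .. in $(find ..), a small sketch may help. It is not from the thread, and both function names are made up. An unquoted $(find ...) is word-split (and glob-expanded) by the shell, so a file named "a story.txt" is seen as two items, while find | while IFS= read -r receives it as one line.

```shell
#!/usr/bin/env bash
# Hypothetical demo: print each file found under a directory in brackets so
# that the boundaries of every item are visible.

# Variant 1: for over $(find ...). The unquoted command substitution is
# split on whitespace, so "a story.txt" becomes "a" and "story.txt".
list_files_with_for() {
    local file
    for file in $(find "$1" -type f); do
        printf '[%s]\n' "$file"
    done
}

# Variant 2: find | while read. IFS= keeps leading/trailing blanks and -r
# keeps backslashes, so each line of find's output arrives unchanged.
# (Names containing newlines would still break; in bash, find -print0 with
# read -d '' covers that case where find supports -print0.)
list_files_with_read() {
    find "$1" -type f | while IFS= read -r file; do
        printf '[%s]\n' "$file"
    done
}
```

Running both on a directory that holds a single file called "a story.txt" makes the difference visible: the first prints two bracketed items, the second prints one.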

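【Solution 3】:

Assuming your words file doesn't contain RE metacharacters, and using GNU awk for the \<...\> word boundaries:

To list the files that contain at least one of the words: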

awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            print FILENAME
            nextfile    # GNU awk: skip the rest of this file so its name prints only once
        }
    }
}
' words story*

To list the files that contain all of the words (GNU awk additionally for ENDFILE, delete(array) and length(array)):

awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            found[word]
        }
    }
}
ENDFILE {
    if ( length(found) == length(words) ) {
        print FILENAME
    }
    delete found
}
' words story*

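【Discussion】:

Won't the second one fail, since missing will always be set to 1 because every other word won't match the line? Wouldn't ($0 ~ word){found =1} be better?
Ah, good point, the second one as written would require every word to match on every line of the story file, not just once in the file. Your proposal wouldn't work because it would set found for any word found, whereas we need to set it when all words are found. I just updated my answer. I wonder if that's the reason for the downvote?
The comment I left was too long to edit. I meant setting a flag like found that you could then increment. If it matches the word count, print it. Much like what you changed it to :)
Incrementing a variable wouldn't work, since the same word could appear multiple times. You need to do something with the array, either add to one, or delete from words, or...
Oh yes, I didn't think it through properly. Mine would work if each word appeared only once.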
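
The point settled in these comments (count distinct words, not matches) can also be sketched with grep and sort instead of an awk array. This is only an illustration under stated assumptions: one plain word per line in the word list, no blank lines, and a grep that supports -o and -w (GNU and BSD grep do; POSIX requires neither). The function name stories_with_all_words is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the files that contain every entry of the word
# list at least once, as a whole word.
stories_with_all_words() {
    local wordfile=$1 need have file
    shift
    # Number of distinct words that must be found.
    need=$(( $(sort -u "$wordfile" | wc -l) ))
    for file in "$@"; do
        # -o prints every match on its own line; sort -u collapses repeats,
        # so a word that occurs many times is still counted once.
        have=$(( $(grep -o -w -F -f "$wordfile" -- "$file" | sort -u | wc -l) ))
        if [ "$have" -eq "$need" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

It would be called as stories_with_all_words words story*. A plain counter incremented per match would accept a story that says the three times and never says cat; comparing the number of distinct matches avoids that.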

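【Solution 4】:

If you have a list of the unique words to search for and, for each story, the list of unique words it contains, the problem becomes easier to solve with fgrep -c: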

# remove duplicate words in a file
# place them one per line (\n in the sed replacement needs GNU sed)
function buildWordList() {
    sed -e 's/[^[:alpha:]][^[:alpha:]]*/\n/g' "$1" |
           tr '[:upper:]' '[:lower:]' | sort -u | sed '/^$/d'
    #      ^^^^^^^^^^^^^^^^^^^^^^^^^^
    #      Works for English.
}

TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

buildWordList words | sed 's/.*/^&$/' > "$TMP/Words"
#                         ^^^^^^^^^^^
#                     force whole word matching (as we have 1 word/line)
#                     `grep -w` could have been used below instead, but I
#                     don't know whether it is GNU-specific
count=$(wc -l < "$TMP/Words")

for file in story*
 do
    # build a list of unique words in the story, one per line
    buildWordList "${file}" > "$TMP/FileWords"
    if [ "$( grep -c -f "$TMP/Words" "$TMP/FileWords" )" -eq "$count" ]
     then
       echo "${file}"
     fi
 done

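【Discussion】:

What does the $ at the start of your sed command (sed -e $'<script>') do? Note that this will match both the and there, see my other comments in this thread. You need to add a trap to remove the tmp files you are creating.
@EdMorton Thanks, Ed, for pointing out the various errors in "our" answers! I fixed the extra $ at the start of the sed command, used mktemp to create a temporary directory and, most importantly, may have fixed the "matching prefix" problem (the vs. there). I say may because I can't test it right now...

【Solution 5】: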

for EachFile in story*
 do
    sed 's/  */\
/g' "${EachFile}" | sort -u > /tmp/StoryInList
    if [ "$( grep -F -w -c -v -f /tmp/StoryInList words )" -eq 0 ]
     then
       echo "${EachFile}"
     fi
 done
rm /tmp/StoryInList

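A bit of batch code, but it leans on grep's strength, which gets the job done even with a few thousand words.

【Discussion】:

Words aren't always separated by whitespace characters (see how @SylvainLeroux does it), and this will match both the and there (see my other comments).
Oh, very good comment, I had missed that; I adjusted the reply to add the -w option so that (f)grep compares whole words only.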
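
Both objections in the first comment (punctuation next to words, and the vs. there) can be handled by tokenizing first and then comparing word sets. The following is a rough sketch, not taken from the thread; it assumes English text, a mktemp command, and case-insensitive comparison, and the names unique_words and contains_all_words are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a text file into lowercase words, one per line,
# sorted and de-duplicated. Every run of non-letters acts as a separator,
# so "cat," and "cat." both yield "cat".
unique_words() {
    tr -cs '[:alpha:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' | LC_ALL=C sort -u
}

# Hypothetical helper: succeed if every word of the list occurs in the story.
contains_all_words() {
    local wordfile=$1 story=$2 tmpfile missing
    tmpfile=$(mktemp) || return 2
    unique_words "$story" > "$tmpfile"
    # comm -23 keeps the lines that appear only in its first input (stdin
    # here), i.e. the listed words that are missing from the story.
    missing=$(unique_words "$wordfile" | LC_ALL=C comm -23 - "$tmpfile")
    rm -f "$tmpfile"
    [ -z "$missing" ]
}
```

A loop such as for f in story*; do contains_all_words words "$f" && echo "$f"; done would then print the matching stories. Because whole tokens are compared, there in a story does not satisfy the in the word list.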