Find the word in a large file and copy the line which contains that word [closed]: Answers

【Question Title】: Find the word in large file and copy the line which contains that word [closed]
【Posted】: 2020-06-25 04:17:32
【Question Description】:

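I have two files, File_A and File_B. File_A contains one word per line, and File_B contains sentences. I have to read each word from File_A, search File_B for lines that start with that word, and copy the whole line to File_C. Both File_A and File_B are sorted.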

Example

File_A:

he
I
there

File_B:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

File_C:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

I tried a shell script, but it is a heuristic approach and so takes a very long time. Both File_A and File_B are large files.

This is the code I tried:

#! /bin/bash

for first in `cat File_A`
do
    while read line
    do
        first_col=$(echo $line|head -n1 | awk '{print $1;}')
        if [[ "$first" == "$first_col" ]]
        then
            echo $line >> File_C
        fi
    done <File_B
done

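【Question Discussion】:

If you run into a specific problem while solving this yourself, you can ask about it here. You should also first decide which programming language you want to use.
Please show your effort: include your code in the question, even if it doesn't work.
@MichaelButscher I have tagged the programming languages.
@DYZ This is the code I wrote: #! /bin/bash for first in `cat File_A` do while read line do first_col=$(echo $line|head -n1 | awk '{print $1;}') if [[ "$first" == "$first_col" ]] then echo $line >> File_C fi done
Please make it part of the question. Do you expect us to read an unformatted shell script in a comment?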

Tags: python python-3.x shell perl sh

【Solution 1】:

Using GNU grep in a shell that understands <() process substitution (such as bash or zsh, but not POSIX sh):

grep -wf <(sed 's/^/^/' file_a) file_b > file_c

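-f filename reads the list of patterns/words from the given file, in this case the output of sed 's/^/^/' file_a, which puts a ^ start-of-line anchor at the beginning of each line (this will not work properly if your file_a contains characters that are special in regular expressions). -w matches only whole words, which avoids the case where your word is merely a prefix of the first word of a line.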
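The caveat about regex-special characters can be illustrated with a small Python sketch. This is not part of the answer above: the helper names `anchored_patterns` and `matching_lines` and the sample sentences are made up for illustration, and Python's `re.escape` stands in for whatever escaping you would apply before handing the words to grep.

```python
import re

def anchored_patterns(words, escape=True):
    """Return one compiled start-of-line pattern per word.

    With escape=False the word is used as a raw regex, which mirrors
    what happens when sed only prepends '^' to each line of file_a.
    """
    patterns = []
    for word in words:
        body = re.escape(word) if escape else word
        patterns.append(re.compile("^" + body + r"\b"))
    return patterns

def matching_lines(lines, patterns):
    """Keep the lines matched by at least one pattern, in input order."""
    return [line for line in lines if any(p.search(line) for p in patterns)]

# Hypothetical sample data, not taken from the question.
lines = ["a.m is when I wake up.", "arm yourself with patience."]

# Unescaped, the '.' in "a.m" matches any character, so "arm" also matches.
print(matching_lines(lines, anchored_patterns(["a.m"], escape=False)))

# Escaped, only the literal "a.m" prefix matches.
print(matching_lines(lines, anchored_patterns(["a.m"], escape=True)))
```

If the words in file_a are plain letters only, the escaping step changes nothing, so it is cheap insurance rather than a requirement.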

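【Discussion】:

"search File_B for lines starting with that word"
@DYZ Doh. Fixed.
@Shawn Thanks for your answer, but it makes some mistakes. It must match the word exactly. For example, if File_A contains the word "that", it brings back all sentences starting with "that", "that'll", and "that's". I want only the ones starting with "that".
@Amoll An apostrophe counts as a word break. Instead of using -w, you will have to add a trailing whitespace check to the regular expressions created from the source words. Edge cases like this should be included in your example data.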
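To make the apostrophe point concrete, here is a minimal Python sketch of a "trailing whitespace check". The function name `starts_with_word` and the sample strings are assumptions for illustration only. It requires the word to be followed by whitespace or the end of the line, instead of relying on a `\b` word boundary.

```python
import re

def starts_with_word(line, word):
    """True if `line` begins with `word` followed by whitespace or end of line."""
    # re.match anchors at the start of the string; the lookahead rejects
    # continuations such as an apostrophe, which \b would accept.
    return re.match(re.escape(word) + r"(?=\s|$)", line) is not None

# Hypothetical sample lines built around the "that" example in the comments.
samples = ["that is fine.", "that's fine.", "that'll do.", "that"]

print([s for s in samples if starts_with_word(s, "that")])
# A plain word boundary, by contrast, still accepts "that's":
print(re.match(r"that\b", "that's fine.") is not None)
```

Note that this stricter check also rejects a word followed directly by punctuation, such as "that, then", so whether it is the right rule depends on the data.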

【Solution 2】:

Please see the following code, which is based on your shell script.

use strict;
use warnings;
use feature 'say';

my $file_a = 'File_A';
my $file_b = 'File_B';
my $file_c = 'File_C';

# read File_A into array @data_a
open my $fh_a, '<', $file_a
    or die "Couldn't open $file_a $!";

my @data_a = <$fh_a>;

close $fh_a;

# read File_B into array @data_b
open my $fh_b, '<', $file_b
    or die "Couldn't open $file_b $!";

my @data_b = <$fh_b>;

close $fh_b;

chomp @data_a;      # snip eol
chomp @data_b;      # snip eol

# store found result into File_C
open my $fh_c, '>', $file_c
    or die "Couldn't open $file_c $!";

for my $word_a (@data_a) {
    for my $line_b (@data_b) {
        say $fh_c $line_b if $line_b =~ /^$word_a\b/;
    }
}

close $fh_c;

Input File_A

he
I
there

Input File_B

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

Result File_C

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
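The nested loops above scan all of File_B once for every word in File_A, which gets expensive when both files are large. A possible alternative, sketched here in Python under the assumption that "starts with the word" means "the first whitespace-delimited token equals the word", loads File_A into a set and reads File_B once. The function name `copy_matching_lines` is hypothetical, and the demonstration writes the question's sample data to a temporary directory.

```python
import os
import tempfile

def copy_matching_lines(words_path, sentences_path, out_path):
    """Copy lines whose first whitespace-delimited token is in the word list.

    Returns the number of lines written to out_path.
    """
    with open(words_path, encoding="utf-8") as fh:
        words = {line.strip() for line in fh if line.strip()}
    copied = 0
    with open(sentences_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:                  # one streaming pass over File_B
            tokens = line.split(None, 1)  # first token and the rest
            if tokens and tokens[0] in words:
                dst.write(line)
                copied += 1
    return copied

# Small demonstration using the sample data from the question.
with tempfile.TemporaryDirectory() as tmp:
    a, b, c = (os.path.join(tmp, n) for n in ("File_A", "File_B", "File_C"))
    with open(a, "w", encoding="utf-8") as fh:
        fh.write("he\nI\nthere\n")
    with open(b, "w", encoding="utf-8") as fh:
        fh.write(
            "he was at least equally intrigued by hers.\n"
            "I guess he's going to use it in his business.\n"
            "I don't know if he's angry or not.\n"
            "there were five dogs.\n"
            "there is fly in my soup.\n"
            "we don't know what he is doing.\n"
        )
    copy_matching_lines(a, b, c)
    with open(c, encoding="utf-8") as fh:
        result = fh.read().splitlines()

print(result)
```

Lines come out in File_B order rather than grouped by word. Only File_A has to fit in memory, since File_B is streamed line by line.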

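【Discussion】:

we don't know what he is doing. According to the OP, this line should not be in the output.
You have /\b$word_a\b/, you need /^$word_a\b/. The question says "lines in File_B that start with that word".
@Dave Cross Sorry, I missed the word "start" in the OP's post (corrected).

【Solution 3】:

Something similar in Perl: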

#!/usr/bin/perl

use strict;
use warnings;

# Open File_A
open my $fh_a, '<', 'File_A' or die $!;

# Read words from File_A and remove newlines
chomp(my @words = <$fh_a>);

# Create a regex matching the words from File_A
# at the start of a line (quotemeta escapes any
# regex metacharacters in the words)
my $word_re = '^(' . join('|', map { quotemeta } @words) . ')\b';
$word_re = qr($word_re);

# Open files B and C
open my $fh_b, '<', 'File_B' or die $!;
open my $fh_c, '>', 'File_C' or die $!;

# Read File_B a line at a time and write to
# File_C any lines that match our regex.
while (<$fh_b>) {
  print $fh_c $_ if /$word_re/;
}
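For readers who picked the python tag, the same idea (one combined regex, File_B processed a line at a time) might look roughly like the sketch below. The names `build_prefix_regex` and `filter_lines` are made up for illustration, the sample list stands in for reading File_B, and `re.escape` is added on the assumption that the words should be treated literally.

```python
import re

def build_prefix_regex(words):
    """Compile one regex matching any of `words` at the start of a line."""
    words = [w for w in words if w]
    if not words:
        raise ValueError("need at least one non-empty word")
    alternatives = "|".join(re.escape(w) for w in words)
    # \b mirrors the Perl version, so "he's" still counts as starting with "he".
    return re.compile(r"^(?:" + alternatives + r")\b")

def filter_lines(lines, regex):
    """Lazily yield the lines that start with one of the words."""
    return (line for line in lines if regex.match(line))

word_re = build_prefix_regex(["he", "I", "there"])

# Sample sentences from the question, standing in for File_B.
file_b = [
    "he was at least equally intrigued by hers.",
    "I guess he's going to use it in his business.",
    "I don't know if he's angry or not.",
    "there were five dogs.",
    "there is fly in my soup.",
    "we don't know what he is doing.",
]

kept = list(filter_lines(file_b, word_re))
print(kept)
```

Because `filter_lines` is a generator, it can be fed an open file object directly, so File_B never has to be held in memory as a whole.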

【Discussion】: