【Question title】:Find the word in large file and copy the line which contains that word [closed]
【Posted】:2020-06-25 04:17:32
【Question description】:

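I have two files, File_A and File_B. File_A contains one word per line, and File_B contains sentences. I have to read each word from File_A, search File_B for the lines that start with that word, and copy each such line in full to File_C. Both File_A and File_B are sorted.

Example

File_A: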

he
I
there

File_B:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

File_C:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

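I tried a shell script, but it is a heuristic approach that rescans File_B for every word in File_A, so it takes a long time. Both File_A and File_B are large files.

Here is the code I tried: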

#!/bin/bash

for first in `cat File_A`
do
    while read line
    do
        first_col=$(echo $line | head -n1 | awk '{print $1;}')
        if [[ "$first" == "$first_col" ]]
        then
            echo $line >> File_C
        fi
    done <File_B
done

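【Question comments】:

  • If you run into a specific problem while solving this yourself, you can ask about it here. You should also first decide which programming language you want to use.
  • Please show your effort: include your code in the question, even if it doesn't work.
  • @MichaelButscher I have already tagged the programming languages.
  • @DYZ Here is the code I wrote: #! /bin/bash for first in `cat File_A` do while read line do first_col=$(echo $line|head -n1 | awk '{print $1;}') if [[ "$first" == "$first_col" ]] then echo $line >> File_C fi done <File_B done
  • Please make it part of the question. Do you expect us to read an unformatted shell script in a comment?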

Tags: python python-3.x shell perl sh


【Solution 1】:

In a shell that understands <() process substitution (such as bash or zsh, but not POSIX sh), using GNU grep:

grep -wf <(sed 's/^/^/' file_a) file_b > file_c

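-f filename reads the list of patterns/words from the given file, in this case the output of sed 's/^/^/' file_a, which prepends a ^ start-of-line anchor to every line (this will not work properly if your file_a contains characters that are special in regular expressions). -w matches whole words only, to avoid the case where your word is merely a prefix of the first word on a line.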
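For readers who prefer Python, the same idea can be sketched roughly as follows. This is not what grep does internally, only an approximation under stated assumptions: the function names build_prefix_pattern and matching_lines are made up for illustration, re.escape is added to sidestep the special-character caveat above, and \b stands in for the effect of -w (so, like -w, it still treats an apostrophe as a word boundary, and it is only meaningful for words that end in a letter, digit, or underscore).

```python
import re


def build_prefix_pattern(words):
    """Compile one regex that matches any of `words` as a whole word at line start.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # re.escape neutralises regex metacharacters in the words, which the
    # sed-generated pattern list does not do.
    escaped = [re.escape(w) for w in words if w]
    if not escaped:
        # No words at all: return a pattern that can never match.
        return re.compile(r"(?!)")
    return re.compile(r"^(?:" + "|".join(escaped) + r")\b")


def matching_lines(words, lines):
    """Yield each line that starts with one of `words` followed by a word boundary."""
    pattern = build_prefix_pattern(words)
    for line in lines:
        if pattern.match(line):
            yield line
```

Because all words are folded into a single compiled pattern, the sentence file is read only once, regardless of how many words the word list contains.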

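【Comments】:

  • "search in File_B for lines that start with that word"
  • @DYZ Doh. Fixed.
  • @Shawn Thanks for your answer, but it makes some errors. It must match the word exactly. E.g. if File_A contains the word "that", it brings back all sentences starting with "that", "that'll", and "that's". I want only "that".
  • @Amoll Apostrophes count as word separators. Instead of using -w, you would have to add a trailing-whitespace check to the regular expressions built from the source words. Edge cases like this should be included in your sample data.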
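The stricter requirement raised in these comments (match "that" but not "that's" or "that'll") can be sketched in Python by comparing the first whitespace-delimited token of each line against a set of words. This is only one possible approach, under assumptions not stated in the thread: the helper names are hypothetical, the files are assumed to be UTF-8, and the first word is assumed to be followed by whitespace, so a line beginning with "that," would not match "that".

```python
def load_words(word_lines):
    """Build a set of words from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in word_lines if line.strip()}


def lines_with_first_word_in(words, lines):
    """Yield lines whose first whitespace-delimited token is exactly in `words`."""
    for line in lines:
        tokens = line.split(None, 1)  # split off only the first token
        if tokens and tokens[0] in words:
            yield line


def copy_matching_lines(words_path, text_path, out_path):
    """Stream text_path once and write the matching lines to out_path."""
    with open(words_path, encoding="utf-8") as fh:
        words = load_words(fh)
    with open(text_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines_with_first_word_in(words, src))
```

With the file names from the question, the call would be copy_matching_lines("File_A", "File_B", "File_C"). Set membership is a constant-time lookup on average, so the sentence file is scanned once rather than once per word.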
【Solution 2】:

Please see the following code, which is based on your shell script.

use strict;
use warnings;
use feature 'say';

my $file_a = 'File_A';
my $file_b = 'File_B';
my $file_c = 'File_C';

# read File_A into array @data_a
open my $fh_a, '<', $file_a
    or die "Couldn't open $file_a $!";

my @data_a = <$fh_a>;

close $fh_a;

# read File_B into array @data_b
open my $fh_b, '<', $file_b
    or die "Couldn't open $file_b $!";

my @data_b = <$fh_b>;

close $fh_b;

chomp @data_a;      # snip eol
chomp @data_b;      # snip eol

# store found result into File_C
open my $fh_c, '>', $file_c
    or die "Couldn't open $file_c $!";

for my $word_a (@data_a) {
    for my $line_b (@data_b) {
        say $fh_c $line_b if $line_b =~ /^$word_a\b/;
    }
}

close $fh_c;

Input File_A

he
I
there

Input File_B

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

Result File_C

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

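【Comments】:

  • we don't know what he is doing. According to the OP, this line should not be in the output.
  • You have /\b$word_a\b/; you need /^$word_a\b/. The question says "lines in File_B that start with that word".
  • @Dave Cross Sorry, I missed the word "starting" in the OP's post (corrected).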
【Solution 3】:

Something similar in Perl:

#!/usr/bin/perl

use strict;
use warnings;

# Open File_A
open my $fh_a, '<', 'File_A' or die $!;

# Read words from File_A and remove newlines
chomp(my @words = <$fh_a>);

# Create a regex matching the words from File_A
# at the start of a line
my $word_re = '^(' . join('|', @words) . ')\b';
$word_re = qr($word_re);

# Open files B and C
open my $fh_b, '<', 'File_B' or die $!;
open my $fh_c, '>', 'File_C' or die $!;

# Read File_B a line at a time and write to
# File_C any lines that match our regex.
while (<$fh_b>) {
  print $fh_c $_ if /$word_re/;
}

【Comments】:
