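Finding and replacing many words

【Question title】: Finding and replacing many words
【Posted】: 2012-01-04 20:15:14
【Question】:

I often need to make many replacements in files. To handle this, I created two files, old.text and new.text. The first contains a list of words that must be found. The second contains the words that should replace them.

All of my files use UTF-8 and are in various languages.

I built this script, which I hoped would do the replacements. It reads old.text one line at a time, then replaces that line's word in input.txt with the corresponding word from new.text.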

#!/bin/sh
number=1
while read linefromoldwords
do
    echo $linefromoldwords
    linefromnewwords=$(sed -n '$numberp' new.text)
    awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
    number=$number+1
echo $number
done <  old.text

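However, my solution does not work well. When I run the script:

On line 6, the sed command does not know where $number ends.
The $number variable changes to "0+1", then "0+1+1", when it should change to "1", then "2".
The awk line does not seem to do anything except copy input.txt verbatim to output.txt.

Do you have any suggestions?

Update:

The marked answer works well, but I use this script a lot and it takes many hours to finish. So I am offering a bounty for a solution that completes these replacements much faster. A solution in BASH, Perl, or Python 2 is fine, as long as it remains UTF-8 compatible. If you think another solution using other software commonly available on Linux systems would be faster, that is fine too, as long as it does not need heavy dependencies.

【Comments】:

Have you considered using sed?
I have updated the script: sed -i "s/ $i / $j /g" ./main.file now includes spaces in the pattern. If it does not work, let me know and we can look into it further.
Have you tried merging the two files and using the result as your sed script file?
I have added another answer for this. I am not sure whether adding another one rather than editing the existing one was a good idea, but I hope it helps.
I think the fastest solution could easily be written in C. Are you only considering scripting languages?

Tags: ruby perl bash python-2.7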

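【Solution 1】:

On line 6, the sed command does not know where $number ends.

Try quoting the variable with double quotes:

linefromnewwords=$(sed -n "$number"p newwords.txt)

The $number variable changes to "0+1", then "0+1+1", when it should change to "1", then "2".

Do this instead:

number=`expr $number + 1`

The awk line does not seem to do anything except copy input.txt verbatim to output.txt.

awk does not see variables from outside its own scope (shell variables inside the single-quoted program are not expanded). User-defined variables in awk must be defined where they are used or predefined in awk's BEGIN block. You can pass shell variables in with the -v option.

Here is a solution in bash that does what you want.

Bash solution: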

#!/bin/bash

while read -r sub && read -r rep <&3; do
  sed -i "s/ $sub / $rep /g" main.file
done <old.text 3<new.text

This solution reads one line at a time from the substitution file and the replacement file and performs an in-place sed substitution for each pair.
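
For comparison, the same pairing of lines can be sketched in Python. This is only an illustration under assumptions, not part of the answer: the function name `replace_pairs` is made up here, the words are treated as literal strings rather than sed patterns, and, like the sed command, a word is replaced only when it has a space on both sides.

```python
def replace_pairs(text, old_words, new_words):
    """Apply each old -> new replacement in turn, as the shell loop does.

    A word is only replaced when it has a space on both sides, which
    mirrors the "s/ old / new /g" form of the sed command.
    """
    for sub, rep in zip(old_words, new_words):
        text = text.replace(" " + sub + " ", " " + rep + " ")
    return text


if __name__ == "__main__":
    sample = " 19 adads\n 20 aaaadsf\n"
    print(replace_pairs(sample, ["19", "20"], ["A", "B"]), end="")
```

Like the shell loop, this scans the whole text once per word pair, so the running time grows with the length of the word list.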

【Discussion】:

【Solution 2】:

I found a generic Perl solution that works well for replacing the keys in a map with their associated values:

my %map = (
    19 => 'A',
    20 => 'B',
);

# quotemeta keeps keys that contain regex metacharacters literal
my $key_regex = '(' . join('|', map { quotemeta } keys %map) . ')';

while (<>) {
    s/$key_regex/$map{$1}/g;
    print $_;
}

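You will have to read your two files into the map first (obviously), but once that is done you make only a single pass over each line, with one hash lookup per replacement. I have only tried it with relatively small maps (around 1,000 entries), so there are no guarantees if your map is significantly larger.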
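
The single-pass idea can be sketched in Python as well. This is a rough equivalent under assumptions rather than a port of the Perl above: `build_replacer` is a hypothetical helper name, and the keys are escaped so they match literally.

```python
import re


def build_replacer(mapping):
    """Return a function that replaces every key of `mapping` in one pass."""
    if not mapping:
        # An empty alternation would match the empty string everywhere.
        return lambda text: text
    # One alternation of all keys; each key is escaped to match literally.
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    # Each match costs one dictionary lookup, as in the hash-based Perl loop.
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    replace = build_replacer({"19": "A", "20": "B"})
    print(replace("19 adads\n20 aaaadsf"))
```

The pattern is compiled once and reused for every line, so the cost per line does not depend on how many times the word list would otherwise have to be looped over.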

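【Discussion】:

【Solution 3】:

EDIT - I just noticed that two answers like mine are already here... so you can disregard mine :)

I believe this Perl script, while not using fancy sed or awk tricks, gets the job done fairly quickly.

I did take the liberty of using a different format for the old_word to new_word mapping: CSV. If that is too much trouble, let me know and I will add a script that takes your old.txt and new.txt and builds the CSV file.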
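
The conversion mentioned above could look roughly like the following Python sketch. It is not the script the author offered to write; `write_lut` is a made-up name, and it assumes both lists have one word per line with no commas inside the words.

```python
import csv
import io


def write_lut(old_lines, new_lines, out):
    """Write one "old,new" row per pair of lines to the file object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for old, new in zip(old_lines, new_lines):
        writer.writerow([old.strip(), new.strip()])


if __name__ == "__main__":
    buf = io.StringIO()
    write_lut(["19\n", "20\n"], ["A\n", "B\n"], buf)
    print(buf.getvalue(), end="")
```

With real files, the two line lists would come from open file objects and `out` would be the CSV file opened for writing.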

Give it a try and let me know!

By the way, if any of the Perl experts here can suggest a more idiomatic way to do what I did here, I would love to read the comments:

    #! /usr/bin/perl
    # getting the user's input
    if ($#ARGV == 1)
        {
        $LUT_file = shift;
        $file = shift;
        $outfile = $file . ".out.txt";
        }
    elsif ($#ARGV == 2)
        {
        $LUT_file = shift;
        $file = shift;
        $outfile = shift;
        }
    else { &usage; }

    # opening the relevant files

    open LUT, "<",$LUT_file or die "can't open $LUT_file for reading!\n : $!";
    open FILE,"<",$file or die "can't open $file for reading!\n : $!";
    open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";

    # getting the lines from the text to be changed and changing them
    %word_LUT = ();
    WORD_EXT:while (<LUT>)
        {
        next WORD_EXT unless $_ =~ m/(\w+),(\w+)/;   # skip lines that are not "old,new"
        $word_LUT{ $1 } =  $2 ;
        }
    close LUT;

    OUTER:while ($line = <FILE>)
        {
        @words = split(/\s+/,$line);
        for( $i = 0; $i <= $#words; $i++)
            {
            if ( exists ($word_LUT { $words[$i] }) ) 
                {
                $words[$i] = $word_LUT { $words[$i] };
                }

            }
        $newline = join(' ',@words);
        print "old line - $line\nnewline - $newline\n\n";
        print OUT $newline . "\n";

        }   
    # all lines have been processed; close the files

        close OUT;close FILE;

    # Sub Routines
    #
    #

    sub usage(){
    print "\n\nreplacer.pl Usage:\n";
    print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
    print "<LUT file> -    a LookUp Table of words, from the old word to the new one.
    \t\t\twith the following csv format:
    \t\t\told word,new word\n";
    print "<Input file>       -    the input file\n";
    print "<out file>         -    out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";

    exit;
    }

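【Discussion】:

【Solution 4】:

I am not sure why most of the previous posters insist on using regular expressions for this task. I think this will be faster than most (if not the fastest method).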

use warnings;
use strict;

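open (my $fh_o, '<', "old.txt") or die "can't open old.txt: $!";
open (my $fh_n, '<', "new.txt") or die "can't open new.txt: $!";

my @hay = <>;
my @old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my @new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;

# map each old word to its replacement with a hash slice
my %r;
@r{@old} = @new;

# split on whitespace but keep the separators, so the layout is preserved;
# join with '' so no extra space is inserted between input lines
print defined  $r{$_} ? $r{$_} : $_ for split (
  /(\s+)/, join('', @hay)
);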

Usage: perl script.pl /file/to/modify; the result is printed to stdout.
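
The same token-by-token lookup can be sketched in Python. This is an illustrative equivalent under assumptions, with `replace_tokens` as a made-up name; the only regular expression is the splitter, and every word is looked up directly in a dictionary.

```python
import re


def replace_tokens(text, mapping):
    """Replace whole whitespace-delimited tokens that appear in `mapping`."""
    # The capturing group makes re.split keep the whitespace runs, so
    # joining the pieces back together preserves the original layout.
    pieces = re.split(r"(\s+)", text)
    return "".join(mapping.get(piece, piece) for piece in pieces)


if __name__ == "__main__":
    print(replace_tokens("19 adads\n20\taaaadsf\n", {"19": "A", "20": "B"}), end="")
```

Because tokens are compared whole, a word with punctuation attached (for example "19,") is not replaced by this approach.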

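【Discussion】:

【Solution 5】:

This Python 2 script combines the old words into a single regular expression, then substitutes the corresponding new word based on the index of the old word that matched. Old words are matched only when they stand alone as distinct words. That distinctness is achieved by surrounding each word with r'\b', the regular-expression word boundary.

Input comes from the command line (the commented-out line is an alternative I used while developing in IDLE). Output goes to stdout.

The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.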

#!/usr/bin/env python

import sys, re

def replacer(match):
    global new
    return new[match.lastindex-1]

if __name__ == '__main__':
    fname_old, fname_new, fname_txt = sys.argv[1:4]
    #fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()

    with file(fname_old) as f:
        # Form regular expression that matches old words, grouped in order
        old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
                               for word in f.read().strip().split()) + ')'
    with file(fname_new) as f:
        # Ordered list of replacement words 
        new = [word for word in f.read().strip().split()]
    with file(fname_txt) as f:
        # input text
        txt = f.read()
    # Output the new text
    print( re.subn(old, replacer, txt)[0] )

I just gathered some statistics on a text file of about 100K bytes:

Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
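Top 10 distinct word occurrences in text: 2664 = 15.57%

The text was 250 paragraphs of lorem ipsum generated from here. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN, in that order.

The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal. The Python solution will also replace words that are followed by a newline or by punctuation, as well as by any whitespace (including tabs, etc.).

Someone commented that a C solution would be both simple to create and the fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python can be replicated in Ruby or Perl.

- Paddy.

【Discussion】:

【Solution 6】:

Here is a solution in Perl. It could be simplified if you combined your input word lists into one list, with each line containing the mapping of an old word to a new word.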

#!/usr/bin/env perl

# usage:
#   replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt

use strict;
use warnings;

sub read_words {
    my $file = shift;

    open my $fh, "<", $file or die "Error reading file: $file; $!\n";
    my @words = <$fh>;
    chomp @words;
    close $fh;

    return \@words;
}

sub word_map {
    my ($old_words, $new_words) = @_;

    if (scalar @$old_words != scalar @$new_words) {
        warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
    }
    my $list_size = scalar @$old_words;
    $list_size = scalar @$new_words if $list_size > scalar @$new_words;

    my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;

    return \%map;
}

sub build_regex {
    my $words = shift;

    # longest words first, each escaped so it is matched literally
    my $pattern = join "|", map { quotemeta } sort { length $b <=> length $a } @$words;

    return qr/$pattern/;
}

my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);

my $input_file = shift;
open my $input, "<", $input_file or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
    # use the capture group rather than $&, which can slow down matching on older perls
    s/($old_pattern)/$word_map->{$1}/g;
    print;
}
close $input;
__END__

Old words file:

$ cat old.txt 
19
20

New words file:

$ cat new.txt 
A
B

Input file:

$ cat input.txt 
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf

Resulting output:

$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
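
The build_regex step in the script sorts the words longest first before joining them. The Python sketch below illustrates why that ordering matters when one word is a prefix of another. The word lists and the name `build_pattern` are hypothetical, chosen only for this illustration.

```python
import re


def build_pattern(words, longest_first=True):
    """Join `words` into one alternation, optionally longest word first."""
    if longest_first:
        words = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in words))


if __name__ == "__main__":
    mapping = {"a": "X", "ab": "Y"}  # hypothetical word lists
    text = "ab a"
    for flag in (False, True):
        pattern = build_pattern(["a", "ab"], longest_first=flag)
        # Alternation tries the alternatives left to right at each position.
        print(flag, pattern.sub(lambda m: mapping[m.group(0)], text))
```

Without the sort, the shorter word "a" matches first and "ab" is never replaced as a whole; with the sort, "ab" is tried before "a".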

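【Discussion】:

【Solution 7】:

I love this kind of question, so here is my answer:

First, for the sake of simplicity, why not use a single file that holds both the source and its translation? I mean: (file name changeThis)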

hello=Bye dudes
the morNing=next Afternoon
first=last

Then you can define a suitable separator in the script. (file replaceWords.sh)

#!/bin/bash

SEP=${1}
REPLACE=${2}
FILE=${3}
while read -r transline
do
   origin=${transline%%${SEP}*}
   dest=${transline##*${SEP}}
   sed -i "s/${origin}/${dest}/gI" "$FILE"
done < "$REPLACE"

An example (file changeMe)

Hello, this is me. 
I will be there at first time in the morning

Call it

$ bash replaceWords.sh = changeThis changeMe

and you will get

Bye dudes, this is me.
I will be there at last time in next Afternoon

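Note the fun with sed's "i". "-i" means edit the source file in place, while "I" in the s// command means ignore case (a GNU extension, so check your sed implementation).

Note, of course, that bash while loops are much slower than Python or similar scripting languages. Depending on your needs, you can nest two while loops, the outer one over the source file and the inner one over the translations (changes), and echo everything to stdout for pipeline flexibility.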

#!/bin/bash

SEP=${1}
TRANSLATION=${2}
FILE=${3}
while IFS= read -r line
do
   while read -r transline
   do
      origin=${transline%%${SEP}*}
      dest=${transline##*${SEP}}
      line=$(printf '%s\n' "$line" | sed "s/${origin}/${dest}/gI")
   done < "$TRANSLATION"
   printf '%s\n' "$line"
done < "$FILE"
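
Since the answer notes that a Python version would be faster than the bash loops, here is a minimal sketch of the same separator-file idea. It rests on assumptions: the names `parse_translations` and `translate` are made up, origins are matched as literal text rather than as sed patterns, and a line with more than one separator is split at the first one.

```python
import re


def parse_translations(lines, sep="="):
    """Split each "origin<sep>dest" line into an (origin, dest) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if sep not in line:
            continue  # skip lines without a separator
        origin, dest = line.split(sep, 1)
        pairs.append((origin, dest))
    return pairs


def translate(text, pairs):
    """Apply every pair in order, ignoring case like sed's I flag."""
    for origin, dest in pairs:
        # A function replacement keeps any backslashes in `dest` literal.
        text = re.sub(re.escape(origin), lambda m, d=dest: d, text,
                      flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    rules = ["hello=Bye dudes\n", "the morNing=next Afternoon\n", "first=last\n"]
    source = "Hello, this is me.\nI will be there at first time in the morning\n"
    print(translate(source, parse_translations(rules)), end="")
```

This still makes one pass over the text per translation line, so it mirrors the structure of the shell script rather than the single-regex approach of other answers.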

【Discussion】:

【Solution 8】:

Here is a Python 2 script that should be both space- and time-efficient:

import sys
import codecs
import re

sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
               (line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))

regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))

for line in codecs.open("input.txt", "r", "utf-8"):
    result = regexp.sub(lambda match:sub[match.group(0)], line)
    sys.stdout.write(result.encode("utf-8"))

Here it is in action:

$ cat old.txt 
19
20
$ cat new.txt 
A
B
$ cat input.txt 
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py 
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$

EDIT: Hat tip to @Paddy3118 for the whitespace handling.

【Discussion】:

【Solution 9】:

This might work for you:

paste {old,new}words.txt | 
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' | 
sed -i -f - text.txt

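【Discussion】:

【Solution 10】:

This should cut the time down somewhat, since it avoids unnecessary looping.

Merge the two input files:

Let's assume you have two input files, old.text containing all the substitutions and new.text containing all the replacements.

We will use the following awk one-liner to create a new text file that will act as the sed script for your main file: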

awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text 

[jaypal:~/Temp] cat old.text 
19
20

[jaypal:~/Temp] cat new.text 
A
B

[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text

[jaypal:~/Temp] cat merge.text 
s/ 19 / A /g
s/ 20 / B /g

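Note: This format of substitution and replacement is based on your requirement that the words have spaces around them.

Use the merged file as a sed script:

Once the merged file is created, we will use the -f option of the sed utility.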

sed -f merge.text input_file

[jaypal:~/Temp] cat input_file 
 12 adsflljl
 12 hgfahld
 12 ash;al
 13 a;jfda
 13 asldfj
 15 ;aljdf
 16 a;dlfj
 19 adads
 19 adfasf
 20 aaaadsf

[jaypal:~/Temp] sed -f merge.text input_file 
 12 adsflljl
 12 hgfahld
 12 ash;al
 13 a;jfda
 13 asldfj
 15 ;aljdf
 16 a;dlfj
 A adads
 A adfasf
 B aaaadsf

You can redirect this to another file using the > operator.
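
The merge step can also be sketched in Python, in case awk is not at hand. This is a hypothetical helper (`build_sed_script` is not from the answer) and it assumes that no word contains "/" or other characters that are special to sed.

```python
def build_sed_script(old_words, new_words):
    """Return one space-delimited sed "s" command per word pair."""
    lines = []
    for old, new in zip(old_words, new_words):
        # Assumes neither word contains "/" or other sed metacharacters;
        # real word lists may need escaping first.
        lines.append("s/ {} / {} /g".format(old.strip(), new.strip()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sed_script(["19\n", "20\n"], ["A\n", "B\n"]), end="")
```

For the sample word lists this produces the same two commands as the merge.text file shown above.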

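【Discussion】:

【Solution 11】:

On line 6, the sed command does not know where $number ends.

linefromnewwords=$(sed -n "${number}p" newwords.txt)

The braces in ${number}p mark where the variable name ends. The expression has to be in double quotes, because the shell does not expand variables inside single quotes.

The $number variable changes to "0+1", then "0+1+1", when it should change to "1", then "2".

Integer arithmetic in bash can be done with $(( )), which is better than eval (eval=evil).

number=$((number + 1))

In general, I would recommend using one file with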

s/ ni3 / nǐ /g
s/ nei3 / neǐ /g

and so on, one sed command per line, which is IMHO easier to maintain (sorted alphabetically, say), and using it with:

sed -f translate.sed input > output

So you can always compare the mappings easily.

s/\bni3\b/nǐ/g

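might be better suited than spaces as explicit delimiters, because \b (word boundary) also matches at the start and end of a line and next to punctuation characters.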
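
The difference between the two delimiter styles can be seen in a small Python sketch. Python's `re` module is used here only as a stand-in for sed, and the sample text is made up for the illustration.

```python
import re

text = "ni3, hao3 ni3\nni3"

# Space-delimited pattern: only matches a word with a space on both sides.
spaced = re.sub(r" ni3 ", " nǐ ", text)

# Word-boundary pattern: also matches at line start/end and next to punctuation.
bounded = re.sub(r"\bni3\b", "nǐ", text)

print(repr(spaced))
print(repr(bounded))
```

In this sample none of the three occurrences has a space on both sides, so the space-delimited pattern changes nothing, while the word-boundary pattern replaces all of them.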

【Discussion】:

【Solution 12】:

Why not

paste -d/ oldwords.txt newwords.txt |\
sed -e 's@/@ / @' -e 's@^@s/ @' -e 's@$@ /g@' >/tmp/$$.sed

sed -f /tmp/$$.sed original >changed

rm /tmp/$$.sed

?

【Discussion】: