Answers to: Shell script to find, search and replace array of strings in a file

【Question title】: Shell script to find, search and replace array of strings in a file
【Posted】: 2011-03-14 04:37:17
【Question description】:

This is related to another question/code golf I posted: Code golf: "Color highlighting" of repeated text.

I have a file "sample1.txt" with the following content:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

I have a script that generates the following array of strings that occur in the file (only a few are shown for illustration):

LoremIpsum
LoremIpsu
dummytext
oremIpsum
LoremIps
dummytex
industry
oremIpsu
remIpsum
ummytext
LoremIp
dummyte
emIpsum
industr
mmytext

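I need to check, starting from the top of the list, whether "LoremIpsum" occurs in the file sample1.txt. If it does, I want to replace every occurrence of LoremIpsum with <T1>LoremIpsum</T1>. When the program then moves on to the next word, "LoremIpsu", it should not match the text already inside <T1>LoremIpsum</T1> in sample1.txt. It should repeat this for every element of the "array". The next "valid" entry is "dummytext", which should be tagged as <T2>dummytext</T2>.

I do think it should be possible to write a bash shell script solution for this rather than relying on perl/python/ruby programs.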

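【Question discussion】:

This sounds like a job for sed, but the question isn't clear to me.
Hi Marco - does the T2 example help?
Why use a shell script? Why not use the tool best suited to the job? Perl was made for text processing with low programmer time.
I am running a shell script that generates the list you see above. I would love to stay with one framework rather than mixing-n-matching, but sure - I'll take a perl solution too... the perl program should accept the list from the script's output!

Tags: bash unix shell sed grep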

【Solution 1】:

Using plain Perl:

#! /usr/bin/perl

use warnings;
use strict;

my @words = qw/
  LoremIpsum
  LoremIpsu
  dummytext
  oremIpsum
  LoremIps
  dummytex
  industry
  oremIpsu
  remIpsum
  ummytext
  LoremIp
  dummyte
  emIpsum
  industr
  mmytext
/;

my $to_replace = qr/@{[ join "|" =>
                        sort { length $b <=> length $a }
                        @words
                     ]}/;

my $i = 0;
while (<>) {
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|eg;
  print;
}

Sample run (wrapped to prevent horizontal scrolling):

$ ./tag-words sample.txt
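<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>indus
try</T3>.<T4>LoremIpsum</T4>hasbeenthe<T5>industry</T5>'sstandard<T6>dummytext</T6>
eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatype
specimenbook.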

You might object that all the business with qr// and @{[ ... ]} is on the baroque side. The /o regex switch gets the same effect, as in:

# plain scalar rather than a compiled pattern
my $to_replace = join "|" =>
                 sort { length $b <=> length $a }
                 @words;

my $i = 0;
while (<>) {
  # o at the end for "compile (o)nce"
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|ego;
  print;
}
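
The sort in both Perl variants puts the longest candidates first, so that a full word such as LoremIpsum is tried before its shorter fragments. For a list that arrives unordered from a shell script, the same ordering can be sketched in bash as below. The helper name sort_longest_first is hypothetical, the words are assumed to contain no tabs or newlines, and `sort -s` (stable sort) is assumed to be available, as it is in GNU and BSD sort.

```shell
#!/usr/bin/env bash
# Sketch only: print the given words longest-first, one per line.
# sort_longest_first is a hypothetical helper name, not part of the answer above.
sort_longest_first() {
  local w
  for w in "$@"; do
    # Prefix each word with its length so sort can order on it.
    printf '%d\t%s\n' "${#w}" "$w"
  done |
    sort -s -k1,1nr |   # numeric, descending, stable for equal lengths
    cut -f2-            # drop the length column again
}

# Example call with a few of the words from the question.
sort_longest_first LoremIp industry dummytext LoremIpsum
```

The stable sort keeps words of equal length in their original relative order, which matters when the generating script has already ranked them.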

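【Discussion】:

Hi gbacon - hmm - the second replacement should be "T2", the third "T3"... FYI - I know it is a minor change to your code
@RubiCon10 Acknowledged! Thanks, and fixed!

【Solution 2】:

Pure Bash (no externals)

At the Bash command line: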

$ sample="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
$ # or: sample=$(<sample1.txt)
$ array=(
LoremIpsum
LoremIpsu
dummytext
...
)
$ tag=0; for entry in "${array[@]}"; do test="<[^>/]*>[^>]*$entry[^<]*</"; if [[ ! $sample =~ $test ]]; then ((tag++)); sample=${sample//${entry}/<T$tag>$entry</T$tag>}; fi; done; echo "Output:"; echo "$sample"
Output:
<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>industry</T3>.<T1>LoremIpsum</T1>hasbeenthe<T3>industry</T3>'sstandard<T2>dummytext</T2>eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.
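
For use inside a script, the same loop can be written as a function. This is a minimal sketch, not the answer author's code: the name tag_words is hypothetical, the words are assumed to be plain alphanumerics (they are interpolated into a regex), and, unlike the one-liner, a word that does not occur in the text at all is skipped without using up a tag number.

```shell
#!/usr/bin/env bash
# Sketch only: tag each listed word with its own <Tn>...</Tn> pair.
# tag_words is a hypothetical helper; usage: tag_words TEXT WORD...
tag_words() {
  local text=$1
  shift
  local tag=0 entry inside
  for entry in "$@"; do
    # Skip words that do not occur in the text at all.
    [[ $text == *"$entry"* ]] || continue
    # Skip words that occur inside an already tagged span.
    inside="<T[0-9]+>[^<]*${entry}[^<]*</T[0-9]+>"
    [[ $text =~ $inside ]] && continue
    tag=$((tag + 1))
    # Quoting the pattern makes bash match the word literally.
    text=${text//"$entry"/<T$tag>$entry</T$tag>}
  done
  printf '%s\n' "$text"
}

# Example call with a shortened text and word list.
tag_words "LoremIpsumissimplydummytext.LoremIpsum" LoremIpsum LoremIpsu dummytext
```

With the list in a file (a hypothetical words.txt, one word per line) it could be called as `mapfile -t words < words.txt; tag_words "$(<sample1.txt)" "${words[@]}"`, which needs bash 4 or later for mapfile. Like the one-liner, the sketch skips a word entirely if any occurrence already sits inside a tagged span, even when other occurrences are untagged.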

【Discussion】: