Using sed to find-and-replace in a text file using strings from another text file: answers

【Question title】: Using sed to find-and-replace in a text file using strings from another text file
【Posted】: 2016-02-05 11:31:02
【Question】:

I have two files, as follows. The first is sample.txt:

new haven co-op toronto on $1245
joe schmo co-op powell river bc $4444

The second is locations.txt:

toronto
powell river
on
bc

We want to use sed to produce a tagged sample-new.txt, adding a ; before and after each location, so that the final strings look like this:

new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444

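Is this possible using bash? The actual files are much longer (thousands of lines in each case), but as this is a one-off job we are not too concerned about processing time.

--- Edited to add ---

My original approach was this: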

cat locations.txt | xargs -i sed 's/{}/;/' sample.txt

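But that runs the script only once per pattern, which is the opposite of the approach you are proposing here.

【Question comments】:

Please include your attempt at solving this problem in your question. We can only help you with your code once we have seen it.
@ghoti Added my early, horrible approach. Thanks!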

Tags: bash sed

【Solution 1】:

Using awk:

awk 'NR==FNR{a[NR]=$0; next;} {for(i in a)gsub("\\<"a[i]"\\>",";"a[i]";"); print} '  locations.txt sample.txt

Using awk + sed:

sed -f <(awk '{print "s|\\<"$0"\\>|;"$0";|g"}' locations.txt) sample.txt

Or using pure sed:

sed -f <(sed 's/.*/s|\\<&\\>|\;&\;|g/' locations.txt) sample.txt

(Once you have shown your coding attempt, I will add an explanation of why this works.)
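As a rough illustration of the pure-sed variant above, it can help to look at the intermediate script that the inner sed generates before the outer sed runs it. The sketch below recreates the question's two files in a scratch directory (the `tmp` directory and the `script.sed` file name are only illustrative) and assumes GNU sed, since `\<` and `\>` are GNU extensions.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'toronto' 'powell river' 'on' 'bc' > "$tmp/locations.txt"
printf '%s\n' 'new haven co-op toronto on $1245' \
              'joe schmo co-op powell river bc $4444' > "$tmp/sample.txt"

# Step 1: turn every location into one sed substitution command.
# "&" stands for the whole matched line; "\\" emits a literal backslash.
sed 's/.*/s|\\<&\\>|;&;|g/' "$tmp/locations.txt" > "$tmp/script.sed"
cat "$tmp/script.sed"
# s|\<toronto\>|;toronto;|g
# s|\<powell river\>|;powell river;|g
# s|\<on\>|;on;|g
# s|\<bc\>|;bc;|g

# Step 2: run the generated script against the sample file (GNU sed assumed).
result=$(sed -f "$tmp/script.sed" "$tmp/sample.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

The word-boundary anchors are what stop `on` from matching inside `toronto`. Writing the generated script to a file instead of using process substitution is only for inspection; the one-liner above does the same thing in a single step.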

【Comments】:

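Nice stuff. Because it uses \< and \>, the Awk-only solution is GNU-specific; Mawk and BSD Awk do not support these. The Sed commands can be made to work with BSD Sed, which understands [[:<:]] and [[:>:]] in place of \< and \>, as follows:
sed -f <(awk '{print "s|[[:<:]]"$0"[[:>:]]|;"$0";|g"}' locations.txt) sample.txt
sed -f <(sed 's/.*/s|[[:<:]]&[[:>:]]|\;&\;|g/' locations.txt) sample.txt
OK, I was searching for a word-boundary assertion such as \b. I could not find it when I posted the answer...
Unfortunately, support for word-boundary assertions is a mess, certainly across platforms, and even across the utilities on a given platform. As far as I know, Mawk and BSD Awk have no word-boundary assertions at all. \b works in GNU Sed and GNU Grep (and even BSD Grep), but not in GNU Awk.
And, to be clear: I think these are all fine solutions; I only wanted to point out the limitations imposed by the runtime environment.
+1 for the caveat that this is GNU awk specific. Nice as these are, though, they fail if a location contains another location as a substring, e.g. "niagara on the lake". (I am not sure whether there are others.)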

【Solution 2】:

Just to complete your set of options, you can do this, slowly, in pure bash:

#!/usr/bin/env bash

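# Load the list of places (file t2), one per line, into an array
readarray -t places < t2

# Read the sample file (t1) line by line; -r keeps backslashes literal
while IFS= read -r line; do
  for place in "${places[@]}"; do
    # Tag the first occurrence of the place when it is surrounded by spaces
    line="${line/ $place / ;$place; }"
  done
  echo "$line"
done < t1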

Note that this may not work as expected if your list includes places that sit inside other places, such as "on" inside "niagara on the lake":

foo bar co-op ;niagara ;on; the lake; on $1
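The failure above can be reproduced with a minimal sketch of the same loop run on a single line. The order of the `places` array here is an assumption (the longer name listed before `on`); bash is required for the array and the `${var/pattern/replacement}` expansion.

```shell
# Places in an assumed order: the longer name comes before "on".
places=('toronto' 'powell river' 'niagara on the lake' 'on' 'bc')
line='foo bar co-op niagara on the lake on $1'

for place in "${places[@]}"; do
  # Replaces only the first space-delimited occurrence of each place.
  line="${line/ $place / ;$place; }"
done

echo "$line"
# foo bar co-op ;niagara ;on; the lake; on $1
```

Once "niagara on the lake" has been tagged, the first " on " in the line is the one inside the city name, so that is the one the later substitution hits, and the real province at the end is left untagged.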

Instead, you may want to do more targeted pattern matching, which is much easier in awk:

#!/usr/bin/awk -f

# Collect the location list into the index of an array
NR==FNR {
  places[$0]
  next
}

# Now step through the input file
{

  # Handle two-letter provinces
  if ($(NF-1) in places) {
      $(NF-1)=";" $(NF-1) ";"
  }

  # Step through the remaining places doing substitutions as we find matches
  for (place in places) {
    if (length(place)>2 && index($0,place)) {
      sub(place,";"place";")
    }
  }

}

# Print every line
1

This works for me using the data from your question:

$ cat places
toronto
powell river
niagara on the lake
on
bc
$ ./tst places input
new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444
foo nar co-op ;niagara on the lake; ;on; $1

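If your places file contains a two-letter entry that is not actually a province, you may run into problems. I am not sure whether any such place exists in Canada, but if it does, you will either need to adjust those lines by hand or make the script more complex by handling provinces separately from cities.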
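One way the "handle provinces separately" idea might look is sketched below. It is only a sketch under stated assumptions: the split into `provinces.txt` and `cities.txt` is hypothetical (the question has a single locations.txt), the province is assumed to be the second-to-last field, and cities are matched as literal strings with index() and substr() rather than with sub(), so regex metacharacters in a name are not interpreted.

```shell
tmp=$(mktemp -d)
# Hypothetical split of the location list into two files.
printf '%s\n' 'on' 'bc' > "$tmp/provinces.txt"
printf '%s\n' 'toronto' 'powell river' 'niagara on the lake' > "$tmp/cities.txt"
printf '%s\n' 'new haven co-op toronto on $1245' \
              'joe schmo co-op powell river bc $4444' \
              'foo bar co-op niagara on the lake on $1' > "$tmp/sample.txt"

tagged=$(awk '
  # First two files: collect provinces and cities as array indices.
  FILENAME == ARGV[1] { provinces[$0]; next }
  FILENAME == ARGV[2] { cities[$0]; next }
  {
    # Tag the province only where it is the second-to-last field.
    if (NF >= 2 && ($(NF-1) in provinces))
      $(NF-1) = ";" $(NF-1) ";"
    # Tag the first literal occurrence of each city.
    for (city in cities) {
      pos = index($0, city)
      if (pos)
        $0 = substr($0, 1, pos - 1) ";" city ";" substr($0, pos + length(city))
    }
    print
  }
' "$tmp/provinces.txt" "$tmp/cities.txt" "$tmp/sample.txt")

printf '%s\n' "$tagged"
rm -rf "$tmp"
```

Because a two-letter entry is only ever treated as a province in the second-to-last field, a two-letter city in `cities.txt` would no longer be confused with one. City names that are substrings of other words or of other cities would still need extra care.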

【Comments】: