awk remove endings of words ending with patterns

【Question title】: awk remove endings of words ending with patterns
【Posted】: 2021-12-22 15:07:51
【Question】:

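I have a large dataset and am trying to lemmatize one column ($14) with awk: I need to remove 'ing', 'ed' and 's' from a word if it ends with one of those patterns. So asking, asked and asks would all become just "ask".

Let's say I have this dataset (the column I want to modify is $2):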

onething        This is a string that is tested multiple times.
twoed   I wanted to remove words ending with many patterns.
threes  Reading books is good thing.

The expected output is:

onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.

I tried the following regex with awk, but it did not work:

awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt  

# this replaces some of the words ending with ing and ed, but not all; words ending with s stay the same (which I don't want)

Please help, I am new to awk and still exploring it.

【Question discussion】:

Tags: awk gsub

【Solution 1】:

Using GNU awk for gensub() and \> for word boundaries:

$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
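
For context on why the attempt in the question misbehaved, here is a small sketch. It is my own illustration, not part of the answer above, and the sample text and the function names `escaped_parens` and `bare_parens` are made up. In awk's extended regular expressions `\(` and `\)` match literal parentheses instead of forming a group, so grouping needs bare parentheses. Separately, the gawk manual documents `\y` as its word-boundary operator because `\b` already means backspace in awk.

```shell
# Escaped parentheses are literal: this pattern is the alternation
#   "(ing" (with a real opening parenthesis)  OR  "ed"
escaped_parens() { awk '{ gsub(/\(ing|ed/, "X"); print }'; }

# Bare parentheses group the alternatives: "ing" OR "ed"
bare_parens() { awk '{ gsub(/(ing|ed)/, "X"); print }'; }

echo 'asked (ing' | escaped_parens   # prints: askX X
echo 'asked (ing' | bare_parens      # prints: askX (X
```

The first call consumes the literal "(" along with "ing", which shows that the escaped form never groups anything.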

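【Discussion】:

Ah, thanks! Your code did not work for me, but I modified it a little and then it worked.
It does exactly what you asked for and, as you can see in my answer, produces the output you wanted from the input you provided. In what way did it not work for you, and in what way did you modify it?
Anyway, you're welcome; see stackoverflow.com/questions/tagged/awk for what to do next.
The only thing I changed is the BEGIN part: awk -F'\t' 'BEGIN {OFS="\t"}, which I think is not much different from yours. Thanks!!
All you did was change from the approach in my script that is right from a maintenance point of view (assigning both variables to a single value together) to the wrong one (assigning each variable to the same value separately); it makes no functional difference. If that is the only change you made and your version works, then the script I posted works too.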
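
To make the point of this exchange concrete, here is a minimal sketch (the sample input and the function names are mine, not from the thread). Assigning to a field makes awk rebuild the record using OFS, which defaults to a single space, so a tab-separated file needs OFS set to a tab as well as FS. The two ways of setting it that are discussed above produce the same output.

```shell
# Default OFS: modifying $2 rebuilds $0 with a single space between fields.
default_ofs() { awk -F'\t' '{ sub(/s$/, "", $2) } 1'; }

# FS and OFS assigned together, as in the BEGIN block of the answer.
joint_ofs() { awk 'BEGIN { FS = OFS = "\t" } { sub(/s$/, "", $2) } 1'; }

# -F plus a separate OFS assignment, as in the variant from the comment.
separate_ofs() { awk -F'\t' 'BEGIN { OFS = "\t" } { sub(/s$/, "", $2) } 1'; }

printf 'id1\ttwo words\n' | default_ofs    # the tab becomes a space
printf 'id1\ttwo words\n' | joint_ofs      # the tab is preserved
printf 'id1\ttwo words\n' | separate_ofs   # the tab is preserved
```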

【Solution 2】:

Using any awk and gsub you can do:

awk -F'\t' -v OFS="\t" '
    { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
      match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
    }1
' file

Example input file

$ cat file
onething        This is a string that is tested multiple times.
twoed   I wanted to remove words ending with many patterns.
threes  Reading books is good thing.
four    Just a normal sentence.

Example use/output

$ awk -F'\t' -v OFS="\t" '
>     { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
>       match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
>     }1
> ' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
four    Just a normal sentence.

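(Note: the last line was added as an example of a sentence that is left unchanged.)

【Discussion】:

【Solution 3】:

If you use GNU awk, you were not far from it: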

$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.

Note the -v OFS='\t' so that the tab is also used as the output field separator.
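
When an alternation such as ing|ed|s is anchored to a word end, grouping matters, because alternation has the lowest precedence in a regular expression. The sketch below is my own illustration with a made-up input word and function names. It uses the `$` anchor instead of `\>` so that it runs in any awk, but the same precedence rule applies to `\>`.

```shell
# Ungrouped: "$" binds only to the last alternative, so "ing" and "ed"
# are removed wherever they occur in the record.
ungrouped() { awk '{ gsub(/ing|ed|s$/, ""); print }'; }

# Grouped: every alternative has to sit at the end of the record.
grouped() { awk '{ gsub(/(ing|ed|s)$/, ""); print }'; }

echo 'editing' | ungrouped   # prints: it
echo 'editing' | grouped     # prints: edit
```

The sample file in this thread happens to contain no word with "ing" or "ed" in the middle, so both forms give the same result there.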

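However, if your awk uses the older kind of regular expressions without word boundaries (for example the default awk that ships with macOS), things get a bit more complicated. One option is to iterate with match and substr. Example: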

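# foo.awk
# "prefix" is the comma-separated list of endings to strip, passed with -v
BEGIN {
  n = split(prefix, word, /,/)
  for(i = 1; i <= n; i++) {
    len[i] = length(word[i])
  }
}
{
  for(i = 1; i <= n; i++) {
    # the ending must be followed by a non-alphanumeric character,
    # so an ending at the very end of $2 is not matched
    re = word[i] "[^[:alnum:]]"
    while(m = match($2, re)) {
      if(m == 1) {
        $2 = substr($2, len[i]+1, length($2))
      } else {
        # keep the text before the ending, then everything after it
        $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
      }
    }
  }
  print
}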

Then:

$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething    Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
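
Another way to avoid word-boundary operators, sketched here as an alternative and not as what this answer does, is to split the field into words and strip the ending of each word with the `$` anchor. The function name `strip_endings` is hypothetical, and the sketch assumes that words are separated by blanks and that any punctuation is attached to the end of a word. Unlike the scripts above it collapses runs of blanks inside $2 into single spaces.

```shell
strip_endings() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($2, w, " ")          # split the second field into words
    out = ""
    for (i = 1; i <= n; i++) {
      punct = ""
      # peel off trailing non-alphanumeric characters such as "."
      if (match(w[i], /[^a-zA-Z0-9]+$/)) {
        punct = substr(w[i], RSTART)
        w[i] = substr(w[i], 1, RSTART - 1)
      }
      sub(/(ing|ed|s)$/, "", w[i])  # strip one ending per word
      out = out (i > 1 ? " " : "") w[i] punct
    }
    $2 = out
    print
  }'
}

printf 'threes\tReading books is good thing.\n' | strip_endings
# prints: threes<TAB>Read book i good th.
```

Because each word is handled once with sub(), only one ending is removed per word, whereas the while loop in foo.awk keeps stripping as long as the pattern still matches.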

【Discussion】: