【Question title】: How can I remove stop words from a sentence using a shell script? [duplicate]
【Posted】: 2021-03-27 14:45:58
【Question】:

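I am trying to remove stop words from a sentence in a file.

By stop words I mean:
[I, a, an, as, at, the, by, in, for, of, on, that]

I have this sentence in the file my_text.txt:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

I want to remove the stop words from the sentence above.

I used this script: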

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
cat $p  | sed -e 's/\<$i\>//g' 
done < my_text.txt

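But the output is:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

The expected output should be:

One primary goals design Unix system was to create environment promoted efficient program

Note: I want to remove stop words, not duplicate words.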

【Question discussion】:

Tags: bash shell sed tr

【Solution 1】:

Like this, assuming $p is an existing file:

 sed -i -e "s/\<$i\>//g" "$p"

You must use double quotes, not single quotes, so that the shell expands the variable.

The -i switch edits the file in place.
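The difference between the two quoting styles, and the effect of -i, can be seen in a small experiment. This is only a sketch: it assumes GNU sed (for `\<`, `\>` and a suffix-less `-i`), and the temporary file and sample sentence are made up for illustration.

```shell
i=the                                  # the stop word to delete
p=$(mktemp)                            # hypothetical stand-in for the file named by $p
printf 'the cat sat on the mat\n' > "$p"

# Single quotes: sed receives the literal text $i, so nothing matches.
single=$(sed -e 's/\<$i\>//g' "$p")

# Double quotes: the shell expands $i to "the" before sed runs.
double=$(sed -e "s/\<$i\>//g" "$p")

printf '[%s]\n[%s]\n' "$single" "$double"

# Neither command above touched the file. With -i, sed rewrites the file
# and prints nothing to the terminal.
sed -i -e "s/\<$i\>//g" "$p"
in_place=$(cat "$p")
rm -f "$p"
```

The first bracketed line is the unchanged sentence; the second has both occurrences of "the" removed, with the surrounding spaces left behind.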

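Learn how to quote properly in the shell; it is very important:

"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See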
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
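As a rough illustration of the quoting rule above, the following sketch counts how many arguments an expansion produces with and without double quotes, and shows single quotes keeping a `$` literal. The variable names are hypothetical.

```shell
var='a   b'            # a value containing spaces

set -- $var            # unquoted: word splitting yields two arguments
unquoted_count=$#

set -- "$var"          # quoted: one argument, inner spaces preserved
quoted_count=$#

price='Costs $5 US'    # single quotes: $5 is not expanded

printf '%s %s %s\n' "$unquoted_count" "$quoted_count" "$price"
```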

Finally

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
    sed -i -e "s/\<$i\>\s*//g" Input_File 
done

Bonus

Try it without \s* to understand why I added this regex.
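For readers who want to see the effect without editing a file, here is a minimal comparison. It assumes GNU sed, since `\<`, `\>` and `\s` are GNU extensions, and the sample line is made up.

```shell
line='One of the primary goals'

# Without \s*: the spaces on both sides of the removed word stay behind.
without=$(printf '%s\n' "$line" | sed -e 's/\<the\>//g')

# With \s*: the whitespace after the word is removed together with it.
with=$(printf '%s\n' "$line" | sed -e 's/\<the\>\s*//g')

printf '[%s]\n[%s]\n' "$without" "$with"
```

The first result contains a double space where "the" used to be; the second does not.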

【Discussion】:

When I use the script above, no output is shown. Are you sure the script is correct? array=( I a an as at the by in for of on that ) for i in "${array[@]}" do sed -i -e "s/\<$i\>//g" my_text.txt done prints nothing; all I get is: > ~/project$ ./remove.sh > ~/project$
Yes, the file itself is modified: cat my_text.txt

【Solution 2】:

One in awk. It works as a proof of concept, but it would need proper punctuation handling and then some (fortunately, your data has none):

$ awk '
NR==FNR {                         # first file: process the stop words
    split($0,a,/,/)               # comma-separated, without spaces
    for(i in a)                   # store each word as a key of hash b
        b[a[i]]
    next
}
{                                 # second file: the text
    for(i=1;i<=NF;i++)            # iterate over the words
        if(!($i in b))            # if the current word is not a stop word
            printf "%s%s",$i,OFS  # output it (leaves a trailing space, sorry)
    print ""                      # finish the line with a newline
}' words text

Output:

One primary goals design Unix system was to create environment promoted efficient program

Why awk? The shell is a tool for managing files and starting programs. Everything else is better handled elsewhere.
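To try the awk approach end to end, the two input files have to exist in the expected format: `words` holds the stop words on one line, comma-separated without spaces, and `text` holds the sentence. The sketch below creates both in a scratch directory and runs a slightly different awk program that builds each line in a variable so no trailing space is left. The directory, the variable names and the sample sentence are assumptions for illustration.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory
printf 'I,a,an,as,at,the,by,in,for,of,on,that\n' > "$workdir/words"
printf '%s\n' 'One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program' > "$workdir/text"

filtered=$(awk '
NR==FNR {                              # first file: load the stop words
    n = split($0, a, /,/)
    for (i = 1; i <= n; i++) stop[a[i]]
    next
}
{                                      # second file: filter each line
    out = ""
    for (i = 1; i <= NF; i++)
        if (!($i in stop))
            out = out (out == "" ? "" : OFS) $i
    print out
}' "$workdir/words" "$workdir/text")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Like the answer's version, this treats punctuation as part of a word, so "the," would not be recognized as a stop word.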

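【Discussion】:

【Solution 3】:

I also really like using awk for text processing. Assuming the input data is in the file mytext.txt and script is the file containing the code below, just run awk -f script mytext.txt.

Keeping the list in the stopwords variable should also make it easier to change the stop words when needed. Keep in mind that both mytext.txt and stopwords may contain only space-separated words.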

BEGIN {
stopwords = "I a an as at the by in for of on that"
split(stopwords, wordarray)
ORS = " "
RS = " "
}

{
equals = 0
for (w in wordarray)
  if ($0 == wordarray[w])
    equals = 1
if (equals == 0) print $0
}
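One way to change the stop words without editing the script is to pass them with awk's `-v` option. The sketch below is a variation on the script above, not the answer's own code: the `if (stopwords == "")` guard is an addition so that a list given with `-v` is not overwritten, and the scratch directory and sample sentence are hypothetical.

```shell
dir=$(mktemp -d)       # hypothetical scratch directory
printf '%s\n' 'One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program' > "$dir/mytext.txt"

cat > "$dir/script" <<'EOF'
BEGIN {
    # keep a list passed with -v; otherwise fall back to the default
    if (stopwords == "") stopwords = "I a an as at the by in for of on that"
    split(stopwords, wordarray)
    ORS = " "
    RS = " "
}
{
    equals = 0
    for (w in wordarray)
        if ($0 == wordarray[w])
            equals = 1
    if (equals == 0) print $0
}
EOF

# -f names the program file; the data file comes last
default_out=$(awk -f "$dir/script" "$dir/mytext.txt")
custom_out=$(awk -v stopwords="was to" -f "$dir/script" "$dir/mytext.txt")

printf '%s\n%s\n' "$default_out" "$custom_out"
rm -rf "$dir"
```

Because RS is a single space, the last word of each line keeps its newline, so a stop word at the very end of a line would not be matched by this approach.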

【Discussion】:

【Solution 4】:

You can use this script:

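while IFS= read -r p
do
  # one sed call with an -e expression per stop word
  printf '%s\n' "$p" | sed -e 's/\<I\>//g' -e 's/\<an\>//g' -e 's/\<a\>//g' -e 's/\<as\>//g' -e 's/\<at\>//g' -e 's/\<the\>//g' -e 's/\<by\>//g' -e 's/\<in\>//g' -e 's/\<for\>//g' -e 's/\<of\>//g' -e 's/\<on\>//g' -e 's/\<that\>//g'
# write to a temporary file, then replace the original, because the loop is still reading my_text.txt
done < my_text.txt > my_text.tmp && mv my_text.tmp my_text.txt

cat my_text.txt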

The output should then be:

One primary goals design Unix system was to create environment promoted efficient program
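The long chain of substitutions can also be generated from the word list, so that adding a stop word does not mean editing the command. The following is a sketch under a few assumptions: GNU sed for `\<` and `\>`, a hypothetical function name `remove_stop_words`, and a made-up scratch file. It prints the result instead of modifying the file.

```shell
# usage: remove_stop_words FILE WORD...
remove_stop_words() {
    file=$1
    shift
    script=
    for word in "$@"; do
        # one substitution per word; also drop the whitespace that follows it
        script="${script}s/\\<${word}\\>[[:space:]]*//g;"
    done
    sed -e "$script" "$file"
}

sample=$(mktemp)       # hypothetical input file
printf '%s\n' 'One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program' > "$sample"

result=$(remove_stop_words "$sample" I a an as at the by in for of on that)
printf '%s\n' "$result"
rm -f "$sample"
```

The words are inserted into the sed program as-is, so this only suits plain alphabetic stop words; a word containing `/` or regex metacharacters would need escaping first.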

【Discussion】: