[Posted]: 2021-01-31 12:35:28
[Question]:
I have the following C# code to remove stop words from a string:
public static string RemoveStopWords(string Parameter)
{
Parameter = Regex.Replace(Parameter, @"(?<=(\A|\s|\.|,|!|\?))($|_|0|1|2|3|4|5|6|7|8|9|A|about|after|all|also|an|and|another|any|are|as|at|B|be|because|been|before|being|between|both|but|by|C|came|can|come|could|D|did|do|does|E|each|else|F|for|from|G|get|got|H|had|has|have|he|her|here|him|himself|his|how|I|if|in|into|is|it|its|J|just|K|L|like|M|make|many|me|might|more|most|much|must|my|N|never|no|not|now|O|of|on|only|or|other|our|out|over|P|Q|R|re|S|said|same|see|should|since|so|some|still|such|T|take|than|that|the|their|them|then|there|these|they|this|those|through|to|too|U|under|up|use|V|very|W|want|was|way|we|well|were|what|when|where|which|while|who|will|with|would|X|Y|you|your|Z)(?=(\s|\z|,|!|\?))([^.])", " ", RegexOptions.IgnoreCase);
return Parameter.Trim();
}
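But when I run it, it only works when the stop word is not at the end of the string. For example:
"about this book" outputs "book"
"manager only" outputs "manager only"
"only manager" outputs "manager"
Can anyone point me in the right direction?
[Comments]: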
-
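([^.]) I think that part at the end may be your problem. What happens with the input "manager only "? (note the trailing space) -
When there is a trailing space, e.g. "manager only ", it is replaced with "manager".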
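The behavior reported above can be reproduced with a shortened version of the question's pattern. This is only a sketch: the three-word stop list and the class name TrailingCharRepro are assumptions for illustration, not part of the original code. It shows that the final ([^.]) must consume one more character after the stop word, so a stop word at the very end of the string is never matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingCharRepro
{
    // Shortened form of the question's pattern. The stop-word list
    // (about|only|this) is illustrative only; the trailing ([^.]) is kept
    // exactly as in the question.
    private const string Pattern =
        @"(?<=(\A|\s|\.|,|!|\?))(about|only|this)(?=(\s|\z|,|!|\?))([^.])";

    public static string RemoveStopWords(string input)
    {
        return Regex.Replace(input, Pattern, " ", RegexOptions.IgnoreCase).Trim();
    }

    public static void Main()
    {
        // No character follows "only", so ([^.]) cannot match and nothing is replaced.
        Console.WriteLine(RemoveStopWords("manager only"));   // manager only
        // The trailing space gives ([^.]) something to consume.
        Console.WriteLine(RemoveStopWords("manager only "));  // manager
        Console.WriteLine(RemoveStopWords("only manager"));   // manager
    }
}
```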
-
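The character class [^.] at the end expects a single character to be present. But you use a positive lookahead to assert that what is directly to the right is a comma, !, ?, a whitespace character, or the end of the string. So the ([^.]) part can only match what the lookahead has already asserted, and at the end of the string there is no character left for it to match. One option is to omit the lookahead and match those characters directly; the example below instead keeps the lookahead and drops ([^.]). You can also shorten the pattern a bit by using character classes instead of | to group all the single-character alternatives. -
For example: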
(?<=(?:\A|[\s.,!?]))(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=(?:\s|\z|[,!?])) -
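A minimal sketch of how the suggestion could be applied in C#, assuming a much shorter stop-word list than the one in the thread. The class name StopWordDemo and the four-word list are hypothetical; the structure (lookbehind, alternation, lookahead, no trailing ([^.])) follows the pattern in the comment above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StopWordDemo
{
    // Illustrative stop-word list only (an assumption); substitute the full
    // alternation from the comment above for real use.
    // The lookahead alone checks what follows, so the pattern also matches
    // a stop word at the very end of the string.
    private const string Pattern =
        @"(?<=\A|[\s.,!?])(?:about|only|this|the)(?=\s|\z|[,!?])";

    public static string RemoveStopWords(string input)
    {
        // Replace each stop word with a space, then trim the ends.
        string result = Regex.Replace(input, Pattern, " ", RegexOptions.IgnoreCase);
        return result.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveStopWords("about this book")); // book
        Console.WriteLine(RemoveStopWords("manager only"));    // manager
        Console.WriteLine(RemoveStopWords("only manager"));    // manager
    }
}
```

Because each match is replaced with a space, a stop word in the middle of a sentence leaves extra spaces between the remaining words; Trim() only removes whitespace at the two ends.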
Thanks a lot @the-fourth-bird, it works fine :)
Tags: c# .net regex asp.net-core .net-core