Get sentences containing a keyword from a paragraph

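【Question title】: Get sentences containing a keyword from paragraph
【Posted】: 2016-09-18 14:12:41
【Question description】:

I need to extract from a paragraph the sentences that contain the word island or Island. Every sentence starts with a capital letter and ends with a period.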

The paragraph is a string:

" The islands were settled from the second century AD by a series of local empires. In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826. During World War II, Singapore was occupied by Japan. It gained independence from Britain in 1963, by uniting with other former British territories to form Malaysia, but was expelled two years later over ideological differences. After early years of turbulence, and despite lacking natural resources and a hinterland, the nation developed rapidly as an Asian Tiger economy, based on external trade and its human capital. " (Source: https://en.wikipedia.org/wiki/Singapore)

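The desired result, as array elements:

The islands were settled from the second century AD by a series of local empires.
In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826.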

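I found examples for other languages, such as Java (Regex to find sentence containing specific word (java) from paragraph), but the same regex does not work in Ruby.

Can this be done in Ruby?

【Question comments】:

Tags: ruby regex

【Solution 1】:

This solution produces the correct result for the sample text.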

text = " The islands were settled from the second century AD by a series of local empires. In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826. During World War II, Singapore was occupied by Japan. It gained independence from Britain in 1963, by uniting with other former British territories to form Malaysia, but was expelled two years later over ideological differences. After early years of turbulence, and despite lacking natural resources and a hinterland, the nation developed rapidly as an Asian Tiger economy, based on external trade and its human capital."

matches = text.scan(/\b[A-Z][^.]+[Ii]sland[^.]+?\./)

matches.each do |match|
  puts "Found: #{match}"
end

This produces the following output:

Found: The islands were settled from the second century AD by a series of local empires.
Found: In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826.

Following the linked answer, support for other sentence terminators such as "!" and "?" can be added with a small change:

matches = text.scan(/\b[A-Z][^.!?]+[Ii]sland[^.!?]+?[.!?]/)
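Both patterns above require at least one character on either side of the keyword, so a sentence that begins with "Islands" or ends in "island." is not matched, and `[Ii]sland` also matches inside "islander". The following is a minimal sketch of one possible variant, not part of the answer: the method name `sentences_with` and the sample string are made up for illustration, and the optional plural "s" is an assumption about what counts as the keyword.

```ruby
# Hypothetical helper (not from the answer): returns the sentences of +text+
# that contain +keyword+ as a whole word, optionally followed by a plural "s".
def sentences_with(text, keyword)
  word = Regexp.escape(keyword)
  # \b(?=[A-Z])  start at a word boundary followed by a capital letter
  # [^.!?]*?     as few non-terminator characters as possible (may be none)
  # (?i:...)     the keyword, case-insensitive, between word boundaries
  # [^.!?]*      the rest of the sentence, then one terminator
  pattern = /\b(?=[A-Z])[^.!?]*?\b(?i:#{word})s?\b[^.!?]*[.!?]/
  text.scan(pattern)
end

sample = "Islands are quiet. Every islander fishes. Is this an island? We left."
p sentences_with(sample, "island")
# => ["Islands are quiet.", "Is this an island?"]
```

The inline `(?i:...)` group keeps case-insensitivity local to the keyword, so `[A-Z]` still demands a real capital letter at the start of the match.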

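【Comments】:

【Solution 2】:

I suggest using two regular expressions: one to split the string into sentences, and another to select the sentences that contain the word "island" or "islands", with the first letter possibly capitalized.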

str.split(/(?<=\.)\s+/).select { |s| s =~ /\b[iI]slands?\b/ }
  #=> ["The islands were settled from the second century AD by a series of local empires.",
  #    "In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of
  #     the East India Company; after the company collapsed, the islands were ceded to
  #     Britain and became part of its Straits Settlements in 1826."] *

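/(?<=\.)\s+/ matches one or more whitespace characters preceded by a period. The period sits in a positive lookbehind, so it is not consumed by the split and stays with its sentence.
/\b[iI]slands?\b/ matches the strings "island", "Island", "islands" and "Islands" when they are surrounded by word boundaries (so that "islander", for example, is not matched).

* I added two line breaks to the output above to make it more readable.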
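The split-then-select idea can be wrapped in a small method, sketched below under a few assumptions: the name `keyword_sentences` is hypothetical, the paragraph is stripped first because the string in the question starts with a space, and splitting after "!" and "?" as well as "." goes beyond what the answer does. `String#match?` needs Ruby 2.4 or later.

```ruby
# Hypothetical helper; the split/select logic follows the answer above.
def keyword_sentences(paragraph, keyword)
  # Whole-word match, case-insensitive, with an optional plural "s".
  word = /\b#{Regexp.escape(keyword)}s?\b/i
  # Break after ".", "!" or "?" when it is followed by whitespace.
  sentences = paragraph.strip.split(/(?<=[.!?])\s+/)
  sentences.select { |sentence| sentence.match?(word) }
end

paragraph = " The islands were settled early. Singapore was occupied by Japan. " \
            "Every islander left! Is the Island large? "
p keyword_sentences(paragraph, "island")
# => ["The islands were settled early.", "Is the Island large?"]
```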

【Comments】:

【Solution 3】:

Yes. Going by your description, the most straightforward way is probably:

string.scan(/(?=[A-Z])[^.]*island[^.]*\./i)
# => [
#   "The islands were settled from the second century AD by a series of local empires.",
#   "In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826."
# ]
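One detail that may be worth knowing: the `/i` flag applies to the whole pattern, so `(?=[A-Z])` also accepts a lower-case letter. That is harmless for the question's text, where every sentence starts with a capital. The sketch below compares the answer's pattern with a variant that scopes the flag to the keyword; the method names and the sample string are made up for illustration.

```ruby
# Pattern from the answer: /i covers [A-Z] as well as "island".
def scan_whole_flag(text)
  text.scan(/(?=[A-Z])[^.]*island[^.]*\./i)
end

# Variant (an assumption, not the answer's code): (?i:...) limits
# case-insensitivity to the keyword, so [A-Z] needs a real capital.
def scan_scoped_flag(text)
  text.scan(/(?=[A-Z])[^.]*(?i:island)[^.]*\./)
end

text = "the islands are far. We reached the Island."
p scan_whole_flag(text)
# => ["the islands are far.", "We reached the Island."]
p scan_scoped_flag(text)
# => ["We reached the Island."]
```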

【Comments】:

【Solution 4】:

You can use this regex:

(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))

Rubular Demo

Ruby code:

print str.scan(/(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))/)

Ideone Demo
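Two properties of this pattern are easy to miss. Because it contains a capture group, `scan` returns an array of one-element arrays, and each captured sentence keeps the whitespace that follows the previous terminator. Also, `.` matches a period, so a sentence without the keyword can be absorbed into the next sentence that has one; this does not show up with the question's text, where the two matching sentences come first. The sketch below illustrates both points; the method names, the sample string and the stricter variant are assumptions, not part of the answer.

```ruby
# The answer's pattern, unchanged.
def scan_answer(str)
  str.scan(/(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))/)
end

# Variant (an assumption): no terminators allowed inside a sentence,
# no capture group, and surrounding whitespace stripped.
def scan_strict(str)
  str.scan(/(?<=^|[.?!])[^.?!]*[Ii]sland[^.?!]*(?:[.?!]|$)/).map(&:strip)
end

str = "The islands were settled early. Singapore was occupied by Japan. Is the Island large?"
p scan_answer(str)
# => [["The islands were settled early."],
#     [" Singapore was occupied by Japan. Is the Island large?"]]
p scan_strict(str)
# => ["The islands were settled early.", "Is the Island large?"]
```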

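【Comments】:

【Solution 5】:

I would probably avoid regular expressions here. They are hard to read and understand when you come back to the code later. Simply splitting the text into sentences and then selecting by keyword should do: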

input.split('.').select do |sentence|
  sentence.downcase.include?('island')
end

Of course, the paragraph may contain other '.' characters that do not separate sentences.
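Besides that caveat, `split('.')` drops the periods and `include?('island')` also accepts words such as "islander". A regex-free sketch that handles both is shown below; the method name `island_sentences`, the keyword list and the sample string are assumptions made for illustration.

```ruby
# Hypothetical helper: a regex-free variant of the snippet above.
def island_sentences(input)
  keywords = %w[island islands]
  # Split on periods, trim whitespace and drop empty leftovers.
  sentences = input.split('.').map(&:strip).reject(&:empty?)
  matching = sentences.select do |sentence|
    # Compare whole words, ignoring case and attached punctuation.
    sentence.downcase.split.any? { |word| keywords.include?(word.delete('^a-z')) }
  end
  # Put back the period that split removed.
  matching.map { |sentence| "#{sentence}." }
end

input = " The islands were settled early. Every islander left. We saw the Island, then left. "
p island_sentences(input)
# => ["The islands were settled early.", "We saw the Island, then left."]
```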

【Comments】:

This removes the period at the end of each sentence.
True :-) It should be easy to add them back, though.
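What if a sentence contains the word "islander"? "I would probably do without Ruby. It is hard to read and understand when you come back to the code later." No? Readability in any language, regular expressions included, depends on how proficient you are in it.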