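【Question title】: Text summarization in R language
【Posted】: 2016-07-04 01:27:12
【Question】:

I have a very long text file, and I would like to use the R language to summarize it in at least 10 to 20 lines or short sentences. How can I summarize a text in at least 10 lines with R?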

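【Question comments】:

  • Please post some sample data and show your desired output.
  • Did a text-mining class start today? There is a bunch of new peeps asking awful questions like this.

Tags: r text text-mining summarization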


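【Solution 1】:

There is a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough of how to use it, but here is a quick example so you can test it yourself in R: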

#load needed packages
library(xml2)
library(rvest)
library(lexRankr)

#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"

#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))

#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)

#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]

> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."                                
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."  
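
The question is about a long local text file and a 10 to 20 sentence summary rather than a web page. The following is a minimal sketch of that case, assuming the lexRankr package is installed. The helper names `order_by_appearance` and `summarize_file`, the file path, and the default of 10 sentences are illustrative and not part of the answer above.

```r
# Hypothetical helper: reorder lexRank() output by position in the text.
# Assumes sentenceId has the form "<docId>_<sentenceNumber>", as in the
# example above; only the part after the last underscore is used.
order_by_appearance <- function(top) {
  sent_num <- as.integer(sub(".*_", "", top$sentenceId))
  top[order(sent_num), "sentence"]
}

# Hypothetical wrapper: summarize a plain-text file in about n sentences.
summarize_file <- function(path, n = 10) {
  # read the file and collapse it into one string, so that sentences
  # wrapped across several lines are not cut apart
  lines <- readLines(path, warn = FALSE)
  text  <- paste(trimws(lines), collapse = " ")
  top <- lexRankr::lexRank(text,
                           # single document, so a single docId
                           docId = 1,
                           n = n,
                           continuous = TRUE)
  order_by_appearance(top)
}

# Usage (the path is a placeholder):
# summarize_file("my_long_text.txt", n = 10)
```

Sorting on the sentence number alone is enough here because the whole file is treated as one document; with several documents the docId part of the id would have to be taken into account as well.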

【Comments】:

【Solution 2】:

You can try this (from the LSAfun package):

    genericSummary(D,k=1)
    

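Here "D" specifies your text document and "k" specifies the number of sentences to use in the summary. (Further options are described in the package documentation.)

More information: http://search.r-project.org/library/LSAfun/html/genericSummary.html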
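
As a sketch of how `D` might be prepared from a file: `readLines()` returns one element per line, so the lines are collapsed into a single string before being passed to `genericSummary()`. This assumes that the LSAfun package is installed and that the function expects the document as one string, which is worth checking against the package documentation. The helper name `read_document` and the sample sentences are illustrative only.

```r
# Hypothetical helper: read a text file into one single string.
read_document <- function(path) {
  lines <- readLines(path, warn = FALSE)
  paste(trimws(lines), collapse = " ")
}

# Small self-contained demo using a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("This is the first sentence of a small test document.",
             "It only exists to show how the call is made.",
             "Do not expect the resulting summary to be meaningful.",
             "A real input would be a much longer text file.",
             "For such a file a larger value of k makes sense."), tmp)
D <- read_document(tmp)

# Only call the package if it is available
if (requireNamespace("LSAfun", quietly = TRUE)) {
  # k = number of sentences to keep in the summary
  print(LSAfun::genericSummary(D, k = 1))
}
```

For the 10 to 20 sentence summary asked about in the question, `k` would be set to a value in that range, for example `k = 10`.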

【Comments】:
