How to extract bigrams and trigrams in Scala?

【Question title】: How to extract bigrams and trigrams in Scala?
【Posted】: 2015-10-08 17:41:32
【Question description】:

Suppose these are my documents:

very pleased product . phone lightweight comfortable sound quality good house yard . 
quality construction phone base unit good . ample supply cable adapter . plug computer soundcard .
shop unit mail rebate . unit battery pack hold play time strap carr headphone adapter cable perfect digital copy optical. component micro plug stereo connector cable micro plug rca cable . 
unit primarily record guitar jam session . input plug provide power plug microphone . decent stereo mic need digital recording performance . mono mode double recording time .
admit like new electronic toy . digital camera not impress .

I want to extract all bigrams and trigrams, together with their occurrence counts, from every sentence in every document.

I tried:

case class trigram(first: String, second: String,third: String) {
  def mkReplacement(s: String) = s.replaceAll(first + " " + second + " " + third, first + "-" + second + "-" + third)
}

def stringToTrigrams(s: String) = {
  val words = s.split(".")
  if (words.size >= 3) {
    words.sliding(3).map(a => trigram(a(0),a(1),a(2)))
  }
  else
    Iterator[trigram]()
}

val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.textFile("docs")

val trigrams = data.flatMap {
  stringToTrigrams
}.collect()

val trigramCounts = trigrams.groupBy(identity).mapValues(_.size)

But it doesn't show any trigrams?

【Question discussion】:

While it's nice to see my code being reused, an acknowledgement would have been nice (stackoverflow.com/a/30681833/21755)

Tags: scala n-gram

【Solution 1】:

def stringToTrigrams(s: String) = {
  val words = s.split(".")
  if (words.size >= 3) {
    words.sliding(3).map(a => trigram(a(0),a(1),a(2)))
  } else Iterator[trigram]()
}

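IIUC, this function takes a whole document like the ones above and splits it on ".". That is your first problem: calling split(".") does not do what you think it does. String.split takes a regular expression, and in a regex "." is a wildcard that matches any character, so you are splitting on every character rather than on the literal period you wanted. The result is an empty array, which is why no trigrams come out. Change the argument to "\\." and the document will be split into sentences.

Once that is done, each sentence still has to be split into words. I suggest splitting on any run of whitespace with _.split("\\s+"). You should then be able to parse the words and build trigrams with a function like this: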
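To make the difference concrete, here is a minimal sketch comparing the two split calls on a shortened version of the first sample document. The object name SplitDemo and the sample string are only illustrative and are not part of the original code.

```scala
object SplitDemo {
  // Shortened sample input (an assumption made for illustration).
  val doc = "very pleased product . phone lightweight ."

  // "." is a regex wildcard, so every character is treated as a delimiter.
  // All resulting pieces are empty and get dropped, leaving an empty array.
  val wrong: Array[String] = doc.split(".")

  // "\\." matches a literal period, giving one element per sentence.
  val sentences: Array[String] = doc.split("\\.").map(_.trim)
}
```

Here SplitDemo.wrong has no elements at all, while SplitDemo.sentences holds the two sentences with surrounding whitespace trimmed.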

def stringToTrigrams(s: String) = {
  // split takes a regex, so the period has to be escaped
  val sentences = s.split("\\.")
  sentences flatMap { sent =>
    // drop the empty token produced by leading whitespace
    val words = sent.split("\\s+").filter(_ != "")
    if (words.length >= 3)
      words.sliding(3).map(a => trigram(a(0), a(1), a(2)))
    else Iterator[trigram]()
  }
}
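The question also asks for bigrams and for occurrence counts, which the function above does not cover. One possible way to generalize it is sketched below in plain Scala without Spark. The names NGrams, sentences, ngrams and countNGrams are hypothetical helpers introduced for this sketch, and n-grams are represented as List[String] rather than the trigram case class.

```scala
object NGrams {
  // Split a document into sentences on literal periods, then split each
  // sentence into words on whitespace, dropping empty tokens and sentences.
  def sentences(doc: String): List[List[String]] =
    doc.split("\\.").toList
      .map(_.trim.split("\\s+").filter(_.nonEmpty).toList)
      .filter(_.nonEmpty)

  // All n-grams of one sentence; sentences shorter than n yield nothing.
  def ngrams(words: List[String], n: Int): List[List[String]] =
    if (words.length >= n) words.sliding(n).toList else Nil

  // Count every n-gram across all sentences of a document.
  def countNGrams(doc: String, n: Int): Map[List[String], Int] =
    sentences(doc)
      .flatMap(ngrams(_, n))
      .groupBy(identity)
      .map { case (gram, occurrences) => gram -> occurrences.size }
}
```

For example, NGrams.countNGrams("a b c . a b .", 2) counts the bigram List("a", "b") twice and List("b", "c") once, because n-grams never cross a sentence boundary. With Spark, the same helpers could presumably be applied per line inside flatMap and counted with map(_ -> 1).reduceByKey(_ + _) instead of collecting first, but that variant is not tested here.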

Hope this helps.

【Discussion】: