【Question Title】: Spark 1.5 MLlib LDA - getting topic distributions for new documents
【Posted】: 2016-09-03 23:36:33
【Problem Description】:

Not a duplicate of this, since I'm asking what the input should be, not which function to call; see below.

I followed this guide to create an LDA model in Spark 1.5. I saw in this question that, to get topic distributions for new documents, I need to use the topicDistributions function of LocalLDAModel, which takes an RDD[(Long, Vector)].

Should the new document vector be a term count vector? That is the vector type LDA is trained on. My code compiles and runs, but I'd like to know whether this is the intended use of the topicDistributions function:

import org.apache.spark.rdd._
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable

val input = Seq("this is a document","this could be another document","these are training, not tests", "here is the final file (document)")
val corpus: RDD[Array[String]] = sc.parallelize(input.map{ 
  doc => doc.split("\\s")
})

// Count term frequencies across the corpus, sorted most-frequent first
val termCounts: Array[(String, Long)] = corpus.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)

// takeRight(termCounts.size) keeps the whole vocabulary; takeRight(termCounts.size - n) would drop the n most frequent terms
val vocabArray: Array[String] = termCounts.takeRight(termCounts.size).map(_._1)
val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap

// Convert documents into term count vectors
val documents: RDD[(Long, Vector)] =
    corpus.zipWithIndex.map { case (tokens, id) =>
        val counts = new mutable.HashMap[Int, Double]()
        tokens.foreach { term =>
            if (vocab.contains(term)) {
                val idx = vocab(term)
                counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
            }
        }
        (id, Vectors.sparse(vocab.size, counts.toSeq))
    }
// Set LDA parameters and train the model
val numTopics = 10
val ldaModel: DistributedLDAModel = new LDA().setK(numTopics).setMaxIterations(20).run(documents).asInstanceOf[DistributedLDAModel]

// Create the test input, convert it to term counts, and get its topic distribution
val test_input = Seq("this is my test document")
val test_document: RDD[(Long, Vector)] = sc.parallelize(test_input.map(doc => doc.split("\\s"))).zipWithIndex.map { case (tokens, id) =>
    val counts = new mutable.HashMap[Int, Double]()
    tokens.foreach { term =>
        if (vocab.contains(term)) {
            val idx = vocab(term)
            counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
        }
    }
    (id, Vectors.sparse(vocab.size, counts.toSeq))
}
println("test_document: "+test_document.first._2.toArray.mkString(", "))

// Convert to a LocalLDAModel and get the topic distribution for the test document
val localLDAModel: LocalLDAModel = ldaModel.toLocal
val topicDistributions = localLDAModel.topicDistributions(test_document)
println("first topic distribution: " + topicDistributions.first._2.toArray.mkString(", "))

【Question Discussion】:

    Tags: scala apache-spark apache-spark-mllib lda


    【Solution 1】:

    Looking at the Spark source, I noticed the following comment on the documents parameter:

       * @param documents:  
       * RDD of documents, which are term (word) count vectors paired with IDs.
       * The term count vectors are "bags of words" with a fixed-size vocabulary
       * (where the vocabulary size is the length of the vector).
       * This must use the same vocabulary (ordering of term counts) as in training.
       * Document IDs must be unique and >= 0.
    

    So the answer is yes: the new document vector should be a term count vector. Additionally, the vector ordering (the mapping of terms to indices) must be the same as the one used in training.
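
    To make this concrete, here is a minimal sketch of vectorizing a new document with the training vocabulary and querying its topic distribution. It reuses the vocab map and ldaModel from the question; newDoc, newDocRDD, and topicDist are just illustrative names:

    // Build the term count vector with the *training* vocab so indices line up with the model
    val newDoc = "this is a brand new document"
    val counts: Seq[(Int, Double)] = newDoc.split("\\s")
        .flatMap(term => vocab.get(term))   // drops out-of-vocabulary terms
        .groupBy(identity)
        .map { case (idx, occurrences) => (idx, occurrences.length.toDouble) }
        .toSeq

    // topicDistributions expects RDD[(Long, Vector)] with unique, non-negative document IDs
    val newDocRDD = sc.parallelize(Seq((0L, Vectors.sparse(vocab.size, counts))))
    val topicDist = ldaModel.toLocal.topicDistributions(newDocRDD)
    topicDist.collect().foreach { case (id, topics) =>
        println(s"doc $id: " + topics.toArray.mkString(", "))
    }

    Note that terms not seen during training simply don't appear in the vector; the vector length is always vocab.size.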

    【Discussion】:
