【Question title】:Labeled LDA learn in Stanford Topic Modeling Toolbox
【Posted】:2014-05-22 17:21:07
【Question】:

Running example-6-llda-learn.scala as follows works fine:

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);

But it fails when I change the last line from: TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000); to: TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);

Also, the CVB0 method uses a lot of memory. I trained on a corpus of 10,000 documents, each with about 10 labels, and it consumed 30 GB of memory.

【Comments】:

  • I think this is a bug, because it throws a java.lang.ArrayIndexOutOfBoundsException

Tags: stanford-nlp topic-modeling


【Answer 1】:

I ran into the same problem, and I do think it is a bug. Check GibbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, starting at line 204:

val z = doc.labels(zI);

val pZ = (doc.theta(z)+topicSmoothing(z)) *
         (countTopicTerm(z)(term)+termSmooth) /
         (countTopic(z)+termSmoothDenom);

doc.labels is self-explanatory. doc.theta records the distribution (counts, actually) over the document's labels, and it has the same size as doc.labels.

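zI is the index variable that iterates over doc.labels, while the value z is the actual label number. Here is the problem: a document may have only one label, say 1000, so zI is 0 and z is 1000, and doc.theta(z) is then out of range.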

I think the fix is to change doc.theta(z) to doc.theta(zI).
(I am still checking whether the results make sense. In any case, this bug makes me less confident in the toolbox.)
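
To make the indexing problem concrete, here is a minimal standalone sketch in plain Scala. It does not use the toolbox: Doc, thetaByLabelId and thetaByPosition are hypothetical names invented for illustration, and only the theta lookup from the quoted lines is modeled, under the assumption stated above that theta has one entry per element of labels.

```scala
// Hypothetical stand-in for the toolbox's per-document state (assumption):
// `labels` holds global label ids, `theta` holds one count per entry of `labels`.
final case class Doc(labels: Array[Int], theta: Array[Double])

object ThetaIndexSketch {
  // Mirrors the quoted lookup: theta is indexed by the global label id z.
  def thetaByLabelId(doc: Doc, zI: Int): Double = {
    val z = doc.labels(zI)
    doc.theta(z)
  }

  // The change proposed above: theta is indexed by the local position zI.
  def thetaByPosition(doc: Doc, zI: Int): Double =
    doc.theta(zI)

  def main(args: Array[String]): Unit = {
    // One document with a single label whose id is 1000.
    val doc = Doc(labels = Array(1000), theta = Array(3.0))

    // Position 0 is a valid index into the one-element theta array.
    println(thetaByPosition(doc, 0))

    // Label id 1000 is not a valid index into theta, so the lookup throws.
    val failed =
      try { thetaByLabelId(doc, 0); false }
      catch { case _: ArrayIndexOutOfBoundsException => true }
    println(failed)
  }
}
```

This only shows why the lookup goes out of range when label ids are larger than the number of labels on a document. Whether the other terms in the expression should keep using z is not something the sketch checks.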
