【Posted】:2014-05-22 17:21:07
【Question】:
Running example-6-llda-learn.scala as follows works fine:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
  SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
  CaseFolder() ~> // lowercase everything
  WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
  MinimumLengthFilter(3) // take terms with >=3 characters
}
val text = {
  source ~> // read from the source file
  Column(4) ~> // select column containing text
  TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
  TermCounter() ~> // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
  DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
// define fields from the dataset we are going to slice against
val labels = {
  source ~> // read from the source file
  Column(2) ~> // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~> // collect label counts
  TermMinimumDocumentCountFilter(10) // filter labels in < 10 docs
}
val dataset = LabeledLDADataset(text, labels);
// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);
// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
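But it no longer works when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
Also, the CVB0 method uses a lot of memory. Training on a corpus of 10,000 documents, each with about 10 labels, consumes 30 GB of memory.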
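For a rough sense of why CVB0 can need far more memory than Gibbs sampling, here is a back-of-the-envelope sketch. In general, CVB0 keeps a distribution over candidate topics for every token, while collapsed Gibbs sampling keeps a single topic assignment per token. How the toolbox actually lays this out in memory is not confirmed here, so the storage layouts, `tokensPerDoc`, and `distinctLabels` below are all hypothetical; only the 10,000 documents and roughly 10 labels per document come from the question. The figures show how the cost scales and are not meant to reproduce the 30 GB observation.

```scala
// Back-of-the-envelope estimate only. The storage layouts are assumptions,
// not the toolbox's confirmed internals.
object Cvb0MemoryEstimate {
  val BytesPerDouble = 8L
  val BytesPerInt = 4L

  /** CVB0-style state: one Double per (token, candidate topic) pair. */
  def cvb0Bytes(numDocs: Long, tokensPerDoc: Long, topicsPerToken: Long): Long =
    numDocs * tokensPerDoc * topicsPerToken * BytesPerDouble

  /** Collapsed-Gibbs-style state: one Int topic assignment per token. */
  def gibbsBytes(numDocs: Long, tokensPerDoc: Long): Long =
    numDocs * tokensPerDoc * BytesPerInt

  def main(args: Array[String]): Unit = {
    val numDocs = 10000L       // from the question
    val labelsPerDoc = 10L     // from the question
    val tokensPerDoc = 1000L   // hypothetical average document length
    val distinctLabels = 5000L // hypothetical size of the full label set

    // Only the labels attached to each document are candidate topics.
    val restricted = cvb0Bytes(numDocs, tokensPerDoc, labelsPerDoc)
    // Every token stores a value for every label in the corpus.
    val dense = cvb0Bytes(numDocs, tokensPerDoc, distinctLabels)
    val gibbs = gibbsBytes(numDocs, tokensPerDoc)

    println(f"CVB0, per-document labels only: ${restricted / 1e9}%.2f GB")
    println(f"CVB0, dense over all labels:    ${dense / 1e9}%.2f GB")
    println(f"Gibbs, one assignment per token: ${gibbs / 1e9}%.2f GB")
  }
}
```

Under these assumptions the per-token state for CVB0 grows with the number of candidate topics per token, whereas the Gibbs state does not, which is one plausible reason the two trainers behave so differently on the same corpus.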
【Comments】:
- I think this is a bug, because it throws a java.lang.ArrayIndexOutOfBoundsException
Tags: stanford-nlp topic-modeling