[Posted]: 2016-07-24 11:25:08
[Problem description]:
I'm trying to run LDA on a Wikipedia XML dump. After getting an RDD of the raw text, I create a DataFrame and transform it through a Tokenizer, StopWordsRemover, and CountVectorizer pipeline. I intend to pass the RDD of vectors output by the CountVectorizer to OnlineLDA in MLlib. Here is my code:
// Configure an ML pipeline
RegexTokenizer tokenizer = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words");
StopWordsRemover remover = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered");
CountVectorizer cv = new CountVectorizer()
    .setVocabSize(vocabSize)
    .setInputCol("filtered")
    .setOutputCol("features");
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {tokenizer, remover, cv});

// Fit the pipeline to the documents.
PipelineModel model = pipeline.fit(fileDF);

// Pull the "features" column out as an RDD<Vector> for MLlib's LDA.
JavaRDD<Vector> countVectors = model.transform(fileDF)
    .select("features").toJavaRDD()
    .map(new Function<Row, Vector>() {
        public Vector call(Row row) throws Exception {
            Object[] arr = row.getList(0).toArray();
            double[] features = new double[arr.length];
            int i = 0;
            for (Object obj : arr) {
                features[i++] = (double) obj;
            }
            return Vectors.dense(features);
        }
    });
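For context, this is roughly how I plan to feed the resulting RDD<Vector> into MLlib's LDA with the online optimizer once the extraction works (untested sketch; numTopics is just a placeholder for the topic count I end up using):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.OnlineLDAOptimizer;
import org.apache.spark.mllib.linalg.Vector;

// Pair each document's term-count vector with an id, since LDA.run()
// expects (documentId, termCounts) pairs.
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(
    countVectors.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
            public Tuple2<Long, Vector> call(Tuple2<Vector, Long> docAndId) {
                return docAndId.swap();
            }
        }));
corpus.cache();

// numTopics is a placeholder; OnlineLDAOptimizer selects the online variational algorithm.
LDAModel ldaModel = new LDA()
    .setK(numTopics)
    .setOptimizer(new OnlineLDAOptimizer())
    .run(corpus);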
I'm getting a ClassCastException because of this line:
Object[] arr = row.getList(0).toArray();
Caused by: java.lang.ClassCastException: org.apache.spark.mllib.linalg.SparseVector cannot be cast to scala.collection.Seq
at org.apache.spark.sql.Row$class.getSeq(Row.scala:278)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:192)
at org.apache.spark.sql.Row$class.getList(Row.scala:286)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:192)
at xmlProcess.ParseXML$2.call(ParseXML.java:142)
at xmlProcess.ParseXML$2.call(ParseXML.java:1)
I found the Scala syntax for doing this here, but I can't find any example of doing it in Java. I tried row.getAs[Vector](0), but that's Scala syntax. Is there a way to do this in Java?
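Would a plain cast be the Java equivalent? This is what I'm considering (untested sketch):

JavaRDD<Vector> countVectors = model.transform(fileDF)
    .select("features").toJavaRDD()
    .map(new Function<Row, Vector>() {
        public Vector call(Row row) throws Exception {
            // the "features" column already holds a Vector, so cast it directly
            return (Vector) row.get(0);
            // or, with the generic getter: row.<Vector>getAs(0)
        }
    });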
[Discussion]:
Tags: java apache-spark spark-dataframe apache-spark-mllib