Posted: 2017-01-24 15:22:40
Problem description:
I am new to Apache Spark and am trying to convert data from a .csv file into LabeledPoints so I can use Apache Spark's MLlib package. I tried the code below to get an RDD of LabeledPoint data, but it turned out to produce the ML package's LabeledPoint. Now I want to build LabeledPoint data for the correct package, MLlib. Can anyone help?
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.LabeledPoint; // ml-package class: the crux of the question
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

private static String appName = "learning_RDD";
private static String master = "spark://23.195.26.187:7077"; // unused: setMaster("local[1]") below takes effect

static SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("MLPipelineSample")
        .set("spark.driver.memory", "512m").set("spark.sql.warehouse.dir", "D:\\input.txt");
static SparkContext sc = new SparkContext(sparkConf);
static SparkSession spark = SparkSession.builder().sparkContext(sc).getOrCreate();

public static void main(String[] args) throws IOException {
    Dataset<Row> trainingData = spark.read().format("com.databricks.spark.csv")
            .option("header", "true").option("inferSchema", "true")
            .load("D:\\abc\\Spark\\WebcontentClassification_UsingSparkML\\WebcontentClassification_UsingSparkML\\NaiveBayes_ML_20ErrorRate\\nutchcsvalldata.csv");

    // Feature pipeline: tokenize -> remove stop words -> hashed term frequencies -> TF-IDF
    Tokenizer tokenizer = new Tokenizer().setInputCol("content").setOutputCol("words");
    Dataset<Row> words = tokenizer.transform(trainingData);
    StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filteredwords");
    Dataset<Row> filteredwords = remover.transform(words);
    HashingTF hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("filteredwords").setOutputCol("rawfeatures");
    Dataset<Row> hashedtf_Vector = hashingTF.transform(filteredwords);
    IDF idf = new IDF().setInputCol("rawfeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(hashedtf_Vector);
    Dataset<Row> vectors = idfModel.transform(hashedtf_Vector);

    // Collect the rows on the driver and build LabeledPoints. Note: these are
    // org.apache.spark.ml.feature.LabeledPoint, not the
    // org.apache.spark.mllib.regression.LabeledPoint that MLlib expects.
    Iterator<Row> iterator = vectors.toLocalIterator();
    List<LabeledPoint> labeledpoints = new ArrayList<>();
    while (iterator.hasNext()) {
        Row r = iterator.next();
        int label = r.getAs(2);  // label column, by position
        Vector v = r.getAs(16);  // "features" column (an ml-package Vector)
        LabeledPoint labeledpoint = new LabeledPoint(label, v);
        labeledpoints.add(labeledpoint);
    }
    // Here I am supposed to convert the List into an RDD<LabeledPoint> and use the SVM algorithm
}
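One way to get from the ml-package rows above to the mllib-package LabeledPoints the question asks for is a minimal sketch like the following, assuming Spark 2.x: `org.apache.spark.mllib.linalg.Vectors.fromML` converts the ml Vector in the "features" column to an mllib Vector, the resulting list is parallelized into an RDD with a `JavaSparkContext`, and MLlib's `SVMWithSGD` is trained on it. The class name `MllibLabeledPointSketch`, the method name `trainSvm`, and the `numIterations` value are placeholders, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MllibLabeledPointSketch {

    /** Builds spark.mllib LabeledPoints from the pipeline output and trains an SVM. */
    static SVMModel trainSvm(Dataset<Row> vectors, JavaSparkContext jsc) {
        List<LabeledPoint> labeledPoints = new ArrayList<>();
        for (Row r : vectors.collectAsList()) {
            double label = ((Number) r.getAs(2)).doubleValue();
            // The "features" column holds an org.apache.spark.ml.linalg.Vector;
            // Vectors.fromML converts it to the old mllib Vector type.
            org.apache.spark.ml.linalg.Vector mlVector = r.getAs("features");
            labeledPoints.add(new LabeledPoint(label, Vectors.fromML(mlVector)));
        }
        JavaRDD<LabeledPoint> trainingRdd = jsc.parallelize(labeledPoints);
        trainingRdd.cache();      // SVMWithSGD makes multiple passes over the data
        int numIterations = 100;  // placeholder value
        return SVMWithSGD.train(trainingRdd.rdd(), numIterations);
    }
}
```

With this approach the driver-side List from the question becomes unnecessary; the rows could also stay distributed by mapping over the Dataset instead of collecting, which scales better for large data.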
Discussion:
- You should use an RDD of LabeledPoint, not a List.
- You also seem to be following an example that is meant to teach pipelines (Tokenizer, StopWordsRemover, HashingTF, IDF). Is my assumption correct?
- Yes, you are right; I intend to create an RDD from the List (see the comment on the last line of the code).
Tags: java apache-spark