【Question title】: How to vectorize a text file in Mahout?
【Posted】: 2013-03-10 12:59:42
【Question】:

I have a text file in which each line holds a label and a tweet.

    positive,I love this car
    negative,I hate this book
    positive,Good product.

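I need to convert each line into a vector. If I use the seq2sparse command, the whole document is converted into a single vector, but I need one vector per line rather than one for the whole document. For example: key: positive, value: vector value (of the tweet). How can we achieve this in Mahout?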


/* This is what I have done */

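    // Inside the mapper: "line" is one input record such as "positive,I love this car".
    // Split on the first comma only, so commas inside the tweet are kept.
    String[] parts = line.split(",", 2);
    String label = parts[0];
    String tweetline = parts.length > 1 ? parts[1] : "";
    System.out.println("Tweetline: " + tweetline);

    // Collect the words of this tweet only (a fresh list for every line).
    List<String> featureList = new ArrayList<String>();
    StringTokenizer words = new StringTokenizer(tweetline, " ");
    while (words.hasMoreTokens()) {
        featureList.add(words.nextToken());
    }
    System.out.println("Feature List: " + featureList);

    // Hash every word into a sparse vector. Note that the cardinality used here
    // varies with the tweet length, so vectors from different lines differ in size.
    Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
    FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
    vectorEncoder.setProbes(1);
    for (String feature : featureList) {
        vectorEncoder.addToVector(feature, unclassifiedInstanceVector);
    }

    // One output record per input line: key = "/" + label, value = the tweet's vector.
    context.write(new Text("/" + label), new VectorWritable(unclassifiedInstanceVector));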

Thanks in advance.

【Comments on the question】:

    Tags: java vectorization mahout bigdata


    【Solution 1】:

    You can use SequenceFile.Writer to write the vectors to a path on your application's HDFS, appending one key/value pair per line:

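               // FS (a Hadoop FileSystem) and conf (its Configuration) are assumed to be declared elsewhere.
               FS = FileSystem.get(HBaseConfiguration.create());
               String newPath = "/foo/mahouttest/part-r-00000";
               Path newPathFile = new Path(newPath);
               Text key = new Text();
               VectorWritable value = new VectorWritable();
               SequenceFile.Writer writer = SequenceFile.createWriter(FS, conf, newPathFile,
                    key.getClass(), value.getClass());
                     .....
               // For each input line: the key carries the label, the value carries that line's vector.
               key.set("c/" + label);
               value.set(unclassifiedInstanceVector);
               writer.append(key, value);
               // Close the writer once every line has been appended.
               writer.close();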
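
The per-line idea (one label key and one vector value for every input line) can be tried out without Hadoop or Mahout on the classpath. The sketch below is only an illustration under stated assumptions, not Mahout's API: the class `LineVectorizer`, the nested `LabeledVector` type, the `CARDINALITY` constant and the use of `String.hashCode()` as the hash are all hypothetical names and choices made for this example. Mahout's own encoders use a different hash and weighting.

```java
import java.util.Map;
import java.util.TreeMap;

/** Dependency-free sketch: one labelled sparse vector per input line (no Mahout classes). */
final class LineVectorizer {

    /** Hypothetical fixed size shared by every vector, so all lines live in the same space. */
    static final int CARDINALITY = 1 << 10;

    /** A (label, sparse vector) pair, standing in for the Text / VectorWritable pair. */
    static final class LabeledVector {
        final String label;
        final Map<Integer, Double> weights; // index -> weight, non-zero entries only

        LabeledVector(String label, Map<Integer, Double> weights) {
            this.label = label;
            this.weights = weights;
        }
    }

    /** Encodes one "label,text" line; splits on the first comma only. */
    static LabeledVector encodeLine(String line) {
        String[] parts = line.split(",", 2);
        String label = parts[0].trim();
        String text = parts.length > 1 ? parts[1].trim() : "";

        Map<Integer, Double> weights = new TreeMap<>();
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty tweet yields a single empty token
            }
            // Hashing trick: map the word to a slot in [0, CARDINALITY).
            int index = Math.floorMod(word.hashCode(), CARDINALITY);
            weights.merge(index, 1.0, Double::sum);
        }
        return new LabeledVector(label, weights);
    }

    public static void main(String[] args) {
        String[] lines = {
            "positive,I love this car",
            "negative,I hate this book",
            "positive,Good product."
        };
        for (String line : lines) {
            LabeledVector v = encodeLine(line);
            // One record per line: key = "/" + label, value = sparse vector.
            System.out.println("/" + v.label + " -> " + v.weights);
        }
    }
}
```

Because every line is hashed into the same fixed-size space, the same word lands on the same index in every vector. In a real job, each `LabeledVector` would correspond to one `writer.append(key, value)` call.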
    

    【Comments】:
